Unit 1 - Warehouse & Mining

Unit 1 - Warehouse & Mining

Topics

Data Warehousing : Multidimensional Models; OLAP - Introduction, Characteristics, Architecture, Multidimensional view Efficient processing of OLAP Queries, OLAP server Architecture ROLAP versus MOLAP Versus HOLAP and data cube, Data cube operations, data cube computation.

Data mining: What is data mining, Challenges, Data Mining Tasks, Data: Types of Data, Data Quality, Data Pre-processing, Measures of Similarity and Dissimilarity

Data Warehouse

Introduction

  • Definition : Compiling/Pooling/Aggregating data from multiple relevant sources while ensuring data quality, consistency, and accuracy. Kind of like supporting management functions (cleaning, integration, consolidation)
  • Techniques : Data Marts, Metadata, OLAP, ETL, OLTP
  • Process : Source 🡪 Extract 🡪Transform 🡪 Load 🡪 Target.
  • enter image description here

OLAP

  • OnLine Analytical Processing Server.
  • Provide multidimensional data analysis and complex data modelling from multiple database systems at the same time.
  • Employed in business intelligence (BI) applications for trends analysis, forecasting, and planning.
  • OLAP databases consists of one or more cubes, known as Hyper-cubes.

enter image description here

Operations

Mnemonic - Drill-Roll Dice-Slice Pivot

Drill down: Less detailed data is converted into highly detailed data.

  • By moving down the concept hierarchy.
  • Thus, adding a new dimension.
  • enter image description here

Roll up (Aggregation): Opposite of the drill-down operation, performs aggregation.

  • By Climbing up in the concept hierarchy.
  • Reducing the dimensions.
  • enter image description here

Dice: It selects a sub-cube(two or more dimensions) from a cube. Gives an overview of subsection.
enter image description here

Slice: Selects a single dimension from the OLAP cube which results in a new sub-cube creation.
enter image description here

Pivot: Also called rotation operation as it rotates the current view/axis to get a new/alternate view.

enter image description here

ROLAP, HOLAP, MOLAP

Differ mainly in terms of data storage and transaction technique.

MOLAP is a multi-dimensional storage mode, while ROLAP is a . HOLAP is a combination of multi-dimensional and relational elements.

M(OLAP) : multidimensional storage model, MDDBs databases are used to store data. Sparse Matrix technique. User pre-calculated data cubes.

ROLAP : Relational mode of storage. Relational Table, SQL. Do not use pre-calculated data cubes.

HOLAP : ROLAP + MOLAP. Relational tables to hold the larger/detailed data, and rest in multidimensional cubes.

Data Warehouse Architecture

Types

  • Single-tier architecture minimize the amount of data stored. This goal is to remove data redundancy.
  • Two-tier architecture separates physically available sources and data warehouse. Can’t support end users and have network limitations
  • Three-tier architecture of the Top, Middle and Bottom Tier.

Design

Top-Down - starts with the overall design and planning
Bottom-Up - starts with experiments and prototypes.
Combined - Exploit the existing business solution

Three Tier

Tier-1/Bottom Tier RDBMS, Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.

Tier-2/Middle Tier OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP. OLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations. A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.

Tier-3/Front-End Client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on)

enter image description here

enter image description here

Data Mining

Similarity & Dissimilarity

Similarity measures how data samples are related or closed to each other.

Dissimilarity measures how much the data objects are distinct.

  • Analysing large databases to identify patterns, relationships (correlations) to solve business problems and to get useful insights.
  • Techniques : K-Means, Support Vector
  • Process : Pdam Vi
  • enter image description here

Difference

Warehouse Mining
Store Analyze
Pool Process
Integration for easier reporting AI, ML, Stats
CRM, adds value, Manufacturing, Banking, Retail Customer purchasing behavior, Market/Trend Analysis, Fraud Detection

Relationship

enter image description here

Data Mart

  • Definition : Data mart is a subset of a data warehouse oriented to a specific business line, contains summarized data repositories of specific section or unit within an organization, for example, the sales department. Whereas a data warehouse is a large centralized repositories of many data marts. Kinda like departmental data.

  • Sources : RDBMS, Flat Files, Servers, Application

  • Examples : Purchasing, Inventory, Sales, Finance

  • enter image description here

  • Adv : Easy access to frequent data, improved end user response time, Low Cost

  • Dis : Pre-Processing Difficult, Inconsistency, Performance

Database vs Warehouse

Database : Systematic way of storing and retrieving data, primarily to perform query processing, generally referred to as online transaction processing system. These systems are used in day to day operations of an organization.

OLTP, CRUD Transactions, Single Source, Normalized

Data Warehouse :

OLAP, Complex Analytical Queries, Multiple Source, Denormalized

OLAP vs OLTP

  • Both are online processing systems that help turn data into information.
  • OLTP deals with data transactions, while OLAP deals with data analytics.
  • The main idea is to use the two processes to form a stable data warehouse architecture.

OnLine Transaction Processing - For frequent, fast, integrity transactions for simple business processes.

OnLine Analytical Processing - Less frequent, Multidimensional Complex Queries,

enter image description here

Difference

OLTP OLAP
stands for stands for
INSERT, UPDATE, DELETE SELECT
Modifies and writes data often Queries data, rarely writes
Transactions Analytics
Normalized, Traditional Unnormalized

Data Mining Architecture

  • refers to the detection and extraction of new patterns from the already collected data.

enter image description here

  • Explain in short then each point
  • Data Sources: plain text, spreadsheets, flat files
  • Database Server: actual data ready to be processed
  • Data Mining Engine: ore components of the data mining architecture that performs all kinds of data mining techniques like association, classification, characterization, clustering, prediction, etc.
  • Pattern Evaluation Modules: finding interesting patterns
  • Graphic User Interface: to communicate effectively with user.
  • Knowledge Base: To make result more accurate and reliable.
  • Types : No/Loose/Semi-Tight/Tight Coupling

Comments