Unit 1 - Warehouse & Mining

Topics

Data Warehousing : Multidimensional Models; OLAP - Introduction, Characteristics, Architecture, Multidimensional view Efficient processing of OLAP Queries, OLAP server Architecture ROLAP versus MOLAP Versus HOLAP and data cube, Data cube operations, data cube computation.

Data mining: What is data mining, Challenges, Data Mining Tasks, Data: Types of Data, Data Quality, Data Pre-processing, Measures of Similarity and Dissimilarity

Data Warehouse

Introduction

Definition : Compiling/Pooling/Aggregating data from multiple relevant sources while ensuring data quality, consistency, and accuracy. Kind of like supporting management functions (cleaning, integration, consolidation)
Techniques : Data Marts, Metadata, OLAP, ETL, OLTP
Process : Source 🡪 Extract 🡪Transform 🡪 Load 🡪 Target.

OLAP

OnLine Analytical Processing Server.
Provide multidimensional data analysis and complex data modelling from multiple database systems at the same time.
Employed in business intelligence (BI) applications for trends analysis, forecasting, and planning.
OLAP databases consists of one or more cubes, known as Hyper-cubes.

enter image description here

Operations

Mnemonic - Drill-Roll Dice-Slice Pivot

Drill down: Less detailed data is converted into highly detailed data.

By moving down the concept hierarchy.
Thus, adding a new dimension.

Roll up (Aggregation): Opposite of the drill-down operation, performs aggregation.

By Climbing up in the concept hierarchy.
Reducing the dimensions.

Dice: It selects a sub-cube(two or more dimensions) from a cube. Gives an overview of subsection.
enter image description here

Slice: Selects a single dimension from the OLAP cube which results in a new sub-cube creation.
enter image description here

Pivot: Also called rotation operation as it rotates the current view/axis to get a new/alternate view.

enter image description here

ROLAP, HOLAP, MOLAP

Differ mainly in terms of data storage and transaction technique.

MOLAP is a multi-dimensional storage mode, while ROLAP is a . HOLAP is a combination of multi-dimensional and relational elements.

M(OLAP) : multidimensional storage model, MDDBs databases are used to store data. Sparse Matrix technique. User pre-calculated data cubes.

ROLAP : Relational mode of storage. Relational Table, SQL. Do not use pre-calculated data cubes.

HOLAP : ROLAP + MOLAP. Relational tables to hold the larger/detailed data, and rest in multidimensional cubes.

Data Warehouse Architecture

Types

Single-tier architecture minimize the amount of data stored. This goal is to remove data redundancy.
Two-tier architecture separates physically available sources and data warehouse. Can’t support end users and have network limitations
Three-tier architecture of the Top, Middle and Bottom Tier.

Design

Top-Down - starts with the overall design and planning
Bottom-Up - starts with experiments and prototypes.
Combined - Exploit the existing business solution

Three Tier

Tier-1/Bottom Tier RDBMS, Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.

Tier-2/Middle Tier OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP. OLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations. A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.

Tier-3/Front-End Client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on)

enter image description here

Data Mining

Similarity & Dissimilarity

Similarity measures how data samples are related or closed to each other.

Dissimilarity measures how much the data objects are distinct.

Analysing large databases to identify patterns, relationships (correlations) to solve business problems and to get useful insights.
Techniques : K-Means, Support Vector
Process : Pdam Vi

Difference

Warehouse	Mining
Store	Analyze
Pool	Process
Integration for easier reporting	AI, ML, Stats
CRM, adds value, Manufacturing, Banking, Retail	Customer purchasing behavior, Market/Trend Analysis, Fraud Detection

Relationship

enter image description here

Data Mart

Definition : Data mart is a subset of a data warehouse oriented to a specific business line, contains summarized data repositories of specific section or unit within an organization, for example, the sales department. Whereas a data warehouse is a large centralized repositories of many data marts. Kinda like departmental data.
Sources : RDBMS, Flat Files, Servers, Application
Examples : Purchasing, Inventory, Sales, Finance
Adv : Easy access to frequent data, improved end user response time, Low Cost
Dis : Pre-Processing Difficult, Inconsistency, Performance

Database vs Warehouse

Database : Systematic way of storing and retrieving data, primarily to perform query processing, generally referred to as online transaction processing system. These systems are used in day to day operations of an organization.

OLTP, CRUD Transactions, Single Source, Normalized

Data Warehouse :

OLAP, Complex Analytical Queries, Multiple Source, Denormalized

OLAP vs OLTP

Both are online processing systems that help turn data into information.
OLTP deals with data transactions, while OLAP deals with data analytics.
The main idea is to use the two processes to form a stable data warehouse architecture.

OnLine Transaction Processing - For frequent, fast, integrity transactions for simple business processes.

OnLine Analytical Processing - Less frequent, Multidimensional Complex Queries,

enter image description here

Difference

OLTP	OLAP
stands for	stands for
INSERT, UPDATE, DELETE	SELECT
Modifies and writes data often	Queries data, rarely writes
Transactions	Analytics
Normalized, Traditional	Unnormalized

Data Mining Architecture

refers to the detection and extraction of new patterns from the already collected data.

enter image description here

Explain in short then each point
Data Sources: plain text, spreadsheets, flat files
Database Server: actual data ready to be processed
Data Mining Engine: ore components of the data mining architecture that performs all kinds of data mining techniques like association, classification, characterization, clustering, prediction, etc.
Pattern Evaluation Modules: finding interesting patterns
Graphic User Interface: to communicate effectively with user.
Knowledge Base: To make result more accurate and reliable.
Types : No/Loose/Semi-Tight/Tight Coupling

Notes De Yaar

Search This Blog