Topics
Data Warehousing : Multidimensional Models; OLAP - Introduction, Characteristics, Architecture, Multidimensional view Efficient processing of OLAP Queries, OLAP server Architecture ROLAP versus MOLAP Versus HOLAP and data cube, Data cube operations, data cube computation.
Data mining: What is data mining, Challenges, Data Mining Tasks, Data: Types of Data, Data Quality, Data Pre-processing, Measures of Similarity and Dissimilarity
Data Warehouse
Introduction
- Definition : Compiling/Pooling/Aggregating data from multiple relevant sources while ensuring data quality, consistency, and accuracy. Kind of like supporting management functions (cleaning, integration, consolidation)
- Techniques : Data Marts, Metadata, OLAP, ETL, OLTP
- Process : Source 🡪 Extract 🡪Transform 🡪 Load 🡪 Target.
OLAP
- OnLine Analytical Processing Server.
- Provide multidimensional data analysis and complex data modelling from multiple database systems at the same time.
- Employed in business intelligence (BI) applications for trends analysis, forecasting, and planning.
- OLAP databases consists of one or more cubes, known as Hyper-cubes.
Operations
Mnemonic - Drill-Roll Dice-Slice Pivot
Drill down: Less detailed data is converted into highly detailed data.
- By moving down the concept hierarchy.
- Thus, adding a new dimension.
Roll up (Aggregation): Opposite of the drill-down operation, performs aggregation.
- By Climbing up in the concept hierarchy.
- Reducing the dimensions.
Dice: It selects a sub-cube(two or more dimensions) from a cube. Gives an overview of subsection.
Slice: Selects a single dimension from the OLAP cube which results in a new sub-cube creation.
Pivot: Also called rotation operation as it rotates the current view/axis to get a new/alternate view.
ROLAP, HOLAP, MOLAP
Differ mainly in terms of data storage and transaction technique.
MOLAP is a multi-dimensional storage mode, while ROLAP is a . HOLAP is a combination of multi-dimensional and relational elements.
M(OLAP) : multidimensional storage model, MDDBs databases are used to store data. Sparse Matrix technique. User pre-calculated data cubes.
ROLAP : Relational mode of storage. Relational Table, SQL. Do not use pre-calculated data cubes.
HOLAP : ROLAP + MOLAP. Relational tables to hold the larger/detailed data, and rest in multidimensional cubes.
Data Warehouse Architecture
Types
- Single-tier architecture minimize the amount of data stored. This goal is to remove data redundancy.
- Two-tier architecture separates physically available sources and data warehouse. Can’t support end users and have network limitations
- Three-tier architecture of the Top, Middle and Bottom Tier.
Design
Top-Down - starts with the overall design and planning
Bottom-Up - starts with experiments and prototypes.
Combined - Exploit the existing business solution
Three Tier
Tier-1/Bottom Tier RDBMS, Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.
Tier-2/Middle Tier OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP. OLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations. A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
Tier-3/Front-End Client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on)
Data Mining
Similarity & Dissimilarity
Similarity measures how data samples are related or closed to each other.
Dissimilarity measures how much the data objects are distinct.
- Analysing large databases to identify patterns, relationships (correlations) to solve business problems and to get useful insights.
- Techniques : K-Means, Support Vector
- Process : Pdam Vi
Difference
Warehouse | Mining |
---|---|
Store | Analyze |
Pool | Process |
Integration for easier reporting | AI, ML, Stats |
CRM, adds value, Manufacturing, Banking, Retail | Customer purchasing behavior, Market/Trend Analysis, Fraud Detection |
Relationship
Data Mart
-
Definition : Data mart is a subset of a data warehouse oriented to a specific business line, contains summarized data repositories of specific section or unit within an organization, for example, the sales department. Whereas a data warehouse is a large centralized repositories of many data marts. Kinda like departmental data.
-
Sources : RDBMS, Flat Files, Servers, Application
-
Examples : Purchasing, Inventory, Sales, Finance
-
Adv : Easy access to frequent data, improved end user response time, Low Cost
-
Dis : Pre-Processing Difficult, Inconsistency, Performance
Database vs Warehouse
Database : Systematic way of storing and retrieving data, primarily to perform query processing, generally referred to as online transaction processing system. These systems are used in day to day operations of an organization.
OLTP, CRUD Transactions, Single Source, Normalized
Data Warehouse :
OLAP, Complex Analytical Queries, Multiple Source, Denormalized
OLAP vs OLTP
- Both are online processing systems that help turn data into information.
- OLTP deals with data transactions, while OLAP deals with data analytics.
- The main idea is to use the two processes to form a stable data warehouse architecture.
OnLine Transaction Processing - For frequent, fast, integrity transactions for simple business processes.
OnLine Analytical Processing - Less frequent, Multidimensional Complex Queries,
Difference
OLTP | OLAP |
---|---|
stands for | stands for |
INSERT, UPDATE, DELETE | SELECT |
Modifies and writes data often | Queries data, rarely writes |
Transactions | Analytics |
Normalized, Traditional | Unnormalized |
Data Mining Architecture
- refers to the detection and extraction of new patterns from the already collected data.
- Explain in short then each point
- Data Sources: plain text, spreadsheets, flat files
- Database Server: actual data ready to be processed
- Data Mining Engine: ore components of the data mining architecture that performs all kinds of data mining techniques like association, classification, characterization, clustering, prediction, etc.
- Pattern Evaluation Modules: finding interesting patterns
- Graphic User Interface: to communicate effectively with user.
- Knowledge Base: To make result more accurate and reliable.
- Types : No/Loose/Semi-Tight/Tight Coupling
Comments
Post a Comment