Lecture 1
Dr. Fawad Hussain
GIK Institute
Fall 2015
Data Warehousing and Mining
(CS437)
Some lectures in this course have been partially adapted from a lecture series by Stephen A. Brobst, Chief Technology Officer at Teradata and
professor at MIT.
General Course Description
Data Warehousing
What is the motivation behind Data Warehousing and Mining?
Advanced Indexing, Query Processing and Optimization.
Building Data Warehouses.
Data Cubes, OLAP, De-Normalization, etc.
Data Mining Techniques
Regression
Clustering
Decision Trees
Other Information
Office Hours (posted on the office door)
Office: G03 (FCSE)
Course TA (Mr. Bilal)
Text Books (Optional)
Introduction to Data Mining; Tan, Steinbach & Kumar.
Data Mining: Concepts and Techniques by Jiawei Han and
Micheline Kamber, Morgan Kaufmann Publishers, 2nd Edition,
March 2006, ISBN 1-55860-901-6.
Building a Data Warehouse for Decision Support by Vidette
Poe.
Fundamentals of Database Systems by Elmasri and Navathe,
Addison-Wesley, 5th Edition, 2007.
Grading Plan
Component       Weight (%)   Tentative Number(s)
Midterm Exam    25           01
Quizzes         10           06
Project         20           02
Final Exam      45           01
Tentative Schedule
Lecture 1
Introduction and Overview
Why this Course?
The world is changing (actually changed), either change or be
left behind.
Missing the opportunities or going in the wrong direction has
prevented us from growing.
What is the right direction?
Harnessing the data, in a knowledge driven economy.
The Need
Knowledge is power, Intelligence is
absolute power!
“Drowning in data and starving for
information”
Data Processing Steps
DATA
INFORMATION
POWER
INTELLIGENCE
$
End goal?
Historical Overview
1960: Master Files & Reports
1965: Lots of Master Files!
1970: Direct Access Memory & DBMS
1975: Online high performance transaction processing
1980: PCs and 4GL Technology (MIS/DSS)
Post 1990: Data Warehousing and Data Mining
Crisis of Credibility
What is the financial health of our company?
-10%
+10%
??
Why a Data Warehouse?
Data recording and storage are growing.
History is an excellent predictor of the future.
Gives a total view of the organization.
Intelligent decision support is required for decision-making.
Why Data Warehouse?
The size of data sets is going up ↑.
The cost of data storage is coming down ↓.
The amount of data the average business collects and stores
is doubling every year.
Total hardware and software cost to store and manage 1
Mbyte of data:
1990: ~ $15
2002: ~ 15¢ (down 100 times)
By 2007: < 1¢ (down more than 1,500 times from 1990)
Why Data Warehouse?
A Few Examples
Walmart: 24 TB
France Telecom: ~ 100 TB
CERN: Up to 20 PB by 2006
Stanford Linear Accelerator Center (SLAC): 500 TB
Businesses demand Intelligence (BI).
Complex questions from integrated data.
“Intelligent Enterprise”
List of all items that were sold last month?
List of all items purchased by X?
The total sales of the last month grouped by branch?
How many sales transactions occurred during the month of
January?
DBMS Approach
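Questions of this first kind translate directly into SQL, which is why a conventional DBMS handles them well. A minimal sketch using Python's built-in sqlite3 module follows; the `sales` table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical schema and data, not taken from the lecture.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id  INTEGER PRIMARY KEY,
        branch   TEXT,
        customer TEXT,
        item     TEXT,
        amount   REAL,
        sale_dt  TEXT
    )
""")
conn.executemany(
    "INSERT INTO sales (branch, customer, item, amount, sale_dt) VALUES (?, ?, ?, ?, ?)",
    [
        ("Lahore", "X", "milk", 2.0, "2015-01-05"),
        ("Lahore", "Y", "bread", 1.5, "2015-01-17"),
        ("Topi", "X", "eggs", 3.0, "2015-01-20"),
        ("Topi", "Z", "milk", 2.0, "2015-02-02"),
    ],
)

january = "sale_dt BETWEEN '2015-01-01' AND '2015-01-31'"

# List of all items purchased by X.
items_for_x = [row[0] for row in conn.execute(
    "SELECT DISTINCT item FROM sales WHERE customer = ? ORDER BY item", ("X",))]

# Total sales for the month, grouped by branch.
sales_by_branch = dict(conn.execute(
    f"SELECT branch, SUM(amount) FROM sales WHERE {january} GROUP BY branch"))

# Number of sales transactions during January.
january_count = conn.execute(
    f"SELECT COUNT(*) FROM sales WHERE {january}").fetchone()[0]

print(items_for_x, sales_by_branch, january_count)
```

Each question has a known shape before the query is written. The "Intelligent Enterprise" questions that follow do not, which is the gap data mining is meant to fill.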
Which items sell together? Which items to stock?
Where and how to place the items? What
discounts to offer?
How best to target customers to increase sales at
a branch?
Which customers are most likely to respond to
my next promotional campaign, and why?
Intelligent Enterprise
What is a Data Warehouse?
A complete repository of historical corporate data extracted
from transaction systems that is available for ad-hoc access by
knowledge workers.
What is Data Mining?
“There are things that we know that we know…
there are things that we know that we don’t know…
there are things that we don’t know we don’t know.”
Donald Rumsfeld
Former US Secretary of Defense
What is Data Mining?
Tell me something that I should know.
When you don’t know what you should be knowing,
how do you write SQL?
You can't!
What is Data Mining?
Knowledge Discovery in Databases (KDD).
Data mining digs out valuable, non-trivial information from large,
multidimensional, apparently unrelated databases (data sets).
It’s the integration of business knowledge, people, information,
algorithms, statistics and computing technology.
Discovering useful hidden patterns and relationships in data.
HUGE VOLUME: THERE IS WAY TOO MUCH
DATA & GROWING!
Data is collected much faster than it can be processed or
managed. NASA's Earth Observing System (EOS) alone
will collect 15 petabytes by 2007
(15,000,000,000,000,000 bytes).
• Much of which won't be used - ever!
• Much of which won't be seen - ever!
• Why not?
There's so much volume that the usefulness of some of it will never
be discovered.
SOLUTION: Reduce the volume and/or raise the
information content by structuring, querying, filtering,
summarizing, aggregating, mining...
Requires solution of fundamentally new
problems
1. developing algorithms and systems to mine large, massive
and high dimensional data sets;
2. developing algorithms and systems to mine new types of
data (images, music, videos);
3. developing algorithms, protocols, and other infrastructure
to mine distributed data;
4. improving the ease of use of data mining systems; and
5. developing appropriate privacy and security techniques
for data mining.
Future of Data Mining
10 Hottest Jobs of year 2025
TIME Magazine, 22 May 2000
10 emerging areas of technology
MIT’s Magazine of Technology Review,
Jan/Feb 2001
Data Mining
[Figure: data mining at the confluence of machine learning, database technology, statistics, visualization, information science, and other disciplines]
Logical and Physical Database Design
Data Mining is one step of Knowledge
Discovery in Databases (KDD)
Raw Data
→ Preprocessing (Extraction, Transformation, Cleansing, Validation)
→ Clean Data
→ Data Mining (Identify Patterns, Create Models)
→ Interpretation/Evaluation (Visualization, Feature Extraction, Analysis)
→ Knowledge ($ $ $)
Information Evolution in a Data
Warehouse Environment
STAGE 1: REPORT. WHAT happened? Primarily batch.
STAGE 2: ANALYZE. WHY did it happen? Increase in ad hoc queries.
STAGE 3: PREDICT. What WILL happen? Analytical modeling grows.
STAGE 4: OPERATIONALIZE. What IS happening? Continuous update and time-sensitive queries become important.
STAGE 5: ACTIVATE. What do you WANT to happen? Event-based triggering takes hold.
Normalization and Denormalization
Normalization
A relational database relates subsets of a dataset to each other.
A dataset is a set of tables (or schema in Oracle)
A table defines the structure and contains the row and column data for each
subset.
Tables are related to each other by linking them based on common items and
values between two tables.
Normalization is the optimization of record keeping for insertion, deletion,
and update (in addition to selection, of course).
De-normalization
Why denormalize?
When to denormalize
How to denormalize
Why De-normalization?
Do you have performance problems?
If not, then you shouldn’t be studying this course!
The root cause of 99% of database performance problems is
poorly written SQL code.
Usually as a result of poorly optimized underlying structure
Do you have disk storage problems?
Consider separating large, less used datasets and frequently used
datasets.
When to Denormalize?
Denormalization sometimes implies the undoing of some of the
steps of Normalization
Denormalization is not necessarily the reverse of the steps of
Normalization.
Denormalization does not imply complete removal of specific
Normal Form levels.
Denormalization results in duplication.
It is quite possible that table structure is much too granular or possibly even
incompatible with structure imposed by applications.
Denormalization usually involves merging of multiple
transactional tables or multiple static tables into single
When to Denormalize?
Look for one-to-one relationships.
These may be unnecessary if the required removal of null values
causes costly joins. Disk space is cheap. Complex SQL join statements
can destroy performance.
Do you have many-to-many join resolution entities? Are they all
necessary? Are they all used by applications?
When constructing SQL statement joins are you finding many
tables in joins where those tables are scattered throughout the
entity relationship diagram?
When searching for static data items such as customer details are
you querying a single or multiple tables?
A single table is much more efficient than multiple tables.
How to Denormalize?
Common Forms of Denormalization
Pre-join de-normalization.
Column replication or movement.
Pre-aggregation.
Considerations in Assessing
De-normalization
Performance implications
Storage implications
Ease-of-use implications
Maintenance implications
Most commonly missed/disregarded.
Pre-join Denormalization
Take tables which are frequently joined and “glue” them together
into a single table.
Avoids performance impact of the frequent joins.
Typically increases storage requirements.
Pre-join Denormalization
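A simplified retail example...
Before denormalization:
Sales header: sale_id | store_id | sale_dt | …
Sales detail: tx_id | sale_id | item_id | … | item_qty | sale$
Relationship: one sales header row (1) to many sales detail rows (m), linked on sale_id.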
Pre-join Denormalization
A simplified retail example...
After denormalization:
tx_id | sale_id | store_id | sale_dt | item_id | … | item_qty | $
Points to Ponder
Which Normal Form is being violated?
Will there be maintenance issues?
Pre-join Denormalization
Storage implications...
Assume 1:3 record count ratio between sales header and detail.
Assume 1 billion sales (3 billion sales detail).
Assume 8 byte sales_id.
Assume 30 byte header and 40 byte detail records.
Which businesses will be most hurt, in terms of storage capacity, by
this form of denormalization?
Pre-join Denormalization
Storage implications...
Before denormalization: 150 GB raw data.
After denormalization: 186 GB raw data.
Net result is 24% increase in raw data size for the database.
Pre-join may actually result in space saving, if many concurrent queries are
demanding frequent joins on the joined tables! HOW?
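The storage figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch under one assumption that the slide's numbers imply but do not state: the 8-byte sales_id appears in both the header and the detail record, so it is stored only once in each pre-joined row. The variable names are illustrative.

```python
GB = 10**9  # decimal gigabytes, matching the round figures on the slide

headers = 1_000_000_000   # sales header rows
details = 3 * headers     # 1:3 header-to-detail ratio
header_bytes = 30
detail_bytes = 40
key_bytes = 8             # sales_id, assumed present in both record types

# Before: two separate tables.
before = headers * header_bytes + details * detail_bytes

# After: every detail row also carries the header columns,
# minus the sales_id it already had.
after = details * (detail_bytes + header_bytes - key_bytes)

before_gb = before / GB
after_gb = after / GB
increase_pct = 100 * (after - before) / before

print(before_gb, after_gb, increase_pct)
```

The growth comes entirely from repeating the 22 non-key header bytes on each of the three detail rows per sale, so businesses with many detail rows per header are hit hardest.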
Pre-join Denormalization
Sample Query:
What was my total $ volume between Thanksgiving and Christmas in
1999?
Pre-join Denormalization
Before de-normalization:
select sum(sales_detail.sale_amt)
from sales
,sales_detail
where sales.sales_id =
sales_detail.sales_id
and sales.sales_dt between '1999-11-26'
and '1999-12-25'
;
Pre-join Denormalization
After de-normalization:
select sum(d_sales_detail.sale_amt)
from d_sales_detail
where d_sales_detail.sales_dt between '1999-11-26'
and '1999-12-25'
;
No join operation performed.
How to compare performance?
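To check that the two queries above answer the same question, they can be run side by side on a tiny data set. The sketch below uses Python's built-in sqlite3 module; the table and column names follow the slides' queries, while the sample rows and the way `d_sales_detail` is built are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, store_id INTEGER, sales_dt TEXT);
    CREATE TABLE sales_detail (tx_id INTEGER PRIMARY KEY, sales_id INTEGER,
                               item_id INTEGER, sale_amt REAL);
    INSERT INTO sales VALUES (1, 10, '1999-11-20'), (2, 10, '1999-11-27'),
                             (3, 11, '1999-12-24');
    INSERT INTO sales_detail VALUES (1, 1, 100, 5.0), (2, 2, 100, 10.0),
                                    (3, 2, 101, 20.0), (4, 3, 102, 7.5);
    -- Pre-join: glue the header columns onto every detail row.
    CREATE TABLE d_sales_detail AS
        SELECT d.tx_id, d.sales_id, s.store_id, s.sales_dt, d.item_id, d.sale_amt
        FROM sales_detail AS d JOIN sales AS s ON s.sales_id = d.sales_id;
""")

# Before denormalization: the date lives on the header, so a join is needed.
total_with_join = conn.execute("""
    SELECT SUM(sales_detail.sale_amt)
    FROM sales JOIN sales_detail ON sales.sales_id = sales_detail.sales_id
    WHERE sales.sales_dt BETWEEN '1999-11-26' AND '1999-12-25'
""").fetchone()[0]

# After denormalization: a single-table scan.
total_prejoined = conn.execute("""
    SELECT SUM(sale_amt)
    FROM d_sales_detail
    WHERE sales_dt BETWEEN '1999-11-26' AND '1999-12-25'
""").fetchone()[0]

print(total_with_join, total_prejoined)
```

This only confirms that the answers agree. Comparing performance would mean inspecting the optimizer's plans and measuring run times on realistic volumes, which a toy in-memory database cannot show.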
Pre-join Denormalization
But consider the question...
How many sales (transactions) did I make between Thanksgiving and
Christmas in 1999?
Pre-join Denormalization
Before denormalization:
select count(*)
from sales
where sales.sales_dt between '1999-11-26' and
'1999-12-25';
After denormalization:
select count(distinct d_sales_detail.sales_id)
from d_sales_detail
where d_sales_detail.sales_dt between '1999-11-26'
and '1999-12-25';
Which query will perform better?
Pre-join Denormalization
Performance implications...
Performance penalty for count distinct (forces sort) can be quite large.
May be worth 30 GB overhead to keep sales header records if this is a common
query structure because both ease-of-use and performance will be enhanced (at
some cost in storage)?
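The reason the denormalized query needs COUNT(DISTINCT ...) can be seen on a small example. This is a sketch using Python's built-in sqlite3 module, with table and column names taken from the slides' queries and hypothetical sample rows; `scalar` is a helper introduced here for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, sales_dt TEXT);
    CREATE TABLE d_sales_detail (tx_id INTEGER PRIMARY KEY, sales_id INTEGER,
                                 sales_dt TEXT, sale_amt REAL);
    INSERT INTO sales VALUES (1, '1999-11-20'), (2, '1999-11-27'), (3, '1999-12-24');
    -- Sale 2 has two detail rows, so its header data is duplicated.
    INSERT INTO d_sales_detail VALUES (1, 1, '1999-11-20', 5.0),
                                      (2, 2, '1999-11-27', 10.0),
                                      (3, 2, '1999-11-27', 20.0),
                                      (4, 3, '1999-12-24', 7.5);
""")

date_filter = "sales_dt BETWEEN '1999-11-26' AND '1999-12-25'"

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Header table: one row per sale, so a plain count is correct.
header_count = scalar(f"SELECT COUNT(*) FROM sales WHERE {date_filter}")

# Pre-joined table: duplicates must be collapsed per sale.
distinct_count = scalar(
    f"SELECT COUNT(DISTINCT sales_id) FROM d_sales_detail WHERE {date_filter}")

# A plain count on the pre-joined table counts detail rows, not sales.
naive_count = scalar(f"SELECT COUNT(*) FROM d_sales_detail WHERE {date_filter}")

print(header_count, distinct_count, naive_count)
```

The naive count overstates the number of sales, which is the ease-of-use hazard; the distinct count is correct but requires de-duplicating sales_id values across all qualifying detail rows.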
Considerations in Assessing
De-normalization
Performance implications
Storage implications
Ease-of-use implications
Maintenance implications
Most commonly missed/disregarded.
Column Replication or Movement
Take columns that are frequently accessed via large scale joins and
replicate (or move) them into detail table(s) to avoid join
operation.
Avoids performance impact of the frequent joins.
Increases storage requirements for database.
Possible to “move” frequently accessed column to detail instead of
replicating it.
Note: This technique is no different from a limited form of the pre-join
denormalization described previously.
Before: Table_1 (ColA, ColB); Table_2 (ColA, ColC, ColD, …, ColZ)
After: Table_1’ (ColA, ColB, ColC); Table_2 (ColA, ColC, ColD, …, ColZ)
Column Replication or Movement
Health Care DW Example: Take member_id from claim header
and move it to claim detail.
Result: An extra ten bytes per row on claim line table allows
avoiding join to claim header table on some (many?) queries.
Which normal form does this technique violate?
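The health care example can be sketched with Python's built-in sqlite3 module. The table names `claim_header` and `claim_line`, the `paid_amt` column, and the sample rows are hypothetical; only `member_id` comes from the slide. The sketch replicates the column rather than moving it, so the header keeps its copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claim_header (claim_id INTEGER PRIMARY KEY, member_id TEXT);
    CREATE TABLE claim_line (line_id INTEGER PRIMARY KEY, claim_id INTEGER, paid_amt REAL);
    INSERT INTO claim_header VALUES (1, 'M001'), (2, 'M002');
    INSERT INTO claim_line VALUES (1, 1, 40.0), (2, 1, 60.0), (3, 2, 25.0);

    -- Replicate member_id from the header onto every claim line.
    ALTER TABLE claim_line ADD COLUMN member_id TEXT;
    UPDATE claim_line
       SET member_id = (SELECT h.member_id
                        FROM claim_header AS h
                        WHERE h.claim_id = claim_line.claim_id);
""")

# Paid amount per member, using the join to the header.
paid_with_join = dict(conn.execute("""
    SELECT h.member_id, SUM(l.paid_amt)
    FROM claim_header AS h JOIN claim_line AS l ON l.claim_id = h.claim_id
    GROUP BY h.member_id
"""))

# The same question answered from the claim line table alone.
paid_without_join = dict(conn.execute(
    "SELECT member_id, SUM(paid_amt) FROM claim_line GROUP BY member_id"))

print(paid_with_join, paid_without_join)
```

Because member_id now depends on claim_id rather than on the claim line's own key, any change to a claim's member has to be applied to every line of that claim, which is the maintenance cost to weigh.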
Column Replication or Movement
Beware of the results of de-normalization:
Assuming a 100 byte record before the denormalization, all scans
through the claim line detail will now take 10% longer than
previously.
A significant percentage of queries must get benefit from access to
the denormalized column in order to justify movement into the
claim line table.
Need to quantify both cost and benefit of each denormalization
decision.
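One way to put rough numbers on this trade-off is a toy cost model. Everything beyond the 100-byte record and the 10 added bytes is an assumption made for illustration: the model treats every query as a full scan costing one unit, charges the wider record to all queries, and credits a hypothetical `join_saving` to the fraction of queries that can skip the join.

```python
record_bytes = 100   # claim line record before denormalization (from the slide)
added_bytes = 10     # replicated member_id (from the slide)

# Every scan reads proportionally more bytes: 10 / 100 = 10% longer.
scan_penalty = added_bytes / record_bytes

def net_gain(helped_fraction, join_saving, penalty=scan_penalty):
    """Toy model (assumption): change in average query cost, in baseline-scan units.

    All queries pay `penalty`; `helped_fraction` of them save `join_saving`
    by no longer joining to the header. Positive means a net win.
    """
    return helped_fraction * join_saving - penalty

def break_even_fraction(join_saving, penalty=scan_penalty):
    """Fraction of queries that must benefit before the change pays off."""
    return penalty / join_saving

# Hypothetical case: skipping the join saves half of a query's cost.
print(break_even_fraction(0.5))
print(net_gain(0.4, 0.5))
```

Under these assumed numbers, at least one query in five must use the replicated column before the change pays for itself. A real assessment would replace the model with measured scan and join costs for the actual workload.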
Column Replication or Movement
May want to replicate columns in order to facilitate co-location of commonly joined
tables.
Before denormalization:
A three table join requires re-distribution of significant amounts of data to answer many
important questions related to customer transaction behavior.
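CustTable: Customer_Id | Customer_Nm | Address | Ph | …
AcctTable: Account_Id | Customer_Id | Balance$ | Open_Dt | …
TrxTable: Tx_Id | Account_Id | Tx$ | Tx_Dt | Location_Id | …
Relationships: CustTable (1) to AcctTable (m) on Customer_Id; AcctTable (1) to TrxTable (m) on Account_Id.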

More Related Content

What's hot

Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyMark Ginnebaugh
 
An introduction to data warehousing
An introduction to data warehousingAn introduction to data warehousing
An introduction to data warehousingShahed Khalili
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousingsumit621
 
DM Lecture 3
DM Lecture 3DM Lecture 3
DM Lecture 3asad199
 
Data Warehouse Design & Dimensional Modeling
Data Warehouse Design & Dimensional ModelingData Warehouse Design & Dimensional Modeling
Data Warehouse Design & Dimensional ModelingCode Mastery
 
DATA Warehousing & Data Mining
DATA Warehousing & Data MiningDATA Warehousing & Data Mining
DATA Warehousing & Data Miningcpjcollege
 
Chapter 13 data warehousing
Chapter 13   data warehousingChapter 13   data warehousing
Chapter 13 data warehousingsumit621
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report Tom Donoghue
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data LakeCaserta
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaRadhika Kotecha
 
Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...
Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...
Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...Erik Fransen
 
142230 633685297550892500
142230 633685297550892500142230 633685297550892500
142230 633685297550892500sumit621
 
Datamining
DataminingDatamining
Dataminingsumit621
 
Basic Introduction of Data Warehousing from Adiva Consulting
Basic Introduction of  Data Warehousing from Adiva ConsultingBasic Introduction of  Data Warehousing from Adiva Consulting
Basic Introduction of Data Warehousing from Adiva Consultingadivasoft
 
Data Ware Housing And Data Mining
Data Ware Housing And Data MiningData Ware Housing And Data Mining
Data Ware Housing And Data Miningcpjcollege
 
Warehouse components
Warehouse componentsWarehouse components
Warehouse componentsganblues
 
WHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEM
WHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEMWHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEM
WHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEMRajaraj64
 
001 More introduction to big data analytics
001   More introduction to big data analytics001   More introduction to big data analytics
001 More introduction to big data analyticsDendej Sawarnkatat
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 

What's hot (20)

Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case Study
 
An introduction to data warehousing
An introduction to data warehousingAn introduction to data warehousing
An introduction to data warehousing
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
 
DM Lecture 3
DM Lecture 3DM Lecture 3
DM Lecture 3
 
Data Warehouse Design & Dimensional Modeling
Data Warehouse Design & Dimensional ModelingData Warehouse Design & Dimensional Modeling
Data Warehouse Design & Dimensional Modeling
 
DATA Warehousing & Data Mining
DATA Warehousing & Data MiningDATA Warehousing & Data Mining
DATA Warehousing & Data Mining
 
Chapter 13 data warehousing
Chapter 13   data warehousingChapter 13   data warehousing
Chapter 13 data warehousing
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report
 
Part1
Part1Part1
Part1
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika Kotecha
 
Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...
Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...
Best Practices: Datawarehouse Automation Conference September 20, 2012 - Amst...
 
142230 633685297550892500
142230 633685297550892500142230 633685297550892500
142230 633685297550892500
 
Datamining
DataminingDatamining
Datamining
 
Basic Introduction of Data Warehousing from Adiva Consulting
Basic Introduction of  Data Warehousing from Adiva ConsultingBasic Introduction of  Data Warehousing from Adiva Consulting
Basic Introduction of Data Warehousing from Adiva Consulting
 
Data Ware Housing And Data Mining
Data Ware Housing And Data MiningData Ware Housing And Data Mining
Data Ware Housing And Data Mining
 
Warehouse components
Warehouse componentsWarehouse components
Warehouse components
 
WHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEM
WHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEMWHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEM
WHAT IS A DATA LAKE? Know DATA LAKES & SALES ECOSYSTEM
 
001 More introduction to big data analytics
001   More introduction to big data analytics001   More introduction to big data analytics
001 More introduction to big data analytics
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 

Viewers also liked

Ici final project report
Ici final project reportIci final project report
Ici final project reportJıa Yıı
 
Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...
Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...
Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...ecrireland
 
1st group!!
1st group!! 1st group!!
1st group!! ichaa17
 
Why I love the Rain and You Will too - Guarenteed
Why I love the Rain and You Will too - GuarenteedWhy I love the Rain and You Will too - Guarenteed
Why I love the Rain and You Will too - GuarenteedJane Coombs
 
Lu siau vay_616_wds_
Lu siau vay_616_wds_Lu siau vay_616_wds_
Lu siau vay_616_wds_Vay Lu
 
Health supervision policy for the workplace
Health supervision policy for the workplaceHealth supervision policy for the workplace
Health supervision policy for the workplaceJane Coombs
 
Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)
Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)
Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)Nurul Aron
 
Programme on recently recruited clerks of UCB/DCC/State Cooperative Banks
Programme on recently recruited clerks of UCB/DCC/State Cooperative BanksProgramme on recently recruited clerks of UCB/DCC/State Cooperative Banks
Programme on recently recruited clerks of UCB/DCC/State Cooperative Banksvamnicom123
 
Fit notes and work
Fit notes and workFit notes and work
Fit notes and workJane Coombs
 
Web coding principle
Web coding principleWeb coding principle
Web coding principleZongYing Lyu
 
經濟部訴願委員會第A410501007號決定書
經濟部訴願委員會第A410501007號決定書經濟部訴願委員會第A410501007號決定書
經濟部訴願委員會第A410501007號決定書Max Chang
 

Viewers also liked (20)

Forever Living Products… where ordinary people achieve extraordinary results
Forever Living Products… where ordinary people achieve extraordinary resultsForever Living Products… where ordinary people achieve extraordinary results
Forever Living Products… where ordinary people achieve extraordinary results
 
Ici final project report
Ici final project reportIci final project report
Ici final project report
 
Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...
Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...
Ken Hughes and morning presentations at ECR Ireland Category Management Shopp...
 
Hukum newton gravitasi
Hukum newton gravitasiHukum newton gravitasi
Hukum newton gravitasi
 
Engranajes fotos
Engranajes fotosEngranajes fotos
Engranajes fotos
 
1st group!!
1st group!! 1st group!!
1st group!!
 
Balance of payments
Balance of paymentsBalance of payments
Balance of payments
 
Why I love the Rain and You Will too - Guarenteed
Why I love the Rain and You Will too - GuarenteedWhy I love the Rain and You Will too - Guarenteed
Why I love the Rain and You Will too - Guarenteed
 
Lu siau vay_616_wds_
Lu siau vay_616_wds_Lu siau vay_616_wds_
Lu siau vay_616_wds_
 
Obesity
ObesityObesity
Obesity
 
Health supervision policy for the workplace
Health supervision policy for the workplaceHealth supervision policy for the workplace
Health supervision policy for the workplace
 
Cs437 lecture 13
Cs437 lecture 13Cs437 lecture 13
Cs437 lecture 13
 
Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)
Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)
Year 7 energy_resources_and_electrical_circuits_mark_scheme (1)
 
Programme on recently recruited clerks of UCB/DCC/State Cooperative Banks
Programme on recently recruited clerks of UCB/DCC/State Cooperative BanksProgramme on recently recruited clerks of UCB/DCC/State Cooperative Banks
Programme on recently recruited clerks of UCB/DCC/State Cooperative Banks
 
Fit notes and work
Fit notes and workFit notes and work
Fit notes and work
 
5G Info Briefing
5G Info Briefing 5G Info Briefing
5G Info Briefing
 
Perusahaan jasa
Perusahaan jasaPerusahaan jasa
Perusahaan jasa
 
Web coding principle
Web coding principleWeb coding principle
Web coding principle
 
Digital business briefing January 2015
Digital business briefing January 2015Digital business briefing January 2015
Digital business briefing January 2015
 
經濟部訴願委員會第A410501007號決定書
經濟部訴願委員會第A410501007號決定書經濟部訴願委員會第A410501007號決定書
經濟部訴願委員會第A410501007號決定書
 

Similar to Cs437 lecture 1-6

Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
1-_Intro_to_Data_Minning__DWH.ppt
1-_Intro_to_Data_Minning__DWH.ppt1-_Intro_to_Data_Minning__DWH.ppt
1-_Intro_to_Data_Minning__DWH.pptBsMath3rdsem
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data WarehousingJason S
 
Information Systems For Business and BeyondChapter 4Data a.docx
Information Systems For Business and BeyondChapter 4Data a.docxInformation Systems For Business and BeyondChapter 4Data a.docx
Information Systems For Business and BeyondChapter 4Data a.docxjaggernaoma
 
Emerging database landscape july 2011
Emerging database landscape july 2011Emerging database landscape july 2011
Emerging database landscape july 2011navaidkhan
 
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysNEWYORKSYS-IT SOLUTIONS
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
The Art of Requesting Data from IT
The Art of Requesting Data from ITThe Art of Requesting Data from IT
The Art of Requesting Data from ITBrad Adams
 
Lecture 01.ppt
Lecture 01.pptLecture 01.ppt
Lecture 01.pptHFLEX
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfDatacademy.ai
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?RTTS
 
Making MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceMaking MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceCalpont
 
Database Shootout: What's best for BI?
Database Shootout: What's best for BI?Database Shootout: What's best for BI?
Database Shootout: What's best for BI?Jos van Dongen
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolutionmark madsen
 
UNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docxUNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docxDURGADEVIL
 

Similar to Cs437 lecture 1-6 (20)

Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
1-_Intro_to_Data_Minning__DWH.ppt
1-_Intro_to_Data_Minning__DWH.ppt1-_Intro_to_Data_Minning__DWH.ppt
1-_Intro_to_Data_Minning__DWH.ppt
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Information Systems For Business and BeyondChapter 4Data a.docx
Information Systems For Business and BeyondChapter 4Data a.docxInformation Systems For Business and BeyondChapter 4Data a.docx
Information Systems For Business and BeyondChapter 4Data a.docx
 
Emerging database landscape july 2011
Emerging database landscape july 2011Emerging database landscape july 2011
Emerging database landscape july 2011
 
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
 
Big Data
Big DataBig Data
Big Data
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Presentation
PresentationPresentation
Presentation
 
The Art of Requesting Data from IT
The Art of Requesting Data from ITThe Art of Requesting Data from IT
The Art of Requesting Data from IT
 
Lecture 01.ppt
Lecture 01.pptLecture 01.ppt
Lecture 01.ppt
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
 
Making MySQL Great For Business Intelligence
Making MySQL Great For Business IntelligenceMaking MySQL Great For Business Intelligence
Making MySQL Great For Business Intelligence
 
Database Shootout: What's best for BI?
Database Shootout: What's best for BI?Database Shootout: What's best for BI?
Database Shootout: What's best for BI?
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Date Analysis .pdf
Date Analysis .pdfDate Analysis .pdf
Date Analysis .pdf
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
 
UNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docxUNIT-5 DATA WAREHOUSING.docx
UNIT-5 DATA WAREHOUSING.docx
 

More from Aneeb_Khawar

Cs437 lecture 16-18
Cs437 lecture 16-18Cs437 lecture 16-18
Cs437 lecture 16-18Aneeb_Khawar
 
Cs437 lecture 14_15
Cs437 lecture 14_15Cs437 lecture 14_15
Cs437 lecture 14_15Aneeb_Khawar
 
Cs437 lecture 10-12
Cs437 lecture 10-12Cs437 lecture 10-12
Cs437 lecture 10-12Aneeb_Khawar
 
Developing for Windows 8 based devices
Developing for Windows 8 based devicesDeveloping for Windows 8 based devices
Developing for Windows 8 based devicesAneeb_Khawar
 

More from Aneeb_Khawar (6)

Cs437 lecture 16-18
Cs437 lecture 16-18Cs437 lecture 16-18
Cs437 lecture 16-18
 
Cs437 lecture 14_15
Cs437 lecture 14_15Cs437 lecture 14_15
Cs437 lecture 14_15
 
Cs437 lecture 10-12
Cs437 lecture 10-12Cs437 lecture 10-12
Cs437 lecture 10-12
 
Cs437 lecture 09
Cs437 lecture 09Cs437 lecture 09
Cs437 lecture 09
 
Cs437 lecture 7-8
Cs437 lecture 7-8Cs437 lecture 7-8
Cs437 lecture 7-8
 
Developing for Windows 8 based devices
Developing for Windows 8 based devicesDeveloping for Windows 8 based devices
Developing for Windows 8 based devices
 

Cs437 lecture 1-6

  • 1. Lecture 1. Dr. Fawad Hussain, GIK Institute, Fall 2015. Data Warehousing and Mining (CS437). Some lectures in this course have been partially adapted from a lecture series by Stephen A. Brobst, Chief Technology Officer at Teradata and professor at MIT.
  • 2. General Course Description. Data Warehousing: What is the motivation behind Data Warehousing and Mining? Advanced Indexing, Query Processing and Optimization. Building Data Warehouses. Data Cubes, OLAP, De-Normalization, etc. Data Mining Techniques: Regression, Clustering, Decision Trees. Other Information: Office Hours (posted on the office door). Office: G03 (FCSE). Course TA (Mr. Bilal).
  • 3. Text Books (Optional). Introduction to Data Mining; Tan, Steinbach & Kumar. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2nd Edition, March 2006, ISBN 1-55860-901-6. Building a Data Warehouse for Decision Support by Vidette Poe. Fundamentals of Database Systems by Elmasri and Navathe, Addison-Wesley, 5th Edition, 2007.
  • 4. Grading Plan for Course (item: % of grade, tentative number). Midterm Exam: 25%, 1. Quizzes: 10%, 6. Project: 20%, 2. Final Exam: 45%, 1.
  • 9. Why this Course? The world is changing (actually changed), either change or be left behind. Missing the opportunities or going in the wrong direction has prevented us from growing. What is the right direction? Harnessing the data, in a knowledge driven economy.
  • 10. The Need Knowledge is power, Intelligence is absolute power! “Drowning in data and starving for information”
  • 12. Historical Overview. 1960: Master Files & Reports. 1965: Lots of Master files! 1970: Direct Access Memory & DBMS. 1975: Online high performance transaction processing. 1980: PCs and 4GL Technology (MIS/DSS). Post 1990: Data Warehousing and Data Mining.
  • 13. Crises of Credibility What is the financial health of our company? -10% +10% ??
  • 14. Why a Data Warehouse? Data recording and storage is growing. History is an excellent predictor of the future. Gives a total view of the organization. Intelligent decision-support is required for decision-making.
  • 15. Why Data Warehouse? The size of data sets is going up ↑. The cost of data storage is coming down ↓. The amount of data the average business collects and stores is doubling every year. Total hardware and software cost to store and manage 1 Mbyte of data: 1990: ~ $15; 2002: ~ ¢15 (down 100 times); by 2007: < ¢1 (down 150 times).
  • 16. Why Data Warehouse? A Few Examples: WalMart: 24 TB. France Telecom: ~100 TB. CERN: up to 20 PB by 2006. Stanford Linear Accelerator Center (SLAC): 500 TB. Businesses demand Intelligence (BI). Complex questions from integrated data. “Intelligent Enterprise”
  • 17. DBMS Approach: List of all items that were sold last month? List of all items purchased by X? The total sales of the last month grouped by branch? How many sales transactions occurred during the month of January?
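The questions on this slide map directly onto ordinary SQL, which is why a conventional DBMS handles them well. The following is a minimal sketch using Python's built-in sqlite3 module; the `sales` table, its columns and the sample rows are hypothetical and do not come from the lecture.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (branch TEXT, item TEXT, customer TEXT, sale_dt TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("North", "pen", "X", "2015-01-05", 2.0),
        ("North", "ink", "Y", "2015-01-09", 5.0),
        ("South", "pen", "X", "2015-01-20", 2.0),
        ("South", "pad", "Z", "2015-02-02", 3.0),
    ],
)

# "The total sales of the last month grouped by branch?"
sales_by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales "
    "WHERE sale_dt BETWEEN '2015-01-01' AND '2015-01-31' "
    "GROUP BY branch ORDER BY branch"
).fetchall()

# "How many sales transactions occurred during the month of January?"
january_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sale_dt BETWEEN '2015-01-01' AND '2015-01-31'"
).fetchone()[0]

# "List of all items purchased by X?"
items_for_x = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT item FROM sales WHERE customer = 'X' ORDER BY item"
    )
]

print(sales_by_branch, january_count, items_for_x)
```

Each question names exactly what to retrieve, so it can be written as a query in advance; the questions on the next slide do not have that property.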
  • 18. Intelligent Enterprise: Which items sell together? Which items to stock? Where and how to place the items? What discounts to offer? How best to target customers to increase sales at a branch? Which customers are most likely to respond to my next promotional campaign, and why?
  • 19. What is a Data Warehouse? A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers.
  • 20. What is Data Mining? “There are things that we know that we know… there are things that we know that we don’t know… there are things that we don’t know we don’t know.” Donald Rumsfeld, former US Secretary of Defense
  • 21. What is Data Mining? Tell me something that I should know. When you don’t know what you should be knowing, how do you write SQL? You can’t!
  • 22. What is Data Mining? Knowledge Discovery in Databases (KDD). Data mining digs out valuable, non-trivial information from large, multidimensional, apparently unrelated databases (data sets). It’s the integration of business knowledge, people, information, algorithms, statistics and computing technology. Discovering useful hidden patterns and relationships in data.
  • 23. HUGE VOLUME: THERE IS WAY TOO MUCH DATA & GROWING! Data is collected much faster than it can be processed or managed. NASA’s Earth Observing System (EOS) alone will collect 15 petabytes by 2007 (15,000,000,000,000,000 bytes). • Much of which won't be used - ever! • Much of which won't be seen - ever! • Why not? There's so much volume that the usefulness of some of it will never be discovered. SOLUTION: Reduce the volume and/or raise the information content by structuring, querying, filtering, summarizing, aggregating, mining...
  • 24. Requires solution of fundamentally new problems: 1. developing algorithms and systems to mine large, massive and high dimensional data sets; 2. developing algorithms and systems to mine new types of data (images, music, videos); 3. developing algorithms, protocols, and other infrastructure to mine distributed data; 4. improving the ease of use of data mining systems; and 5. developing appropriate privacy and security techniques for data mining.
  • 25. Future of Data Mining. 10 Hottest Jobs of year 2025: TIME Magazine, 22 May 2000. 10 emerging areas of technology: MIT’s Magazine of Technology Review, Jan/Feb 2001.
  • 27. Logical and Physical Database Design
  • 28. Data Mining is one step of Knowledge Discovery in Databases (KDD). Raw Data → Preprocessing (Extraction, Transformation, Cleansing, Validation) → Clean Data → Data Mining (Identify Patterns, Create Models) → Interpretation/Evaluation (Visualization, Feature Extraction, Analysis) → Knowledge ($ $ $)
  • 30. Information Evolution in a Data Warehouse Environment. STAGE 1: REPORT. WHAT happened? (Primarily batch.) STAGE 2: ANALYZE. WHY did it happen? (Increase in ad hoc queries.) STAGE 3: PREDICT. What WILL happen? (Analytical modeling grows.) STAGE 4: OPERATIONALIZE. What IS happening? (Continuous update & time-sensitive queries become important.) STAGE 5: ACTIVATE. What do you WANT to happen? (Event-based triggering takes hold.) Workloads: Batch, Ad Hoc, Analytics, Continuous Update/Short Queries, Event-Based Triggering.
  • 31. Normalization and Denormalization. Normalization: A relational database relates subsets of a dataset to each other. A dataset is a set of tables (or a schema in Oracle). A table defines the structure and contains the row and column data for each subset. Tables are related to each other by linking them on common items and values between two tables. Normalization is the optimization of record keeping for insertion, deletion and update (in addition to selection, of course). De-normalization: Why denormalize? When to denormalize? How to denormalize?
  • 34. Why De-normalization? Do you have performance problems? If not, then you shouldn’t be studying this course! The root cause of 99% of database performance problems is poorly written SQL code, usually as a result of a poorly optimized underlying structure. Do you have disk storage problems? Consider separating large, less-used datasets and frequently used datasets.
  • 35. When to Denormalize? Denormalization sometimes implies the undoing of some of the steps of Normalization. Denormalization is not necessarily the reverse of the steps of Normalization. Denormalization does not imply complete removal of specific Normal Form levels. Denormalization results in duplication. It is quite possible that the table structure is much too granular or possibly even incompatible with the structure imposed by applications. Denormalization usually involves merging of multiple transactional tables or multiple static tables into single
  • 36. When to Denormalize? Look for one-to-one relationships. These may be unnecessary if the required removal of null values causes costly joins. Disk space is cheap. Complex SQL join statements can destroy performance. Do you have many-to-many join resolution entities? Are they all necessary? Are they all used by applications? When constructing SQL statement joins, are you finding many tables in joins where those tables are scattered throughout the entity relationship diagram? When searching for static data items such as customer details, are you querying a single table or multiple tables? A single table is much more efficient than multiple tables.
  • 37. How to Denormalize? Common Forms of Denormalization Pre-join de-normalization. Column replication or movement. Pre-aggregation.
  • 38. Considerations in Assessing De-normalization Performance implications Storage implications Ease-of-use implications Maintenance implications Most commonly missed/disregarded.
  • 39. Pre-join Denormalization Take tables which are frequently joined and “glue” them together into a single table. Avoids performance impact of the frequent joins. Typically increases storage requirements.
  • 40. Pre-join Denormalization. A simplified retail example... Before denormalization: sales header (sale_id, store_id, sale_dt, …) related 1 : m to sales detail (tx_id, sale_id, item_id, …, item_qty, sale$).
  • 41. Pre-join Denormalization. A simplified retail example... After denormalization: a single table (tx_id, sale_id, store_id, sale_dt, item_id, …, item_qty, $). Points to Ponder: Which Normal Form is being violated? Will there be maintenance issues?
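The before and after layouts can be reproduced on a tiny scale. This is a minimal sketch using Python's sqlite3 module, assuming the column names from the diagram; the table names `sales`, `sales_detail` and `d_sales_detail` are borrowed from the queries later in the deck, `sale_amt` stands in for the `sale$` column, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized layout: one header row per sale, many detail rows per sale.
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, sale_dt TEXT);
CREATE TABLE sales_detail (tx_id INTEGER PRIMARY KEY, sale_id INTEGER,
                           item_id INTEGER, item_qty INTEGER, sale_amt REAL);
INSERT INTO sales VALUES (1, 10, '1999-11-27'), (2, 10, '1999-12-30');
INSERT INTO sales_detail VALUES
    (100, 1, 7, 1, 5.0), (101, 1, 8, 2, 6.0), (102, 2, 7, 1, 5.0);

-- Pre-join: "glue" the header columns onto every detail row.
CREATE TABLE d_sales_detail AS
SELECT d.tx_id, d.sale_id, s.store_id, s.sale_dt, d.item_id, d.item_qty, d.sale_amt
FROM sales_detail AS d JOIN sales AS s ON s.sale_id = d.sale_id;
""")

# Header values are now repeated once per detail row of the same sale.
denormalized_rows = conn.execute(
    "SELECT tx_id, sale_id, store_id, sale_dt FROM d_sales_detail ORDER BY tx_id"
).fetchall()

# Correcting one sale's date touches one row in the normalized header table...
header_rows_touched = conn.execute(
    "UPDATE sales SET sale_dt = '1999-11-28' WHERE sale_id = 1"
).rowcount
# ...but every detail row of that sale in the pre-joined table.
detail_rows_touched = conn.execute(
    "UPDATE d_sales_detail SET sale_dt = '1999-11-28' WHERE sale_id = 1"
).rowcount

print(denormalized_rows, header_rows_touched, detail_rows_touched)
```

The difference in rows touched by the same logical correction is one concrete way to approach the maintenance question raised in the Points to Ponder.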
  • 42. Pre-join Denormalization Storage implications... Assume 1:3 record count ratio between sales header and detail. Assume 1 billion sales (3 billion sales detail). Assume 8 byte sales_id. Assume 30 byte header and 40 byte detail records. Which businesses will be most hurt, in terms of storage capacity, by this form of denormalization?
  • 43. Pre-join Denormalization Storage implications... Before denormalization: 150 GB raw data. After denormalization: 186 GB raw data. Net result is 24% increase in raw data size for the database. Pre-join may actually result in space saving, if many concurrent queries are demanding frequent joins on the joined tables! HOW?
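The 150 GB and 186 GB figures follow from the assumptions on the previous slide. The sketch below reproduces them, assuming 1 GB = 10^9 bytes and assuming each denormalized row is a detail record plus a header record minus one copy of the shared 8 byte key; both assumptions are inferred because they match the slide's numbers, not stated in the lecture.

```python
GB = 10**9                      # assumption: decimal gigabytes

HEADER_ROWS = 1_000_000_000     # 1 billion sales
DETAIL_ROWS = 3 * HEADER_ROWS   # 1:3 header-to-detail ratio
HEADER_BYTES = 30
DETAIL_BYTES = 40
KEY_BYTES = 8                   # sales_id, present in both record types

# Before: header and detail tables stored separately.
before_bytes = HEADER_ROWS * HEADER_BYTES + DETAIL_ROWS * DETAIL_BYTES

# After: every detail row carries the header columns; the key is stored once.
denormalized_row_bytes = DETAIL_BYTES + HEADER_BYTES - KEY_BYTES
after_bytes = DETAIL_ROWS * denormalized_row_bytes

before_gb = before_bytes // GB
after_gb = after_bytes // GB
increase_pct = (after_bytes - before_bytes) * 100 // before_bytes

print(before_gb, after_gb, increase_pct)
```

Under these assumptions the growth depends on the header-to-detail ratio: the more detail rows per sale, the more times each header is repeated.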
  • 44. Pre-join Denormalization. Sample Query: What was my total $ volume between Thanksgiving and Christmas in 1999?
  • 45. Pre-join Denormalization. Before de-normalization: select sum(sales_detail.sale_amt) from sales, sales_detail where sales.sales_id = sales_detail.sales_id and sales.sales_dt between '1999-11-26' and '1999-12-25';
  • 46. Pre-join Denormalization. After de-normalization: select sum(d_sales_detail.sale_amt) from d_sales_detail where d_sales_detail.sales_dt between '1999-11-26' and '1999-12-25'; No join operation performed. How to compare performance?
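One hedged way to approach the comparison is to run both queries against the same data and inspect the plans the engine chooses. The sketch below uses Python's sqlite3 module with the table and column names from the two queries above; the sample rows are invented, and SQLite's plans on a toy dataset say nothing about a production warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, store_id INTEGER, sales_dt TEXT);
CREATE TABLE sales_detail (tx_id INTEGER PRIMARY KEY, sales_id INTEGER, sale_amt REAL);
INSERT INTO sales VALUES (1, 10, '1999-11-27'), (2, 10, '1999-12-24'), (3, 11, '2000-01-03');
INSERT INTO sales_detail VALUES (100, 1, 5.0), (101, 1, 6.0), (102, 2, 4.0), (103, 3, 9.0);
-- Pre-joined copy of the detail rows with the header columns attached.
CREATE TABLE d_sales_detail AS
SELECT d.tx_id, d.sales_id, s.store_id, s.sales_dt, d.sale_amt
FROM sales_detail AS d JOIN sales AS s ON s.sales_id = d.sales_id;
""")

BEFORE = """select sum(sales_detail.sale_amt)
from sales, sales_detail
where sales.sales_id = sales_detail.sales_id
and sales.sales_dt between '1999-11-26' and '1999-12-25'"""

AFTER = """select sum(d_sales_detail.sale_amt)
from d_sales_detail
where d_sales_detail.sales_dt between '1999-11-26' and '1999-12-25'"""

# Both formulations must agree on the answer before performance is compared.
total_before = conn.execute(BEFORE).fetchone()[0]
total_after = conn.execute(AFTER).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row describes one access step.
plan_before = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + BEFORE)]
plan_after = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + AFTER)]

print(total_before, total_after)
print(plan_before)
print(plan_after)
```

On real data, the same pattern (identical results first, then plans and measured run times on representative volumes) gives a basis for deciding whether the pre-join pays for its storage.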
  • 47. Pre-join Denormalization. But consider the question... How many sales (transactions) did I make between Thanksgiving and Christmas in 1999?
  • 48. Pre-join Denormalization. Before denormalization: select count(*) from sales where sales.sales_dt between '1999-11-26' and '1999-12-25'; After denormalization: select count(distinct d_sales_detail.sales_id) from d_sales_detail where d_sales_detail.sales_dt between '1999-11-26' and '1999-12-25'; Which query will perform better?
  • 49. Pre-join Denormalization Performance implications... Performance penalty for count distinct (forces sort) can be quite large. May be worth 30 GB overhead to keep sales header records if this is a common query structure because both ease-of-use and performance will be enhanced (at some cost in storage)?
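The two counting queries from the preceding slides can be checked side by side. This is a minimal sketch with Python's sqlite3 module and invented rows; it only demonstrates why the denormalized table needs `count(distinct ...)`, not how large the sort penalty is on a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, sales_dt TEXT);
CREATE TABLE d_sales_detail (tx_id INTEGER PRIMARY KEY, sales_id INTEGER,
                             sales_dt TEXT, sale_amt REAL);
INSERT INTO sales VALUES (1, '1999-11-27'), (2, '1999-12-01'), (3, '2000-01-05');
INSERT INTO d_sales_detail VALUES
    (100, 1, '1999-11-27', 5.0), (101, 1, '1999-11-27', 6.0),
    (102, 2, '1999-12-01', 4.0), (103, 3, '2000-01-05', 9.0);
""")

window = ("1999-11-26", "1999-12-25")

# Normalized header table: one row per sale, so a plain count is enough.
header_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sales_dt BETWEEN ? AND ?", window
).fetchone()[0]

# Pre-joined table: one row per line item, so sales must be de-duplicated.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT sales_id) FROM d_sales_detail WHERE sales_dt BETWEEN ? AND ?",
    window,
).fetchone()[0]

# A plain count on the pre-joined table counts line items, not sales.
naive_count = conn.execute(
    "SELECT COUNT(*) FROM d_sales_detail WHERE sales_dt BETWEEN ? AND ?", window
).fetchone()[0]

print(header_count, distinct_count, naive_count)
```

Besides the performance cost of the distinct count, the naive version shows the ease-of-use risk: the obvious query silently returns the wrong answer once the header table is gone.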
  • 50. Considerations in Assessing De-normalization Performance implications Storage implications Ease-of-use implications Maintenance implications Most commonly missed/disregarded.
  • 51. Column Replication or Movement. Take columns that are frequently accessed via large scale joins and replicate (or move) them into detail table(s) to avoid join operation. Avoids performance impact of the frequent joins. Increases storage requirements for database. Possible to “move” frequently accessed column to detail instead of replicating it. Note: This technique is no different than a limited form of the pre-join denormalization described previously.
  • 52. Before: Table_1 (ColA, ColB) and Table_2 (ColA, ColC, ColD, …, ColZ). After: Table_1’ (ColA, ColB, ColC) and Table_2 (ColA, ColC, ColD, …, ColZ).
  • 53. Column Replication or Movement. Health Care DW Example: Take member_id from claim header and move it to claim detail. Result: An extra ten bytes per row on claim line table allows avoiding join to claim header table on some (many?) queries. Which normal form does this technique violate?
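The health care example can be sketched in a few statements. The following uses Python's sqlite3 module; only `member_id` comes from the slide, while the table names `claim_header` and `claim_line`, the other columns and the rows are hypothetical. The sketch replicates the column rather than moving it, so the header table keeps its copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim_header (claim_id INTEGER PRIMARY KEY, member_id TEXT);
CREATE TABLE claim_line (line_id INTEGER PRIMARY KEY, claim_id INTEGER, amount REAL);
INSERT INTO claim_header VALUES (1, 'M001'), (2, 'M002');
INSERT INTO claim_line VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);

-- Replicate member_id from the header onto every claim line.
ALTER TABLE claim_line ADD COLUMN member_id TEXT;
UPDATE claim_line
SET member_id = (SELECT h.member_id FROM claim_header AS h
                 WHERE h.claim_id = claim_line.claim_id);
""")

# Total claimed per member using the join to the header table.
with_join = conn.execute(
    "SELECT h.member_id, SUM(l.amount) "
    "FROM claim_line AS l JOIN claim_header AS h ON h.claim_id = l.claim_id "
    "GROUP BY h.member_id ORDER BY h.member_id"
).fetchall()

# Same question answered from the claim line table alone.
without_join = conn.execute(
    "SELECT member_id, SUM(amount) FROM claim_line "
    "GROUP BY member_id ORDER BY member_id"
).fetchall()

print(with_join, without_join)
```

Queries that only need `member_id` can now skip the header table entirely; queries that need other header columns still require the join.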
  • 54. Column Replication or Movement Beware of the results of de-normalization: Assuming a 100 byte record before the denormalization, all scans through the claim line detail will now take 10% longer than previously. A significant percentage of queries must get benefit from access to the denormalized column in order to justify movement into the claim line table. Need to quantify both cost and benefit of each denormalization decision.
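The 10% figure and the idea of a break-even point can be made concrete with a toy cost model. The sketch below assumes scan time is proportional to record width, which is what the slide's 10% implies; the `join_saving_pct` parameter and the linear break-even formula are hypothetical simplifications introduced here, not something stated in the lecture.

```python
def scan_overhead_pct(record_bytes: float, added_bytes: float) -> float:
    """Extra scan cost, in percent, assuming cost grows linearly with row width."""
    return 100 * added_bytes / record_bytes


def break_even_fraction(overhead_pct: float, join_saving_pct: float) -> float:
    """Fraction of queries that must benefit for the change to pay off.

    Toy model (an assumption): every query pays overhead_pct, and each
    benefiting query saves join_saving_pct by skipping the join.
    """
    return overhead_pct / join_saving_pct


# Slide's example: a 100 byte claim line record gains a 10 byte member_id.
overhead = scan_overhead_pct(record_bytes=100, added_bytes=10)

# Hypothetical: avoiding the join makes a benefiting query 50% cheaper.
needed = break_even_fraction(overhead_pct=overhead, join_saving_pct=50)

print(overhead, needed)
```

With these illustrative numbers, at least one query in five would have to use the replicated column before the wider rows are justified; real decisions would need measured costs rather than this linear approximation.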
  • 55. Column Replication or Movement. May want to replicate columns in order to facilitate co-location of commonly joined tables. Before denormalization: A three table join requires re-distribution of significant amounts of data to answer many important questions related to customer transaction behavior. CustTable (Customer_Id, Customer_Nm, Address, Ph, …) related 1 : m to AcctTable (Account_Id, Customer_Id, Balance$, Open_Dt, …), which is related 1 : m to TrxTable (Tx_Id, Account_Id, Tx$, Tx_Dt, Location_Id, …).