SlideShare a Scribd company logo
1 of 89
1
 Part 1: Data Warehouses
 Part 2: OLAP
 Part 3: Data Mining
 Part 4: Big Data
2
3
 I can’t find the data I need
◦ data is scattered over the network
◦ many versions, subtle differences
4
 I can’t get the data I need
need an expert to get the data
 I can’t understand the data I
found
available data poorly documented
 I can’t use the data I found
results are unexpected
data needs to be transformed from
one form to other
A single, complete and
consistent store of data
obtained from a variety
of different sources
made available to end
users in a what they can
understand and use in a
business context.
[Barry Devlin]
5
6
Which are our
lowest/highest margin
customers ?
Who are my customers
and what products
are they buying?
Which customers
are most likely to go
to the competition ?
What impact will
new products/services
have on revenue
and margins?
What product prom-
-otions have the biggest
impact on revenue?
What is the most
effective distribution
channel?
 Used to manage and control business
 Data is historical or point-in-time
 Optimized for inquiry rather than update
 Used by managers and end-users to
understand the business and make
judgements
7
 Since 1970s, organizations gained
competitive advantage through systems
that automate business processes to offer
more efficient and cost-effective services to
the customer.
 This resulted in accumulation of growing
amounts of data in operational databases.
8
 A subject-oriented, integrated, time-variant,
and non-volatile collection of data in support
of management’s decision-making process
(Inmon, 1993).
9
 The warehouse is organized around the
major subjects of the enterprise (e.g.
customers, products, and sales) rather than
the major application areas (e.g. customer
invoicing, stock control, and product
sales).
 This is reflected in the need to store
decision-support data rather than
application-oriented data.
10
 The data warehouse integrates corporate
application-oriented data from different
source systems, which often includes data
that is inconsistent.
 The integrated data source must be made
consistent to present a unified view of the
data to the users.
11
 Data in the warehouse is only accurate and
valid at some point in time or over some
time interval.
 Time-variance is also shown in the
extended time that the data is held, the
implicit or explicit association of time with
all data, and the fact that the data
represents a series of snapshots.
12
 Data in the warehouse is not updated in real-
time but is refreshed from operational
systems on a regular basis.
 New data is always added as a supplement to
the database, rather than a replacement.
13
 Potential high returns on investment
 Competitive advantage
 Increased productivity of corporate decision-
makers
14
15
 The types of queries that a data warehouse
is expected to answer ranges from the
relatively simple to the highly complex and
is dependent on the type of end-user
access tools used.
 End-user access tools include:
◦ Reporting, query, and application development
tools
◦ Executive information systems (EIS)
◦ OLAP tools
◦ Data mining tools
16
 What was the total revenue for Scotland in the third quarter of
2004?
 What was the total revenue for property sales for each type of
property in Great Britain in 2003?
 What are the three most popular areas in each city for the renting
of property in 2004 and how does this compare with the figures
for the previous two years?
 What is the monthly revenue for property sales at each branch
office, compared with rolling 12-monthly prior figures?
 What would be the effect on property sales in the different
regions of Britain if legal costs went up by 3.5% and Government
taxes went down by 1.5% for properties over £100,000?
 Which type of property sells for prices above the average selling
price for properties in the main cities of Great Britain and how
does this correlate to demographic data?
 What is the relationship between the total annual revenue
generated by each branch office and the total number of sales
staff assigned to each branch office?
17
 Underestimation of resources for data
loading
 Hidden problems with source systems
 Required data not captured
 Increased end-user demands
 Data homogenization
18
 High demand for resources
 Data ownership
 High maintenance
 Long duration projects
 Complexity of integration
19
20
 A subset of a data warehouse that supports
the requirements of a particular
department or business function.
 Characteristics include
◦ Focuses on only the requirements of one
department or business function.
◦ Do not normally contain detailed operational
data unlike data warehouses.
◦ More easily understood and navigated.
21
 To give users access to the data they need
to analyze most often.
 To provide data in a form that matches the
collective view of the data by a group of
users in a department or business function
area.
 To improve end-user response time due to
the reduction in the volume of data to be
accessed.
22
 To provide appropriately structured data as
dictated by the requirements of the end-
user access tools.
 Building a data mart is simpler compared
with establishing a corporate data
warehouse.
 The cost of implementing data marts is
normally less than that required to establish
a data warehouse.
23
 The potential users of a data mart are more
clearly defined and can be more easily
targeted to obtain support for a data mart
project rather than a corporate data
warehouse project.
24
25
Departmentally
Structured
Individually
Structured
Data Warehouse
Organizationally
Structured
Less
More
History
Normalized
Detailed
Data
Information
26
 Aggregation -- (total sales, percent-to-total)
 Comparison -- Budget vs. Expenses
 Ranking -- Top 10, quartile analysis
 Access to detailed and aggregate data
 Complex criteria specification
 Visualization
 Need interactive response to aggregate
queries
27
 Accompanying the growth in data
warehousing is an ever-increasing
demand by users for more powerful
access tools that provide advanced
analytical capabilities.
 There are two main types of access tools
available to meet this demand, namely
Online Analytical Processing (OLAP) and
data mining.
28
 OLAP and Data Mining differ in what they
offer the user and because of this they
are complementary technologies.
 An environment that includes a data
warehouse (or more commonly one or
more data marts) together with tools
such as OLAP and /or data mining are
collectively referred to as Business
Intelligence (BI) technologies.
29
 The dynamic synthesis, analysis, and
consolidation of large volumes of multi-
dimensional data, Codd (1993).
 Describes a technology that uses a multi-
dimensional view of aggregate data to
provide quick access to strategic
information for the purposes of advanced
analysis.
30
 Enables users to gain a deeper
understanding and knowledge about
various aspects of their corporate data
through fast, consistent, interactive access
to a wide variety of possible views of the
data.
 Allows users to view corporate data in such
a way that it is a better model of the true
dimensionality of the enterprise.
31
 Can easily answer ‘who?’ and ‘what?’
questions, however, ability to answer ‘what
if?’ and ‘why?’ type questions distinguishes
OLAP from general-purpose query tools.
 Types of analysis ranges from basic
navigation and browsing (slicing and dicing)
to calculations, to more complex analyses
such as time series and complex modeling.
32
33
 Although OLAP applications are found in
widely divergent functional areas, they all
have the following key features:
◦ multi-dimensional views of data
◦ support for complex calculations
◦ time intelligence
34
 Must provide a range of powerful
computational methods such as that required
by sales forecasting, which uses trend
algorithms such as moving averages and
percentage growth.
35
 Key feature of almost any analytical
application as performance is almost always
judged over time.
 Time hierarchy is not always used in the
same manner as other hierarchies.
 Concepts such as year-to-date and period-
over-period comparisons should be easily
defined.
36
 Increased productivity of end-users.
 Reduced backlog of applications
development for IT staff.
 Retention of organizational control over the
integrity of corporate data.
 Reduced query drag and network traffic on
OLTP systems or on the data warehouse.
 Improved potential revenue and
profitability.
37
 Example of two-dimensional query.
 What is the total revenue generated by property sales in
each city, in each quarter of 2004?’
 Choice of representation is based on types
of queries end-user may ask.
 Compare representation - three-field
relational table versus two-dimensional
matrix.
38
39
 Example of three-dimensional query.
◦ ‘What is the total revenue generated by property
sales for each type of property (Flat or House) in
each city, in each quarter of 2004?’
 Compare representation - four-field
relational table versus three-dimensional
cube.
40
41
 Cube represents data as cells in an array.
 Relational table only represents multi-
dimensional data in two dimensions.
42
 Measure - sales (actual, plan, variance)
43
Month
1 2 3 4 765
Product
Toothpaste
Juice
Cola
Milk
Cream
Soap
W
S
N
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product Region Time
Industry Country Year
Category Region Quarter
Product City Month week
Office Day
 It is a powerful
visualization tool
 It provides fast, interactive
response times
 It is good for analyzing
time series
 It can be useful to find
some clusters and outliners
 Many vendors offer OLAP
tools
44
 Andyne Computing --
Pablo
 Arbor Software --
Essbase
 Cognos -- PowerPlay
 Comshare -- Commander
OLAP
 Holistic Systems -- Holos
 Information Advantage --
AXSYS, WebOLAP
 Informix -- Metacube
 Microstrategies --
DSS/Agent
 Oracle -- Express
 Pilot -- LightShip
 Planning Sciences --
Gentium
 Platinum Technology --
ProdeaBeacon, Forest &
Trees
 SAS Institute -- SAS/EIS,
OLAP++
 Speedware -- Media
45
46
 The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and using
it to make crucial business decisions,
(Simoudis,1996).
 Involves the analysis of data and the use of
software techniques for finding hidden and
unexpected patterns and relationships in
sets of data.
47
 Reveals information that is hidden and
unexpected, as little value in finding
patterns and relationships that are already
intuitive.
 Patterns and relationships are identified by
examining the underlying rules and features
in the data.
48
 Most accurate results normally require large
volumes of data to deliver reliable
conclusions.
 Starts by developing an optimal
representation of structure of sample data
49
 Data mining can provide huge paybacks for
companies who have made a significant
investment in data warehousing.
 Relatively new technology, however already
used in a number of industries.
50
 Retail / Marketing
◦ Identifying buying patterns of customers
◦ Finding associations among customer demographic
characteristics
◦ Predicting response to mailing campaigns
◦ Market basket analysis
51
 Banking
◦ Detecting patterns of fraudulent credit card use
◦ Identifying loyal customers
◦ Predicting customers likely to change their credit
card affiliation
◦ Determining credit card spending by customer
groups
52
 Insurance
◦ Claims analysis
◦ Predicting which customers will buy new policies
 Medicine
◦ Characterizing patient behavior to predict surgery
visits
◦ Identifying successful medical therapies for
different illnesses
53
 Four main operations include:
◦ Predictive modeling
◦ Database segmentation
◦ Link analysis
◦ Deviation detection
 There are recognized associations between
the applications and the corresponding
operations.
◦ e.g. Direct marketing strategies use database
segmentation.
54
 Techniques are specific implementations of
the data mining operations.
 Each operation has its own strengths and
weaknesses.
55
56
 Similar to the human learning experience
◦ uses observations to form a model of the
important characteristics of some phenomenon.
 Uses generalizations of ‘real world’ and
ability to fit new data into a general
framework.
 Can analyze a database to determine
essential characteristics (model) about the
data set.
57
 Model is developed using a supervised
learning approach, which has two phases:
training and testing.
◦ Training builds a model using a large sample of
historical data called a training set.
◦ Testing involves trying out the model on new,
previously unseen data to determine its accuracy
and physical performance characteristics.
58
 Applications of predictive modeling include
customer retention management, credit
approval, cross selling, and direct
marketing.
 There are two techniques associated with
predictive modeling: classification and value
prediction, which are distinguished by the
nature of the variable being predicted.
59
60
 Used to estimate a continuous numeric
value that is associated with a database
record.
 Uses the traditional statistical techniques of
linear regression and nonlinear regression.
 Relatively easy-to-use and understand.
61
 Linear regression attempts to fit a straight
line through a plot of the data, such that
the line is the best representation of the
average of all observations at that point in
the plot.
 Problem is that the technique only works
well with linear data and is sensitive to the
presence of outliers (that is, data values,
which do not conform to the expected
norm).
62
 Data mining requires statistical methods that
can accommodate non-linearity, outliers, and
non-numeric data.
 Applications of value prediction include credit
card fraud detection or target mailing list
identification.
63
 Aim is to partition a database into an
unknown number of segments, or clusters, of
similar records.
 Uses unsupervised learning to discover
homogeneous sub-populations in a database
to improve the accuracy of the profiles.
64
 Less precise than other operations thus
less sensitive to redundant and irrelevant
features.
 Applications of database segmentation
include customer profiling, direct
marketing, and cross selling.
65
66
 Aims to establish links (associations)
between records, or sets of records, in a
database.
 There are three specializations
◦ Associations discovery
◦ Sequential pattern discovery
◦ Similar time sequence discovery
 Applications include product affinity
analysis, direct marketing, and stock price
movement.
67
 Finds items that imply the presence of other
items in the same event.
 Affinities between items are represented by
association rules.
◦ e.g. ‘When a customer rents property for more
than 2 years and is more than 25 years old, in
40% of cases, the customer will buy a property.
This association happens in 35% of all customers
who rent properties’.
68
 Finds patterns between events such that the
presence of one set of items is followed by
another set of items in a database of events
over a period of time.
◦ e.g. Used to understand long term customer buying
behavior.
69
 Finds links between two sets of data that are
time-dependent, and is based on the degree
of similarity between the patterns that both
time series demonstrate.
◦ e.g. Within three months of buying property, new
home owners will purchase goods such as cookers,
freezers, and washing machines.
70
 Relatively new operation in terms of
commercially available data mining tools.
 Often a source of true discovery because it
identifies outliers, which express deviation
from some previously known expectation
and norm.
71
 Can be performed using statistics and
visualization techniques or as a by-product
of data mining.
 Applications include fraud detection in the
use of credit cards and insurance claims,
quality control, and defects tracing.
72
73
What is Big Data?
What makes data, “Big” Data?
74
 No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
75
 Data Volume
◦ 44x increase from 2009 2020
◦ From 0.8 zettabytes to 35zb
 Data volume is increasing
exponentially
76
Exponentialincreasein
collected/generateddata
 Various formats, types, and
structures
 Text, numerical, images, audio,
video, sequences, time series,
social media data, multi-dim
arrays, etc…
 Static data vs. streaming data
 A single application can be
generating/collecting many
types of data
77
To extract knowledge all
these types of data need to
linked together
 Data is begin generated fast and need to be
processed fast
 Online Data Analytics
 Late decisions  missing opportunities
 Examples
◦ E-Promotions: Based on your current location, your purchase
history, what you like  send promotions right now for store
next to you
◦ Healthcare monitoring: sensors monitoring your activities and
body  any abnormal measurements require immediate
reaction
78
79
80
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture &
technology)
81
Social media andnetworks
(all of us aregenerating data)
Scientific instruments
(collecting all sorts of data)
Mobiledevices
(tracking all objects all the time)
Sensor technology andnetworks
(measuring all kinds ofdata)
 The progress and innovation is no longer hindered by the ability to collect data
 But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
82
 The Model of Generating/Consuming Data has
Changed
Old Model: Fewcompanies aregenerating data, all others areconsuming data
New Model: all of us are generating data, and all of us are consuming data
83
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
84
 Big data is more real-time in
nature than traditional DW
applications
 Traditional DW architectures
(e.g. Exadata, Teradata) are
not well-suited for big data
apps
 Shared nothing, massively
parallel processing, scale out
architectures are well-suited
for big data apps
85
 The Bottleneck is in technology
◦ New architecture, algorithms, techniques are needed
 Also in technical skills
◦ Experts in using the new technology and dealing
with big data
86
What Technology Do We Have
For Big Data ??
87
88
89

More Related Content

What's hot

Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
pcherukumalla
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 

What's hot (20)

Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guide
 
Star schema PPT
Star schema PPTStar schema PPT
Star schema PPT
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
 
OLAP OnLine Analytical Processing
OLAP OnLine Analytical ProcessingOLAP OnLine Analytical Processing
OLAP OnLine Analytical Processing
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data warehouse 21 snowflake schema
Data warehouse 21 snowflake schemaData warehouse 21 snowflake schema
Data warehouse 21 snowflake schema
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
 
Data modeling star schema
Data modeling star schemaData modeling star schema
Data modeling star schema
 
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
Microsoft SQL Server Analysis Services (SSAS) - A Practical Introduction
 
Data warehouse design
Data warehouse designData warehouse design
Data warehouse design
 
data warehouse , data mart, etl
data warehouse , data mart, etldata warehouse , data mart, etl
data warehouse , data mart, etl
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
 

Similar to Data warehouse,data mining & Big Data

enterprise-data-everywhere
enterprise-data-everywhereenterprise-data-everywhere
enterprise-data-everywhere
Bill Peer
 
Bloor Research Comparative costs and uses for data integration platforms
Bloor Research Comparative costs and uses for data integration platformsBloor Research Comparative costs and uses for data integration platforms
Bloor Research Comparative costs and uses for data integration platforms
Fiona Lew
 
On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...
Stanton Jones
 
Electronic Commerce
Electronic CommerceElectronic Commerce
Electronic Commerce
ellamee27
 

Similar to Data warehouse,data mining & Big Data (20)

Big Data, Analytics and Data Science
Big Data, Analytics and Data ScienceBig Data, Analytics and Data Science
Big Data, Analytics and Data Science
 
UNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningUNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data Mining
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
 
Semantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data LakeSemantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data Lake
 
IDC Rethinking the datacenter
IDC Rethinking the datacenterIDC Rethinking the datacenter
IDC Rethinking the datacenter
 
enterprise-data-everywhere
enterprise-data-everywhereenterprise-data-everywhere
enterprise-data-everywhere
 
To Become a Data-Driven Enterprise, Data Democratization is Essential
To Become a Data-Driven Enterprise, Data Democratization is EssentialTo Become a Data-Driven Enterprise, Data Democratization is Essential
To Become a Data-Driven Enterprise, Data Democratization is Essential
 
Slow Data Kills Business eBook - Improve the Customer Experience
Slow Data Kills Business eBook - Improve the Customer ExperienceSlow Data Kills Business eBook - Improve the Customer Experience
Slow Data Kills Business eBook - Improve the Customer Experience
 
Bloor Research Comparative costs and uses for data integration platforms
Bloor Research Comparative costs and uses for data integration platformsBloor Research Comparative costs and uses for data integration platforms
Bloor Research Comparative costs and uses for data integration platforms
 
Semantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data LakeSemantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data Lake
 
Big Data is Here for Financial Services White Paper
Big Data is Here for Financial Services White PaperBig Data is Here for Financial Services White Paper
Big Data is Here for Financial Services White Paper
 
The rise of the digital supply network - IIOT Industry40 distribution
The rise of the digital supply network - IIOT Industry40 distribution The rise of the digital supply network - IIOT Industry40 distribution
The rise of the digital supply network - IIOT Industry40 distribution
 
Consumption-based public cloud (CBPC) model
Consumption-based public cloud (CBPC) modelConsumption-based public cloud (CBPC) model
Consumption-based public cloud (CBPC) model
 
On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...On-premises, consumption-based private cloud creates opportunity for enterpri...
On-premises, consumption-based private cloud creates opportunity for enterpri...
 
Sgcp14dunlea
Sgcp14dunleaSgcp14dunlea
Sgcp14dunlea
 
Electronic Commerce
Electronic CommerceElectronic Commerce
Electronic Commerce
 
Using Information Technology to Engage in Electronic Commerce
Using Information Technology to Engage in Electronic CommerceUsing Information Technology to Engage in Electronic Commerce
Using Information Technology to Engage in Electronic Commerce
 
Data Driven Efficiency
Data Driven EfficiencyData Driven Efficiency
Data Driven Efficiency
 
Offshore Projects
Offshore ProjectsOffshore Projects
Offshore Projects
 

More from Ravinder Kamboj (14)

DDBMS
DDBMSDDBMS
DDBMS
 
Cost estimation for Query Optimization
Cost estimation for Query OptimizationCost estimation for Query Optimization
Cost estimation for Query Optimization
 
Query processing and optimization (updated)
Query processing and optimization (updated)Query processing and optimization (updated)
Query processing and optimization (updated)
 
Query processing
Query processingQuery processing
Query processing
 
Normalization of Data Base
Normalization of Data BaseNormalization of Data Base
Normalization of Data Base
 
Architecture of dbms(lecture 3)
Architecture of dbms(lecture 3)Architecture of dbms(lecture 3)
Architecture of dbms(lecture 3)
 
Sql fundamentals
Sql fundamentalsSql fundamentals
Sql fundamentals
 
Lecture 1&2(rdbms-ii)
Lecture 1&2(rdbms-ii)Lecture 1&2(rdbms-ii)
Lecture 1&2(rdbms-ii)
 
Java script
Java scriptJava script
Java script
 
File Management
File ManagementFile Management
File Management
 
HTML Forms
HTML FormsHTML Forms
HTML Forms
 
DHTML
DHTMLDHTML
DHTML
 
CSA lecture-1
CSA lecture-1CSA lecture-1
CSA lecture-1
 
Relational database management system (rdbms) i
Relational database management system (rdbms) iRelational database management system (rdbms) i
Relational database management system (rdbms) i
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 

Recently uploaded (20)

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 

Data warehouse,data mining & Big Data

  • 1. 1
  • 2.  Part 1: Data Warehouses  Part 2: OLAP  Part 3: Data Mining  Part 4: Big Data 2
  • 3. 3
  • 4.  I can’t find the data I need ◦ data is scattered over the network ◦ many versions, subtle differences 4  I can’t get the data I need need an expert to get the data  I can’t understand the data I found available data poorly documented  I can’t use the data I found results are unexpected data needs to be transformed from one form to other
  • 5. A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin] 5
  • 6. 6 Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? What product prom- -otions have the biggest impact on revenue? What is the most effective distribution channel?
  • 7.  Used to manage and control business  Data is historical or point-in-time  Optimized for inquiry rather than update  Used by managers and end-users to understand the business and make judgements 7
  • 8.  Since 1970s, organizations gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective services to the customer.  This resulted in accumulation of growing amounts of data in operational databases. 8
  • 9.  A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993). 9
  • 10.  The warehouse is organized around the major subjects of the enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and product sales).  This is reflected in the need to store decision-support data rather than application-oriented data. 10
  • 11.  The data warehouse integrates corporate application-oriented data from different source systems, which often includes data that is inconsistent.  The integrated data source must be made consistent to present a unified view of the data to the users. 11
  • 12.  Data in the warehouse is only accurate and valid at some point in time or over some time interval.  Time-variance is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots. 12
  • 13.  Data in the warehouse is not updated in real- time but is refreshed from operational systems on a regular basis.  New data is always added as a supplement to the database, rather than a replacement. 13
  • 14.  Potential high returns on investment  Competitive advantage  Increased productivity of corporate decision- makers 14
  • 15. 15
  • 16.  The types of queries that a data warehouse is expected to answer ranges from the relatively simple to the highly complex and is dependent on the type of end-user access tools used.  End-user access tools include: ◦ Reporting, query, and application development tools ◦ Executive information systems (EIS) ◦ OLAP tools ◦ Data mining tools 16
  • 17.  What was the total revenue for Scotland in the third quarter of 2004?  What was the total revenue for property sales for each type of property in Great Britain in 2003?  What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?  What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?  What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?  Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?  What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office? 17
  • 18.  Underestimation of resources for data loading  Hidden problems with source systems  Required data not captured  Increased end-user demands  Data homogenization 18
  • 19.  High demand for resources  Data ownership  High maintenance  Long duration projects  Complexity of integration 19
  • 20. 20
  • 21.  A subset of a data warehouse that supports the requirements of a particular department or business function.  Characteristics include ◦ Focuses on only the requirements of one department or business function. ◦ Do not normally contain detailed operational data unlike data warehouses. ◦ More easily understood and navigated. 21
  • 22.  To give users access to the data they need to analyze most often.  To provide data in a form that matches the collective view of the data by a group of users in a department or business function area.  To improve end-user response time due to the reduction in the volume of data to be accessed. 22
  • 23.  To provide appropriately structured data as dictated by the requirements of the end- user access tools.  Building a data mart is simpler compared with establishing a corporate data warehouse.  The cost of implementing data marts is normally less than that required to establish a data warehouse. 23
  • 24.  The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project. 24
  • 26. 26
  • 27.  Aggregation -- (total sales, percent-to-total)  Comparison -- Budget vs. Expenses  Ranking -- Top 10, quartile analysis  Access to detailed and aggregate data  Complex criteria specification  Visualization  Need interactive response to aggregate queries 27
  • 28.  Accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities.  There are two main types of access tools available to meet this demand, namely Online Analytical Processing (OLAP) and data mining. 28
  • 29.  OLAP and Data Mining differ in what they offer the user and because of this they are complementary technologies.  An environment that includes a data warehouse (or more commonly one or more data marts) together with tools such as OLAP and /or data mining are collectively referred to as Business Intelligence (BI) technologies. 29
  • 30.  The dynamic synthesis, analysis, and consolidation of large volumes of multi- dimensional data, Codd (1993).  Describes a technology that uses a multi- dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis. 30
  • 31.  Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.  Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise. 31
  • 32.  Can easily answer ‘who?’ and ‘what?’ questions, however, ability to answer ‘what if?’ and ‘why?’ type questions distinguishes OLAP from general-purpose query tools.  Types of analysis ranges from basic navigation and browsing (slicing and dicing) to calculations, to more complex analyses such as time series and complex modeling. 32
  • 33. 33
  • 34.  Although OLAP applications are found in widely divergent functional areas, they all have the following key features: ◦ multi-dimensional views of data ◦ support for complex calculations ◦ time intelligence 34
  • 35.  Must provide a range of powerful computational methods such as that required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth. 35
  • 36.  Key feature of almost any analytical application as performance is almost always judged over time.  Time hierarchy is not always used in the same manner as other hierarchies.  Concepts such as year-to-date and period- over-period comparisons should be easily defined. 36
  • 37.  Increased productivity of end-users.  Reduced backlog of applications development for IT staff.  Retention of organizational control over the integrity of corporate data.  Reduced query drag and network traffic on OLTP systems or on the data warehouse.  Improved potential revenue and profitability. 37
  • 38.  Example of two-dimensional query.  What is the total revenue generated by property sales in each city, in each quarter of 2004?’  Choice of representation is based on types of queries end-user may ask.  Compare representation - three-field relational table versus two-dimensional matrix. 38
  • 39. 39
  • 40.  Example of three-dimensional query. ◦ ‘What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 2004?’  Compare representation - four-field relational table versus three-dimensional cube. 40
  • 41. 41
  • 42.  Cube represents data as cells in an array.  Relational table only represents multi- dimensional data in two dimensions. 42
  • 43.  Measure - sales (actual, plan, variance) 43 Month 1 2 3 4 765 Product Toothpaste Juice Cola Milk Cream Soap W S N Dimensions: Product, Region, Time Hierarchical summarization paths Product Region Time Industry Country Year Category Region Quarter Product City Month week Office Day
  • 44.  It is a powerful visualization tool  It provides fast, interactive response times  It is good for analyzing time series  It can be useful to find some clusters and outliners  Many vendors offer OLAP tools 44
  • 45.  Andyne Computing -- Pablo  Arbor Software -- Essbase  Cognos -- PowerPlay  Comshare -- Commander OLAP  Holistic Systems -- Holos  Information Advantage -- AXSYS, WebOLAP  Informix -- Metacube  Microstrategies -- DSS/Agent  Oracle -- Express  Pilot -- LightShip  Planning Sciences -- Gentium  Platinum Technology -- ProdeaBeacon, Forest & Trees  SAS Institute -- SAS/EIS, OLAP++  Speedware -- Media 45
  • 46. 46
  • 47.  The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions, (Simoudis,1996).  Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. 47
  • 48.  Reveals information that is hidden and unexpected, as little value in finding patterns and relationships that are already intuitive.  Patterns and relationships are identified by examining the underlying rules and features in the data. 48
  • 49.  Most accurate results normally require large volumes of data to deliver reliable conclusions.  Starts by developing an optimal representation of structure of sample data 49
  • 50.  Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.  Relatively new technology, however already used in a number of industries. 50
  • 51.  Retail / Marketing ◦ Identifying buying patterns of customers ◦ Finding associations among customer demographic characteristics ◦ Predicting response to mailing campaigns ◦ Market basket analysis 51
  • 52.  Banking ◦ Detecting patterns of fraudulent credit card use ◦ Identifying loyal customers ◦ Predicting customers likely to change their credit card affiliation ◦ Determining credit card spending by customer groups 52
  • 53.  Insurance ◦ Claims analysis ◦ Predicting which customers will buy new policies  Medicine ◦ Characterizing patient behavior to predict surgery visits ◦ Identifying successful medical therapies for different illnesses 53
  • 54.  Four main operations include: ◦ Predictive modeling ◦ Database segmentation ◦ Link analysis ◦ Deviation detection  There are recognized associations between the applications and the corresponding operations. ◦ e.g. Direct marketing strategies use database segmentation. 54
  • 55.  Techniques are specific implementations of the data mining operations.  Each operation has its own strengths and weaknesses. 55
  • 56. 56
  • 57.  Similar to the human learning experience ◦ uses observations to form a model of the important characteristics of some phenomenon.  Uses generalizations of ‘real world’ and ability to fit new data into a general framework.  Can analyze a database to determine essential characteristics (model) about the data set. 57
  • 58.  Model is developed using a supervised learning approach, which has two phases: training and testing. ◦ Training builds a model using a large sample of historical data called a training set. ◦ Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. 58
  • 59.  Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing.  There are two techniques associated with predictive modeling: classification and value prediction, which are distinguished by the nature of the variable being predicted. 59
  • 60. 60
  • 61.  Used to estimate a continuous numeric value that is associated with a database record.  Uses the traditional statistical techniques of linear regression and nonlinear regression.  Relatively easy-to-use and understand. 61
  • 62.  Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.  Problem is that the technique only works well with linear data and is sensitive to the presence of outliers (that is, data values, which do not conform to the expected norm). 62
  • 63.  Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.  Applications of value prediction include credit card fraud detection or target mailing list identification. 63
  • 64.  Aim is to partition a database into an unknown number of segments, or clusters, of similar records.  Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles. 64
  • 65.  Less precise than other operations thus less sensitive to redundant and irrelevant features.  Applications of database segmentation include customer profiling, direct marketing, and cross selling. 65
  • 66. 66
  • 67.  Aims to establish links (associations) between records, or sets of records, in a database.  There are three specializations ◦ Associations discovery ◦ Sequential pattern discovery ◦ Similar time sequence discovery  Applications include product affinity analysis, direct marketing, and stock price movement. 67
  • 68.  Finds items that imply the presence of other items in the same event.  Affinities between items are represented by association rules. ◦ e.g. ‘When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties’. 68
  • 69.  Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. ◦ e.g. Used to understand long term customer buying behavior. 69
  • 70.  Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. ◦ e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines. 70
  • 71.  Relatively new operation in terms of commercially available data mining tools.  Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. 71
  • 72.  Can be performed using statistics and visualization techniques or as a by-product of data mining.  Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing. 72
  • 73. 73
  • 74. What is Big Data? What makes data, “Big” Data? 74
  • 75.  No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… 75
  • 76.  Data Volume ◦ 44x increase from 2009 2020 ◦ From 0.8 zettabytes to 35zb  Data volume is increasing exponentially 76 Exponentialincreasein collected/generateddata
  • 77.  Various formats, types, and structures  Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…  Static data vs. streaming data  A single application can be generating/collecting many types of data 77 To extract knowledge all these types of data need to linked together
  • 78.  Data is begin generated fast and need to be processed fast  Online Data Analytics  Late decisions  missing opportunities  Examples ◦ E-Promotions: Based on your current location, your purchase history, what you like  send promotions right now for store next to you ◦ Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction 78
  • 79. 79
  • 80. 80
  • 81.  OLTP: Online Transaction Processing (DBMSs)  OLAP: Online Analytical Processing (Data Warehousing)  RTAP: Real-Time Analytics Processing (Big Data Architecture & technology) 81
  • 82. Social media andnetworks (all of us aregenerating data) Scientific instruments (collecting all sorts of data) Mobiledevices (tracking all objects all the time) Sensor technology andnetworks (measuring all kinds ofdata)  The progress and innovation is no longer hindered by the ability to collect data  But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion 82
  • 83.  The Model of Generating/Consuming Data has Changed Old Model: Fewcompanies aregenerating data, all others areconsuming data New Model: all of us are generating data, and all of us are consuming data 83
  • 84. - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More of a real-time 84
  • 85.  Big data is more real-time in nature than traditional DW applications  Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps  Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps 85
  • 86.  The Bottleneck is in technology ◦ New architecture, algorithms, techniques are needed  Also in technical skills ◦ Experts in using the new technology and dealing with big data 86
  • 87. What Technology Do We Have For Big Data ?? 87
  • 88. 88
  • 89. 89