
INTRODUCTION
A young, fast-growing, and promising field

INTRODUCTION
• Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process
• Extracts hidden information
• An interdisciplinary subfield of computer science
• The computational process of discovering patterns in large data sets
• Involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems

INTRODUCTION (CONTD.)
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Aside from the raw analysis step, it involves
• database and data management aspects
• data pre-processing
• model and inference considerations
• complexity considerations, post-processing of discovered structures, visualization, and online updating

Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  • E.g., global backbone telecommunication networks carry tens of petabytes every day (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
• Data collection and data availability
  • Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, …

Why Data Mining?
“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

What Motivated Data Mining?
We are drowning in data, but starving for knowledge!

Evolution of Database Technology
Data mining can be viewed as a result of the natural evolution of IT
• 1960s:
  • Data collection, database creation, and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)

Evolution of Database Technology
• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems

What Is Data Mining?
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything “data mining”? Not these:
  • Simple search and query processing
  • (Deductive) expert systems

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines]

Knowledge Discovery (KDD) Process
• Data mining: the core of the knowledge discovery process
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Knowledge Process
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: to combine multiple data sources
3. Data selection: to retrieve the data relevant to the analysis
4. Data transformation: to transform data into a form appropriate for data mining
5. Data mining: an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: visualization and representation techniques

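The seven steps can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the purchase records, the function names, and the 30% frequency threshold are all hypothetical, and "mining" here is reduced to counting item frequencies.

```python
from collections import Counter

# Two hypothetical sources of purchase records: (customer_id, item)
source_a = [(1, "milk"), (2, "bread"), (3, None), (1, "bread")]
source_b = [(4, "Milk "), (2, "milk"), (5, "eggs")]

def clean(records):
    """Step 1: drop records with a missing item and normalize spelling."""
    return [(cid, item.strip().lower()) for cid, item in records if item]

def integrate(*sources):
    """Step 2: combine multiple sources into one collection."""
    return [rec for src in sources for rec in src]

def select(records):
    """Step 3: keep only the field relevant to the analysis (the item)."""
    return [item for _, item in records]

def transform(items):
    """Step 4: turn the item list into counts, a form suited to mining."""
    return Counter(items)

def mine(counts, total):
    """Step 5: compute the relative frequency of each item."""
    return {item: n / total for item, n in counts.items()}

def evaluate(patterns, min_freq):
    """Step 6: keep only patterns that meet the interestingness threshold."""
    return {item: f for item, f in patterns.items() if f >= min_freq}

records = integrate(clean(source_a), clean(source_b))
items = select(records)
patterns = evaluate(mine(transform(items), len(items)), min_freq=0.3)

# Step 7: present the surviving patterns in a readable form.
for item, freq in sorted(patterns.items()):
    print(f"{item}: {freq:.0%}")
```

In a real system each step would be far richer (and several would run inside a data warehouse), but the order of the stages is the same.
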
Example: A Web Mining Framework
• Web mining usually involves
  • Data cleaning
  • Data integration from multiple sources
  • Warehousing the data
  • Data cube construction
  • Data selection for data mining
  • Data mining
  • Presentation of the mining results
  • Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical role in parentheses:]
• Decision making (End User)
• Data presentation: visualization techniques (Business Analyst)
• Data mining: information discovery (Data Analyst)
• Data exploration: statistical summary, querying, and reporting (Data Analyst)
• Data preprocessing/integration, data warehouses (DBA)
• Data sources: paper, files, Web documents, scientific experiments, database systems (DBA)

KDD Process: A Typical View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities

Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structured data, graphs, social networks, and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data
  • Multimedia databases
  • Text databases
  • The World-Wide Web

RDBMS
• A database that has a collection of tables of data items, all of which are formally described and organized according to the relational model.
• Data in a single table represents a relation.
• Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.
• A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.

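Primary and foreign keys can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the branch and employee columns and the sample rows are assumptions, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE branch (branch_id INTEGER PRIMARY KEY, city TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,                            -- primary key
        name      TEXT NOT NULL,
        branch_id INTEGER NOT NULL REFERENCES branch(branch_id)   -- foreign key
    )
""")
conn.execute("INSERT INTO branch VALUES (1, 'Chicago')")
conn.execute("INSERT INTO employee VALUES (10, 'Ann', 1)")

# A row pointing at a branch that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Bob', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# A second row with the same primary key value violates uniqueness.
try:
    conn.execute("INSERT INTO branch VALUES (1, 'Toronto')")
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

print(fk_rejected, pk_rejected)
```

Both inserts are rejected: the first because no branch 99 exists, the second because branch 1 is already taken.
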
RDBMS
• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
• The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.
• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.
• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
• Relational databases have become the predominant choice for storing data.

Relational database terminology
A relation is defined as a set of tuples that have the same attributes.

RDBMS (contd.)
• Example: Allelectronics (a company described by the relation tables customer, item, employee, and branch)
• Relation: customer is a group of entities describing the customer information (Cust_id, cust_name, age, occupation, annual income, credit information, and category)
• Tables can also be used to represent the relationships between or among multiple entities
• Database queries (SQL): used to access data with relational operations such as join, selection, and projection

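The three relational operations named above can be illustrated with SQL run through Python's sqlite3 module. The customer table is reduced to three columns, and the purchase table and all rows are hypothetical, introduced only to have something to join against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, age INTEGER);
    CREATE TABLE purchase (cust_id INTEGER REFERENCES customer(cust_id), item TEXT);
    INSERT INTO customer VALUES (1, 'Ann', 34), (2, 'Bob', 22), (3, 'Cy', 45);
    INSERT INTO purchase VALUES (1, 'laptop'), (1, 'printer'), (3, 'camera');
""")

# Selection: keep only the rows that satisfy a condition.
selection = conn.execute(
    "SELECT * FROM customer WHERE age > 30 ORDER BY cust_id"
).fetchall()

# Projection: keep only some of the columns.
projection = conn.execute(
    "SELECT cust_name FROM customer ORDER BY cust_id"
).fetchall()

# Join: combine rows of two tables that agree on cust_id.
joined = conn.execute("""
    SELECT c.cust_name, p.item
    FROM customer AS c JOIN purchase AS p ON c.cust_id = p.cust_id
    ORDER BY c.cust_id, p.item
""").fetchall()

print(selection)
print(projection)
print(joined)
```

Note that Bob does not appear in the join result, because an inner join keeps only customers with at least one matching purchase row.
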
Mining Relational databases
• Can go further by searching for trends or data patterns
• Examples
  • Analyze customer data to predict the risk of customers based on their income and age
  • Detect deviations: compare sales with the previous year
• RDBMSs are one of the most commonly available and richest information repositories for data mining

What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  • A decision support database that is maintained separately from the organization’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
  • The process of constructing and using data warehouses

DATA WAREHOUSES
• A repository of information collected from multiple sources, stored under a unified schema.
• Constructed via
  • Data cleaning
  • Data integration
  • Data transformation
  • Data loading and periodic data refreshing

DATA WAREHOUSES (contd.)
• A data warehouse is modeled by a multidimensional data structure
• Data cube: precomputation and fast access of summarized data
  • Each dimension corresponds to an attribute or a set of attributes in a schema
  • Each cell stores the value of some aggregate measure (count, sum, etc.)
• Example:
  • In Allelectronics the cube has three dimensions:
    • Address (with country values USA, Canada, Mexico)
    • Time (with quarter values Q1, Q2, Q3, Q4)
    • Item (with item type values)

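A minimal sketch of the precomputation idea follows, assuming a handful of made-up sales facts over the three dimensions above. The `ALL` marker and the `build_cube` helper are hypothetical names; real warehouses use far more selective materialization strategies.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sales facts: (country, quarter, item_type, amount)
sales = [
    ("USA", "Q1", "TV", 100), ("USA", "Q2", "TV", 150),
    ("USA", "Q1", "PC", 200), ("Canada", "Q1", "TV", 80),
    ("Mexico", "Q3", "VCR", 40), ("Canada", "Q4", "PC", 120),
]

ALL = "*"  # marker meaning "summed over every value of this dimension"

def build_cube(facts):
    """Precompute sum(amount) for every mix of concrete value or ALL per dimension."""
    cube = defaultdict(int)
    for country, quarter, item, amount in facts:
        # Each fact contributes to 2 * 2 * 2 = 8 cells of the cube.
        for cell in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[cell] += amount
    return cube

cube = build_cube(sales)

# After precomputation, every summary is a single dictionary lookup.
tv_usa_year = cube[("USA", ALL, "TV")]   # total annual TV sales in the USA
q1_total = cube[(ALL, "Q1", ALL)]        # all sales in Q1
grand_total = cube[(ALL, ALL, ALL)]      # sum over the whole cube
print(tv_usa_year, q1_total, grand_total)
```

The trade-off is the usual one for data cubes: more storage and build time up front in exchange for fast access to summarized data.
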
Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths
  • Product: Industry → Category → Product
  • Location: Region → Country → City → Office
  • Time: Year → Quarter → Month or Week → Day

A Sample Data Cube
[Figure: sample data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum); one highlighted cell holds the total annual sales of TVs in the U.S.A.]

Data mining functionalities
• Tasks can be classified as:
  • Predictive (make predictions about values of data using known results found from different data)
  • Descriptive (characterize properties of a target data set)
    • Explore the properties of the data examined
• Data mining functionalities are used to specify the kinds of patterns to be found:
  • Characterization and discrimination
  • The mining of frequent patterns, associations, and correlations
  • Classification and regression
  • Cluster analysis
  • Outlier analysis

Characterization and Discrimination
• Data characterization is a summarization of the general characteristics or features of a target class of data
• The output of characterization can be presented in various forms
  • Pie charts
  • Bar charts
  • Curves
  • Multidimensional data cubes
  • Multidimensional tables
• Descriptions can also be presented as generalized relations or in rule form (characteristic rules)
• Example: In Allelectronics, summarize the characteristics of customers who spend more than $5000 a year at Allelectronics. The result can be viewed in any dimension, such as occupation, to view these customers according to their type of employment.

Data Discrimination
• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes
• The output representation is similar to that of characterization descriptions
• Discrimination descriptions expressed in the form of rules are called discrimination rules
• The target and contrasting classes are specified by the user
• Example:
  • A user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period

Mining Frequent Patterns, Associations, Correlations
• Frequent patterns
  • Frequent itemsets (milk, bread)
  • Frequent subsequences (laptop, digital camera, memory card)
  • Frequent substructures (graphs, trees)
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

Association analysis(example)
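• Items frequently purchased together:
  buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
• X is a variable representing a customer
• A confidence (certainty) of 50% means that a customer who buys a computer has a 50% chance of buying software as well
• A support of 1% means that 1% of all transactions under analysis show computer and software purchased together
• This is a single-dimensional association rule, which can be abbreviated as "computer => software [1%, 50%]"
• A multidimensional association rule:
  age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%]
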
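Support and confidence can be computed directly from their usual definitions: support(A => B) is the fraction of transactions that contain both A and B, and confidence(A => B) is support(A and B) divided by support(A). The sketch below uses five made-up transactions, so its numbers differ from the 1% and 50% quoted on the slide.

```python
# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support={rule_support:.0%}, "
      f"confidence={rule_confidence:.0%}]")
```

Here two of five transactions contain both items, and two of the three computer purchases also include software.
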
Classification and Regression for Predictive Analysis
• Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts
• The model is derived from the analysis of a set of training data
• Models can be represented as
  • Classification rules (IF-THEN rules)
  • Decision trees
  • Mathematical formulae or neural networks
• Regression: a statistical methodology for numeric prediction

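The two model types can be contrasted in a few lines. Both parts are sketches on invented data: the IF-THEN rule is hand-written rather than learned from training data, and the regression is a plain least-squares line fit; the attribute names and values are assumptions.

```python
def classify(age, income):
    """A hand-written IF-THEN classification rule (hypothetical, not learned)."""
    # IF age < 30 AND income is "high" THEN the customer buys a computer.
    if age < 30 and income == "high":
        return "buys_computer"
    return "does_not_buy"

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data for numeric prediction: age versus yearly spend (made up).
ages = [20, 30, 40, 50]
spend = [100, 150, 200, 250]
slope, intercept = fit_line(ages, spend)

label = classify(25, "high")                # classification yields a class label
predicted_spend = slope * 60 + intercept    # regression yields a number
print(label, predicted_spend)
```

The difference in output type is the point: classification predicts a categorical label, while regression predicts a numeric value.
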
Cluster Analysis and Outlier Analysis
• Cluster analysis:
  • Determines the similarity among data on predefined attributes
  • The most similar data are grouped into clusters
• Outlier analysis:
  • Outliers: data objects that do not comply with the general behavior or model of the data
  • The analysis of outlier data is referred to as outlier analysis or anomaly mining
  • Outliers can be detected using statistical tests

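One simple statistical test is the z-score: flag a value when it lies more than a chosen number of standard deviations from the mean. The sketch below assumes roughly bell-shaped data; the purchase amounts and the threshold of 2.0 are illustrative choices, not taken from the slides.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold sigmas."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical purchase amounts with one suspicious entry.
purchases = [52, 48, 50, 51, 49, 47, 53, 50, 500]
outliers = zscore_outliers(purchases)
print(outliers)
```

A caveat on the design: in a small sample the outlier itself inflates the mean and the standard deviation, which bounds the z-score any single point can reach, so a stricter threshold such as 3.0 would miss the value 500 here. More robust tests, for example those based on the median, avoid this effect.
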
Which Technologies Are Used?
[Figure: data mining draws on machine learning, pattern recognition, statistics, visualization, algorithms, database technology, high-performance computing, and applications]

Potential Applications of Data Mining
Where there are data, there are data mining applications
• Data analysis and decision support (Business Intelligence)
  • Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
• Other applications
  • Text mining (news groups, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis

Major Issues in Data Mining (1)
• Mining methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multi-dimensional space
  • Data mining: an interdisciplinary effort
  • Boosting the power of discovery in a networked environment
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
  • Interactive mining
  • Incorporation of background knowledge
  • Presentation and visualization of data mining results

Major Issues in Data Mining (2)
• Efficiency and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
  • Handling complex types of data
  • Mining dynamic, networked, and global data repositories
• Data mining and society
  • Social impacts of data mining
  • Privacy-preserving data mining
  • Invisible data mining

Introduction to DataMining

  • 2. 2 A young, fast growing and promising field
  • 3. INTRODUCTION 3      Data mining (the analysis step of the "Knowledge Discovery and Data Mining" process, or KDD) Extracting hidden information An interdisciplinary subfield of computer science The computational process of discovering patterns in large data sets Involving methods at the intersection of Artificial intelligence, Machine learning, Statistics, and Database systems.
  • 4. INTORODUCTION(CONTD..) 4 The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves • database and data management aspects •    • data pre-processing model inference considerations complexity considerations, post-processing of discovered structures, visualization, and online updating.
  • 5. Why Data Mining? 5  The Explosive Growth of Data: from terabytes to petabytes  Eg: Global backbone telecommunication network carry tens of petabytes everyday (1024 Gigabytes = 1 Terabyte)( 1024 Terabytes = 1 Petabyte)  Data collection and data availability  Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras,…
  • 6. Why Data Mining? 6 “Necessity is the mother of invention” - Data mining—Automated analysis of massive data sets
  • 7. What Motivated Data Mining? 7  We are drowning in data, but starving for knowledge!
  • 8. Evolution of Database Technology 8 Data mining can be viewed as a result of natural evolution of IT  1960s:   1970s:   Data collection, database creation and network DBMS Relational data model, relational DBMS implementation 1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.)
  • 9. Evolution of Database Technology 9  1990s:   Data mining, data warehousing, multimedia databases, and Web databases 2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems
  • 10. 10
  • 11. What Is Data Mining? 11  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data  Alternative names   Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: Is everything “data mining”?  Simple search and query processing  (Deductive) expert systems
  • 12. Data Mining: Confluence of Multiple Disciplines 12 Database Technology Machine Learning Pattern Recognition Statistics Data Mining Algorithm Visualization Other Disciplines
  • 13. Knowledge Discovery (KDD) Process 13  Data mining—core of knowledge discovery process Pattern Evaluation Data Mining Task-relevant Data Data Warehouse Data Cleaning Data Integration Databases Selection
  • 14. Knowledge Process 14 1. 2. 3. 4. 5. 6. 7. Data cleaning – to remove noise and inconsistent data Data integration – to combine multiple source Data selection – to retrieve relevant data for analysis Data transformation – to transform data into appropriate form for data mining Data mining- An essential process where intelligent methods are applied to extract data patterns Pattern Evaluation-Identify truly interesting patterns representing knowledge based on interestingness measure Knowledge presentation-visualization and representation techniques
  • 15. Example: A Web Mining Framework. Web mining usually involves: data cleaning; data integration from multiple sources; warehousing the data; data cube construction; data selection for data mining; data mining; presentation of the mining results; and patterns and knowledge to be used or stored in a knowledge base.
  • 16. Data Mining in Business Intelligence. [Pyramid diagram, bottom to top, with increasing potential to support business decisions: Data Sources (paper, files, Web documents, scientific experiments, database systems); Data Preprocessing/Integration, Data Warehouses; Data Exploration (statistical summary, querying, and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Decision Making. Roles shown alongside the layers, top to bottom: end user, business analyst, data analyst, DBA.]
  • 17. KDD Process: A Typical View from ML and Statistics. Input Data → Data Pre-Processing (data integration, normalization, feature selection, dimension reduction) → Data Mining (pattern discovery, association and correlation, classification, clustering, outlier analysis, ...) → Post-Processing (pattern evaluation, pattern selection, pattern interpretation, pattern visualization). This is a view from typical machine learning and statistics communities.
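As an illustration of the normalization step in the pre-processing stage, the sketch below shows two common rescaling schemes, min-max scaling and z-score standardization. The slide does not say which scheme is meant, so both are assumptions, and the income values are made up.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical annual incomes.
incomes = [20000, 40000, 60000, 100000]
scaled = min_max(incomes)
standardized = z_score(incomes)
print(scaled)
```

Min-max scaling keeps the original ordering and bounds, while z-scores are more convenient when attributes with different units need to be compared.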
  • 18. Data Mining: On What Kinds of Data? Database-oriented data sets and applications: relational databases, data warehouses, transactional databases. Advanced data sets and advanced applications: data streams and sensor data; time-series data, temporal data, sequence data (incl. bio-sequences); structured data, graphs, social networks, and multi-linked data; object-relational databases; heterogeneous databases and legacy databases; spatial data; multimedia databases; text databases; the World-Wide Web.
  • 19. RDBMS. A relational database is a collection of tables of data items, all of which are formally described and organized according to the relational model. Data in a single table represents a relation. Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row. A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.
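The primary key and foreign key ideas can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the branch and employee tables, their columns, and the sample rows are hypothetical, and SQLite is just one convenient RDBMS to demonstrate with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# branch_id is the primary key of branch; employee.branch_id is a
# foreign key pointing at it.
conn.execute("CREATE TABLE branch (branch_id INTEGER PRIMARY KEY, city TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " branch_id INTEGER REFERENCES branch(branch_id))"
)

conn.execute("INSERT INTO branch VALUES (1, 'Chicago')")
conn.execute("INSERT INTO employee VALUES (10, 'Ann', 1)")  # branch 1 exists

# A foreign key that points at a missing branch is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Bob', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# A duplicate primary key is rejected as well.
try:
    conn.execute("INSERT INTO branch VALUES (1, 'Toronto')")
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

print(fk_rejected, pk_rejected)
```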
  • 20. RDBMS (contd.). Database normalization: the relational model offers various levels of refinement of table organization and reorganization. The DBMS that manages a relational database is called an RDBMS. The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory. Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules. The relational database has become the predominant choice for storing data.
  • 21. Relational database terminology. A relation is defined as a set of tuples that have the same attributes.
  • 22. RDBMS (contd.). Example: AllElectronics, a company described by the relation tables customer, item, employee, and branch. Relation: customer is a group of entities describing the customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category). Tables are also used to represent the relationships between or among multiple entities. Database queries (SQL): used to access data through relational operations such as join, selection, and projection.
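The three relational operations named above (selection, projection, join) map directly onto SQL. The sketch below runs them through Python's sqlite3 module. The customer columns follow the slide, but the purchase table and all sample rows are hypothetical additions made only to have something to join against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, age INTEGER);
CREATE TABLE purchase (cust_id INTEGER, item TEXT, amount REAL);
INSERT INTO customer VALUES (1, 'Ann', 25), (2, 'Bob', 47);
INSERT INTO purchase VALUES (1, 'laptop', 900.0), (2, 'camera', 300.0), (1, 'software', 100.0);
""")

# Selection: keep only the rows that satisfy a condition.
young = conn.execute("SELECT * FROM customer WHERE age < 30").fetchall()

# Projection: keep only some of the columns.
names = conn.execute(
    "SELECT cust_name FROM customer ORDER BY cust_id"
).fetchall()

# Join: combine rows from two tables on a shared key, then aggregate.
totals = conn.execute(
    "SELECT c.cust_name, SUM(p.amount) "
    "FROM customer c JOIN purchase p ON c.cust_id = p.cust_id "
    "GROUP BY c.cust_id, c.cust_name "
    "ORDER BY c.cust_id"
).fetchall()

print(young, names, totals)
```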
  • 23. Mining Relational Databases. Data mining can go further than queries by searching for trends or data patterns. Examples: analyze customer data to predict the risk of customers based on their income and age; detect deviations, such as sales that differ from the previous year. Relational databases are one of the most commonly available and richest information repositories for data mining.
  • 24. What Is a Data Warehouse? Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization's operational database. Supports information processing by providing a solid platform of consolidated, historical data for analysis. "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon) Data warehousing: the process of constructing and using data warehouses.
  • 25. Data Warehouses. A data warehouse is a repository of information collected from multiple sources and stored under a unified schema. It is constructed via data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
  • 27. Data Warehouses (contd.). A data warehouse is modeled by a multidimensional data structure. Data cube: precomputation and fast access of summarized data. Each dimension corresponds to an attribute or a set of attributes in a schema. Each cell stores the value of some aggregate measure (count, sum, etc.). Example: in AllElectronics the cube has three dimensions: address (with country values U.S.A., Canada, Mexico), time (with quarter values Q1, Q2, Q3, Q4), and item (with type values).
  • 28. Multidimensional Data. Sales volume as a function of product, month, and region. Dimensions: Product, Location, Time. Hierarchical summarization paths: Industry → Category → Product; Region → Country → City → Office; Year → Quarter → Month or Week → Day.
  • 29. A Sample Data Cube. [Figure: a cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum); one highlighted cell holds the total annual sales of TVs in the U.S.A.]
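A data cube of this kind can be precomputed by summing each fact into every combination of kept and summed-out dimensions. The sketch below is a minimal illustration, not how a real warehouse stores cubes: the sales figures are made up, and "*" is an assumed marker for a dimension that has been summed out (the "sum" face of the cube).

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (product, quarter, country, sales).
facts = [
    ("TV", "Q1", "U.S.A.", 100),
    ("TV", "Q2", "U.S.A.", 150),
    ("TV", "Q1", "Canada", 40),
    ("PC", "Q1", "U.S.A.", 200),
    ("PC", "Q2", "Mexico", 60),
]

ALL = "*"  # assumed marker for a dimension that has been summed out

def build_cube(rows):
    """Precompute the sum of sales for every cell of the 3-D cube."""
    cube = defaultdict(int)
    for prod, quarter, country, sales in rows:
        # Each fact contributes to 2**3 cells: every way of keeping
        # or summing out each of the three dimensions.
        for keep in product((True, False), repeat=3):
            key = tuple(
                value if kept else ALL
                for value, kept in zip((prod, quarter, country), keep)
            )
            cube[key] += sales
    return dict(cube)

cube = build_cube(facts)
# Total annual sales of TVs in the U.S.A. (time dimension summed out).
print(cube[("TV", ALL, "U.S.A.")])
```

Once the cube is built, any summarized view is a single dictionary lookup, which is the "fast access of summarized data" the cube is meant to provide.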
  • 30. Data Mining Functionalities. Tasks can be classified as predictive (making predictions about values of data using known results found from different data) or descriptive (characterizing the properties of a target data set, that is, exploring the properties of the data examined). Data mining functionalities are used to specify the kinds of patterns to look for: characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; outlier analysis.
  • 31. Characterization and Discrimination. Data characterization is a summarization of the general characteristics or features of a target class of data. The output of characterization can be presented in various forms: pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables. Descriptions can also be presented as generalized relations or as characteristic rules. Example: in AllElectronics, summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result can be viewed along any dimension, such as occupation, to see these customers according to their type of employment.
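The AllElectronics example can be sketched in a few lines: pick the target class (customers spending more than $5000 a year), then summarize it along one dimension. The customer records below are invented for illustration, and the tuple layout is an assumption.

```python
from collections import Counter

# Hypothetical customer records: (cust_id, occupation, yearly_spend in dollars).
customers = [
    (1, "engineer", 7200),
    (2, "teacher", 1500),
    (3, "engineer", 5600),
    (4, "lawyer", 9100),
    (5, "teacher", 5200),
]

# Target class: customers who spend more than $5000 a year.
target = [c for c in customers if c[2] > 5000]

# Characterize the target class along the occupation dimension.
by_occupation = Counter(occupation for _, occupation, _ in target)

# A second summary feature: average yearly spend within the target class.
avg_spend = sum(spend for _, _, spend in target) / len(target)

print(by_occupation.most_common(), avg_spend)
```

The counts per occupation are exactly what a bar or pie chart of the characterization would display.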
  • 32. Data Discrimination. Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes. The output is represented in forms similar to those of characterization descriptions. Discrimination descriptions expressed in the form of rules are called discrimination rules. The target and contrasting classes are specified by the user. Example: a user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period.
  • 33. Mining Frequent Patterns, Associations, Correlations. Frequent patterns include frequent itemsets (milk, bread), frequent subsequences (laptop, digital camera, memory card), and frequent substructures (graphs, trees). Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
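Frequent itemsets can be found on toy data by simply counting every small itemset in every transaction. The sketch below is a brute-force illustration, not Apriori or any other named algorithm from the slides; the transactions, the function name, and the 60% minimum support are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support (fraction of transactions containing them) is at
    least min_support. Brute force: fine for toy data only."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            # Sorting makes each itemset a canonical tuple.
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(transactions, min_support=0.6)
print(frequent)
```

Here bread, milk, and the pair (bread, milk) meet the threshold, matching the "milk, bread" itemset used as the slide's example.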
  • 34. Association Analysis (Example). Items frequently purchased together: buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%], where X is a variable representing a customer. A confidence, or certainty, of 50% means that a customer who buys a computer has a 50% chance of buying software as well; a support of 1% means that 1% of all transactions under analysis contain both computer and software. This is a single-dimensional association rule and can be written as "computer => software [1%, 50%]". A multidimensional association rule: age(X, "20..29") ^ income(X, "40K..49K") => buys(X, "laptop") [support = 2%, confidence = 60%].
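Support and confidence for a rule such as computer => software can be computed directly from transaction counts. This is a minimal sketch: the five transactions are invented, so the resulting numbers differ from the 1% and 50% quoted on the slide, and the function names are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent together."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of the transactions containing the antecedent that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
]

rule_support = support(transactions, {"computer"}, {"software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

In this toy data, 2 of 5 transactions contain both items (support 40%), and 2 of the 4 computer purchases include software (confidence 50%).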
  • 35. Classification and Regression for Predictive Analysis. Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts. The model is derived from the analysis of a set of training data. Models can be represented as classification rules (IF-THEN rules), decision trees, mathematical formulae, or neural networks. Regression: a statistical methodology for numeric prediction.
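To show how a model is derived from training data and then read as an IF-THEN rule, the sketch below fits a one-level decision tree (a decision stump) on a single numeric attribute. The stump is a stand-in chosen for brevity, not a method named on the slide; the training data, labels, and function names are hypothetical, and the code assumes exactly two classes and at least two distinct attribute values.

```python
# Hypothetical training data: (annual income in thousands, class label).
training = [
    (20, "high_risk"), (25, "high_risk"), (32, "high_risk"),
    (45, "low_risk"), (60, "low_risk"), (80, "low_risk"),
]

def fit_stump(samples):
    """Find the threshold t and label order minimizing training errors for
    the rule: IF value <= t THEN below ELSE above."""
    labels = sorted({label for _, label in samples})
    values = sorted({v for v, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        for below, above in ((labels[0], labels[1]), (labels[1], labels[0])):
            errors = sum((below if v <= t else above) != y for v, y in samples)
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    return best[1:]  # (threshold, label_below, label_above)

def predict(model, value):
    """Apply the learned IF-THEN rule to a new value."""
    threshold, below, above = model
    return below if value <= threshold else above

model = fit_stump(training)
print(f"IF income <= {model[0]} THEN {model[1]} ELSE {model[2]}")
print(predict(model, 30), predict(model, 50))
```

A full decision tree repeats this split search recursively on each resulting subset, usually with a purity measure rather than a raw error count.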
  • 36. Cluster Analysis and Outlier Analysis. Cluster analysis: determine the similarity among data on predefined attributes; the most similar data are grouped into clusters. Outlier analysis: a data set may contain objects, called outliers, that do not comply with the general behavior or model of the data. The analysis of outlier data is referred to as outlier analysis or anomaly mining. Outliers can be detected using statistical tests.
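Both ideas can be sketched on one-dimensional data. The slide names no specific algorithm, so the choices here are assumptions: k-means stands in for cluster analysis, and a simple z-score threshold stands in for a statistical outlier test. The data, starting centers, and threshold are made up.

```python
from statistics import mean, pstdev

def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the nearest center, then move each center
    to the mean of its group; repeat. Requires iterations >= 1."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

def zscore_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical data.
centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
outliers = zscore_outliers([10, 12, 11, 13, 12, 11, 50])
print(centers, clusters, outliers)
```

The z-score test assumes roughly bell-shaped data, and a single extreme value also inflates the standard deviation, so robust alternatives are often preferred in practice.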
  • 37. Which Technologies Are Used? [Diagram: data mining at the center, drawing on machine learning, applications, algorithms, pattern recognition, statistics, visualization, database technology, and high-performance computing.]
  • 38. Potential Applications of Data Mining. Where there are data, there are data mining applications. Data analysis and decision support (business intelligence): market analysis and management (target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation); risk analysis and management (forecasting, customer retention, improved underwriting, quality control, competitive analysis); fraud detection and detection of unusual patterns (outliers). Other applications: text mining (newsgroups, email, documents) and Web mining; stream data mining; bioinformatics and bio-data analysis.
  • 39. Major Issues in Data Mining (1). Mining methodology: mining various and new kinds of knowledge; mining knowledge in multi-dimensional space; data mining as an interdisciplinary effort; boosting the power of discovery in a networked environment; handling noise, uncertainty, and incompleteness of data; pattern evaluation and pattern- or constraint-guided mining. User interaction: interactive mining; incorporation of background knowledge; presentation and visualization of data mining results.
  • 40. Major Issues in Data Mining (2). Efficiency and scalability: efficiency and scalability of data mining algorithms; parallel, distributed, stream, and incremental mining methods. Diversity of data types: handling complex types of data; mining dynamic, networked, and global data repositories. Data mining and society: social impacts of data mining; privacy-preserving data mining; invisible data mining.