SlideShare a Scribd company logo
1 of 30
Download to read offline
Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NACC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S.A.Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Data Mining and Warehousing (CO314)
Unit –I: Introduction to Data Mining
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 2
Content
 Data Mining, Kinds of pattern and technologies
 Data Mining Task Primitives
 Issues in mining
 KDD vs data mining
 OLAP, knowledge
 representation, Information and Knowledge;
 Attribute Types: Nominal, Binary, Ordinal and Numeric attributes,
Discrete versus Continuous attributes.
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 3
Why Data Mining
 “We are living in the information age” is popular saying, but actually living in the data age.
 The Explosive Growth of Data: from terabytes to petabytes pour into computer networks by
www, various data storage devices everyday from business, society, science and engineering,
medicine and almost every other aspect of daily life.
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 4
How Data Mining turns large collection of data into knowledge
 A search engine e.g.,Google receives hundreds of millions of queries every day can be viewed as a
transaction where the user describes her or his information need.
 What novel and useful knowledge can a search engine learn from such a huge collection of queries
collected from users over time?
 Interestingly, some patterns found in user search queries can disclose invaluable knowledge that
cannot be obtained by reading individual data items alone. For example, Google’s Flu Trends uses
specific search terms as indicators of flu activity.
 It found a close relationship between the number of people who search for flu-related information
and the number of people who actually have flu symptoms.
 A pattern emerges when all of the search queries related to flu are aggregated.
 Using aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster
than traditional systems can.
 This example shows how data mining can turn a large collection of data into knowledge that can
help meet a current global challenge.
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 5
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
 Each discipline has grown a theoretical component. Theoretical models often motivate experiments and
generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical,
theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to find closed-form
solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly
with data volumes. Data mining is a major new challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11):
50-54, Nov. 2002
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 6
The world is data rich but information poor
 The abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data rich but information poor situation
 The data collected in large data repositories become “data tombs”
 Consequently, important decisions are often made based not on the
information-rich data stored in data repositories but rather on a decision
maker’s intuition, simply because the decision maker does not have the
tools to extract the valuable knowledge embedded in the vast amounts of
data.
 Efforts have been made to develop expert system and knowledge-based
technologies, which typically rely on users or domain experts to manually
input knowledge into knowledge bases.
 Unfortunately, however, the manual knowledge input procedure is prone to
biases and
 errors and is extremely costly and time consuming.
 The widening gap between data and information calls for the
systematic development of data mining tools that can turn data
tombs into “golden nuggets” of knowledge.
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 7
What is Data Mining
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amount of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 8
Data mining—searching for knowledge (interesting patterns) in data.
 Many other terms have a similar
meaning to data mining—for example,
knowledge mining from data, knowledge
extraction, data/pattern analysis, data
archaeology, and data dredging.
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 9
Knowledge Discovery (KDD) Process
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 10
Data mining as a step in the process of knowledge discovery
1. Data cleaning : to remove noise and inconsistent data
2. Data integration : where multiple data sources may be combined
3. Data selection : where data relevant to the analysis task are retrieved
from the database
4. Data transformation : where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations
5. Data mining : an essential process where intelligent methods are applied
to extract data patterns
6. Pattern evaluation : to identify the truly interesting patterns representing
knowledge based on interestingness measures.
7. Knowledge presentation : where visualization and knowledge
representation techniques are used to present mined knowledge to users
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 11
Example: A Web Mining Framework
 Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into knowledge-base
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 12
Data Mining Business Intelligence (BI)
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 13
What is data?
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 14
Attributes types in Data Mining
 The attribute is the property of the object. The attribute represents
different features of the object. Example:
 In this example, Roll No, Name, and Result are attributes of the
object student.
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 15
Attributes types in Data Mining
 Binary
 Nominal
 Numeric
 Interval-scaled: e.g. calendar dates, temperature in Celsius or
Fahrenheit
 Ratio-scaled: these are like free numbers they can be anything e.g.
length, time, counts
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 16
Attributes types in data mining conti…
 Binary:
Binary data have only two values/states.
Binary attributes types:
 Symmetric
 Asymmetric
Symmetric
Both values are equally important
Asymmetric
Both values are not equally important
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 17
Attributes types in data mining conti…
 Ordinal:
All Values have a meaningful order.
e.g. rankings (scalingfrom1to10), grades, height
(tall ,medium, short)
 Discrete:
Discrete data have finite value. It can be in
numerical form and can also be in categorical form.
e.g. number of birds in a flock; the number of
heads realized when a coin is flipped 10 times
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 18
Attributes types in data mining conti…
 Continuous:
Continuous data technically have an infinite
number of steps.
Continuous data is in float type.
There can be many numbers in between 1 and 2.
e.g. weights and heights of birds, temperature of a
day
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 19
Attributes types summary
20
Data Mining Function: (1) Generalization
 Information integration and data warehouse construction
 Data cleaning, transformation, integration, and
multidimensional data model
 Data cube technology
 Scalable methods for computing (i.e., materializing)
multidimensional aggregates
 OLAP (online analytical processing)
 Multidimensional concept description: Characterization
and discrimination
 Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet region
21
Data Mining Function: (2) Association and Correlation Analysis
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your
Walmart?
 Association, correlation vs. causality
 A typical association rule
Diaper  Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large
datasets?
 How to use such patterns for classification, clustering,
and other applications?
22
Data Mining Function: (3) Classification
 Classification and label prediction
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic
regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars, diseases,
web-pages, …
23
Data Mining Function: (4) Cluster Analysis
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing
interclass similarity
 Many methods and applications
24
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general
behavior of the data
 Noise or exception? ― One person’s garbage could be
another person’s treasure
 Methods: by product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
25
Data Mining: Confluence of Multiple Disciplines
26
Which Technologies Are Used?
27
Typical framework of a data warehouse for AllElectronics
28
Data Cubes
 Grouping of data in a multidimensional
matrix is called data cubes.
 In dataware housing, we generally
deal with various multidimensional
data models as the data will be
represented by multiple dimensions
and multiple attributes.
 This multidimensional data is
represented in the data cube as the
cube represents a high-dimensional
space.
 The Data cube pictorially shows how
different attributes of data are
arranged in the data model.
29
Typical framework of a data warehouse for AllElectronics
 A Figure multidimensional data
cube, commonly used for data
warehousing,
 (a) showing summarized data for
AllElectronics and
 (b) showing summarized data
resulting fromdrill-down and roll-
up operations on the cube in
 (c). For improved readability,
only some of the cube cell values
are shown.
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 30
Reference
 Han, Jiawei Kamber, Micheline Pei and Jian, “Data Mining: Concepts and
Techniques”,Elsevier Publishers, ISBN:9780123814791, 9780123814807.
 https://onlinecourses.nptel.ac.in/noc24_cs22

More Related Content

Similar to Introduction to Data Mining KDD Process OLAP

The trend analysis of the level of fin metrics and e-stat tools for research ...
The trend analysis of the level of fin metrics and e-stat tools for research ...The trend analysis of the level of fin metrics and e-stat tools for research ...
The trend analysis of the level of fin metrics and e-stat tools for research ...
Alexander Decker
 
11.the trend analysis of the level of fin metrics and e-stat tools for resear...
11.the trend analysis of the level of fin metrics and e-stat tools for resear...11.the trend analysis of the level of fin metrics and e-stat tools for resear...
11.the trend analysis of the level of fin metrics and e-stat tools for resear...
Alexander Decker
 
Ed Fox on Learning Technologies
Ed Fox on Learning TechnologiesEd Fox on Learning Technologies
Ed Fox on Learning Technologies
Gardner Campbell
 
accelerating-data-driven
accelerating-data-drivenaccelerating-data-driven
accelerating-data-driven
Joshua Chudy
 

Similar to Introduction to Data Mining KDD Process OLAP (20)

Data Science Demystified_ Journeying Through Insights and Innovations
Data Science Demystified_ Journeying Through Insights and InnovationsData Science Demystified_ Journeying Through Insights and Innovations
Data Science Demystified_ Journeying Through Insights and Innovations
 
The trend analysis of the level of fin metrics and e-stat tools for research ...
The trend analysis of the level of fin metrics and e-stat tools for research ...The trend analysis of the level of fin metrics and e-stat tools for research ...
The trend analysis of the level of fin metrics and e-stat tools for research ...
 
11.the trend analysis of the level of fin metrics and e-stat tools for resear...
11.the trend analysis of the level of fin metrics and e-stat tools for resear...11.the trend analysis of the level of fin metrics and e-stat tools for resear...
11.the trend analysis of the level of fin metrics and e-stat tools for resear...
 
Scientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an OverviewScientific Knowledge Graphs: an Overview
Scientific Knowledge Graphs: an Overview
 
Drowning in information – the need of macroscopes for research funding
Drowning in information – the need of macroscopes for research fundingDrowning in information – the need of macroscopes for research funding
Drowning in information – the need of macroscopes for research funding
 
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...Towards a Community-driven Data Science Body of Knowledge – Data Management S...
Towards a Community-driven Data Science Body of Knowledge – Data Management S...
 
Ed Fox on Learning Technologies
Ed Fox on Learning TechnologiesEd Fox on Learning Technologies
Ed Fox on Learning Technologies
 
Data science
Data science Data science
Data science
 
accelerating-data-driven
accelerating-data-drivenaccelerating-data-driven
accelerating-data-driven
 
Big Data and Computer Science Education
Big Data and Computer Science EducationBig Data and Computer Science Education
Big Data and Computer Science Education
 
Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...
Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...
Impulsion of Mining Paradigm with Density Based Clustering of Multi Dimension...
 
How to crack down big data?
How to crack down big data? How to crack down big data?
How to crack down big data?
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020
 
IC-SDV 2019: Semantic e-Science - The Future of Information Provision in the ...
IC-SDV 2019: Semantic e-Science - The Future of Information Provision in the ...IC-SDV 2019: Semantic e-Science - The Future of Information Provision in the ...
IC-SDV 2019: Semantic e-Science - The Future of Information Provision in the ...
 
Data Science: Origins, Methods, Challenges and the future?
Data Science: Origins, Methods, Challenges and the future?Data Science: Origins, Methods, Challenges and the future?
Data Science: Origins, Methods, Challenges and the future?
 
10 problems 06
10 problems 0610 problems 06
10 problems 06
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
DMDW Unit 1.pdf
DMDW Unit 1.pdfDMDW Unit 1.pdf
DMDW Unit 1.pdf
 
Data science for everyone
Data science for everyoneData science for everyone
Data science for everyone
 

More from ShivarkarSandip

Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
ShivarkarSandip
 

More from ShivarkarSandip (20)

Cluster Analysis: Measuring Similarity & Dissimilarity
Cluster Analysis: Measuring Similarity & DissimilarityCluster Analysis: Measuring Similarity & Dissimilarity
Cluster Analysis: Measuring Similarity & Dissimilarity
 
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
 
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
Frequent Pattern Analysis, Apriori and FP Growth AlgorithmFrequent Pattern Analysis, Apriori and FP Growth Algorithm
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
 
Data Warehouse and Architecture, OLAP Operation
Data Warehouse and Architecture, OLAP OperationData Warehouse and Architecture, OLAP Operation
Data Warehouse and Architecture, OLAP Operation
 
Data Preparation and Preprocessing , Data Cleaning
Data Preparation and Preprocessing , Data CleaningData Preparation and Preprocessing , Data Cleaning
Data Preparation and Preprocessing , Data Cleaning
 
Introduction to Data Mining, KDD Process, OLTP and OLAP
Introduction to Data Mining, KDD Process, OLTP and OLAPIntroduction to Data Mining, KDD Process, OLTP and OLAP
Introduction to Data Mining, KDD Process, OLTP and OLAP
 
Issues in data mining Patterns Online Analytical Processing
Issues in data mining  Patterns Online Analytical ProcessingIssues in data mining  Patterns Online Analytical Processing
Issues in data mining Patterns Online Analytical Processing
 
Introduction to data mining which covers the basics
Introduction to data mining which covers the basicsIntroduction to data mining which covers the basics
Introduction to data mining which covers the basics
 
Introduction to Data Communication.pdf
Introduction to Data Communication.pdfIntroduction to Data Communication.pdf
Introduction to Data Communication.pdf
 
Classification of Signal.pdf
Classification of Signal.pdfClassification of Signal.pdf
Classification of Signal.pdf
 
Sequential Circuit Design-2.pdf
Sequential Circuit Design-2.pdfSequential Circuit Design-2.pdf
Sequential Circuit Design-2.pdf
 
Sequential Ckt.pdf
Sequential Ckt.pdfSequential Ckt.pdf
Sequential Ckt.pdf
 
Flip Flop.pdf
Flip Flop.pdfFlip Flop.pdf
Flip Flop.pdf
 
Combinational Ckt.pdf
Combinational Ckt.pdfCombinational Ckt.pdf
Combinational Ckt.pdf
 
Boolean Algebra Terminologies.pdf
Boolean Algebra Terminologies.pdfBoolean Algebra Terminologies.pdf
Boolean Algebra Terminologies.pdf
 
Logic Minimization.pdf
Logic Minimization.pdfLogic Minimization.pdf
Logic Minimization.pdf
 
Unit III Introduction to DWH.pdf
Unit III Introduction to DWH.pdfUnit III Introduction to DWH.pdf
Unit III Introduction to DWH.pdf
 
Unit II Decision Making Basics and Concepts.pdf
Unit II Decision Making Basics and Concepts.pdfUnit II Decision Making Basics and Concepts.pdf
Unit II Decision Making Basics and Concepts.pdf
 
Unit I Factors Responsible for Successful BI Project.pdf
Unit I Factors Responsible for Successful BI Project.pdfUnit I Factors Responsible for Successful BI Project.pdf
Unit I Factors Responsible for Successful BI Project.pdf
 
Unit I Operational data Informational data.pdf
Unit I Operational data  Informational data.pdfUnit I Operational data  Informational data.pdf
Unit I Operational data Informational data.pdf
 

Recently uploaded

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 

Recently uploaded (20)

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 

Introduction to Data Mining KDD Process OLAP

  • 1. Sanjivani Rural Education Society’s Sanjivani College of Engineering, Kopargaon-423 603 (An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune) NACC ‘A’ Grade Accredited, ISO 9001:2015 Certified Department of Computer Engineering (NBA Accredited) Prof. S.A.Shivarkar Assistant Professor Contact No.8275032712 Email- shivarkarsandipcomp@sanjivani.org.in Subject- Data Mining and Warehousing (CO314) Unit –I: Introduction to Data Mining
  • 2. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 2 Content  Data Mining, Kinds of pattern and technologies  Data Mining Task Primitives  Issues in mining  KDD vs data mining  OLAP, knowledge  representation, Information and Knowledge;  Attribute Types: Nominal, Binary, Ordinal and Numeric attributes, Discrete versus Continuous attributes.
  • 3. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 3 Why Data Mining  “We are living in the information age” is popular saying, but actually living in the data age.  The Explosive Growth of Data: from terabytes to petabytes pour into computer networks by www, various data storage devices everyday from business, society, science and engineering, medicine and almost every other aspect of daily life.  Data collection and data availability  Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras, YouTube  We are drowning in data, but starving for knowledge!  “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
  • 4. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 4 How Data Mining turns large collection of data into knowledge  A search engine e.g.,Google receives hundreds of millions of queries every day can be viewed as a transaction where the user describes her or his information need.  What novel and useful knowledge can a search engine learn from such a huge collection of queries collected from users over time?  Interestingly, some patterns found in user search queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone. For example, Google’s Flu Trends uses specific search terms as indicators of flu activity.  It found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms.  A pattern emerges when all of the search queries related to flu are aggregated.  Using aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster than traditional systems can.  This example shows how data mining can turn a large collection of data into knowledge that can help meet a current global challenge.
  • 5. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 5 Evolution of Sciences  Before 1600, empirical science  1600-1950s, theoretical science  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.  1950s-1990s, computational science  Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)  Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.  1990-now, data science  The flood of data from new scientific instruments and simulations  The ability to economically store and manage petabytes of data online  The Internet and computing Grid that makes all these archives universally accessible  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!  Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
  • 6. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 6 The world is data rich but information poor  The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation  The data collected in large data repositories become “data tombs”  Consequently, important decisions are often made based not on the information-rich data stored in data repositories but rather on a decision maker’s intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data.  Efforts have been made to develop expert system and knowledge-based technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases.  Unfortunately, however, the manual knowledge input procedure is prone to biases and  errors and is extremely costly and time consuming.  The widening gap between data and information calls for the systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge.
  • 7. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 7 What is Data Mining  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data  Data mining: a misnomer?  Alternative names  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.  Watch out: Is everything “data mining”?  Simple search and query processing  (Deductive) expert systems
  • 8. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 8 Data mining—searching for knowledge (interesting patterns) in data.  Many other terms have a similar meaning to data mining—for example, knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
  • 9. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 9 Knowledge Discovery (KDD) Process
  • 10. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 10 Data mining as a step in the process of knowledge discovery 1. Data cleaning : to remove noise and inconsistent data 2. Data integration : where multiple data sources may be combined 3. Data selection : where data relevant to the analysis task are retrieved from the database 4. Data transformation : where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations 5. Data mining : an essential process where intelligent methods are applied to extract data patterns 6. Pattern evaluation : to identify the truly interesting patterns representing knowledge based on interestingness measures. 7. Knowledge presentation : where visualization and knowledge representation techniques are used to present mined knowledge to users
  • 11. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 11 Example: A Web Mining Framework  Web mining usually involves Data cleaning Data integration from multiple sources Warehousing the data Data cube construction Data selection for data mining Data mining Presentation of the mining results Patterns and knowledge to be used or stored into knowledge-base
  • 12. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 12 Data Mining Business Intelligence (BI)
  • 13. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 13 What is data?
  • 14. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 14 Attributes types in Data Mining  The attribute is the property of the object. The attribute represents different features of the object. Example:  In this example, Roll No, Name, and Result are attributes of the object student.
  • 15. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 15 Attributes types in Data Mining  Binary  Nominal  Numeric  Interval-scaled: e.g. calendar dates, temperature in Celsius or Fahrenheit  Ratio-scaled: these are like free numbers they can be anything e.g. length, time, counts
  • 16. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 16 Attributes types in data mining conti…  Binary: Binary data have only two values/states. Binary attributes types:  Symmetric  Asymmetric Symmetric Both values are equally important Asymmetric Both values are not equally important
  • 17. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 17 Attributes types in data mining conti…  Ordinal: All Values have a meaningful order. e.g. rankings (scalingfrom1to10), grades, height (tall ,medium, short)  Discrete: Discrete data have finite value. It can be in numerical form and can also be in categorical form. e.g. number of birds in a flock; the number of heads realized when a coin is flipped 10 times
  • 18. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 18 Attributes types in data mining conti…  Continuous: Continuous data technically have an infinite number of steps. Continuous data is in float type. There can be many numbers in between 1 and 2. e.g. weights and heights of birds, temperature of a day
  • 19. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 19 Attributes types summary
  • 20. 20 Data Mining Function: (1) Generalization  Information integration and data warehouse construction  Data cleaning, transformation, integration, and multidimensional data model  Data cube technology  Scalable methods for computing (i.e., materializing) multidimensional aggregates  OLAP (online analytical processing)  Multidimensional concept description: Characterization and discrimination  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
  • 21. 21 Data Mining Function: (2) Association and Correlation Analysis  Frequent patterns (or frequent itemsets)  What items are frequently purchased together in your Walmart?  Association, correlation vs. causality  A typical association rule Diaper  Beer [0.5%, 75%] (support, confidence)  Are strongly associated items also strongly correlated?  How to mine such patterns and rules efficiently in large datasets?  How to use such patterns for classification, clustering, and other applications?
  • 22. 22 Data Mining Function: (3) Classification  Classification and label prediction  Construct models (functions) based on some training examples  Describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (gas mileage)  Predict some unknown class labels  Typical methods  Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …  Typical applications:  Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
  • 23. 23 Data Mining Function: (4) Cluster Analysis  Unsupervised learning (i.e., Class label is unknown)  Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns  Principle: Maximizing intra-class similarity & minimizing interclass similarity  Many methods and applications
  • 24. 24 Data Mining Function: (5) Outlier Analysis  Outlier analysis  Outlier: A data object that does not comply with the general behavior of the data  Noise or exception? ― One person’s garbage could be another person’s treasure  Methods: by product of clustering or regression analysis, …  Useful in fraud detection, rare events analysis
  • 25. 25 Data Mining: Confluence of Multiple Disciplines
  • 27. 27 Typical framework of a data warehouse for AllElectronics
  • 28. 28 Data Cubes  Grouping of data in a multidimensional matrix is called data cubes.  In dataware housing, we generally deal with various multidimensional data models as the data will be represented by multiple dimensions and multiple attributes.  This multidimensional data is represented in the data cube as the cube represents a high-dimensional space.  The Data cube pictorially shows how different attributes of data are arranged in the data model.
  • 29. 29 Typical framework of a data warehouse for AllElectronics  A Figure multidimensional data cube, commonly used for data warehousing,  (a) showing summarized data for AllElectronics and  (b) showing summarized data resulting fromdrill-down and roll- up operations on the cube in  (c). For improved readability, only some of the cube cell values are shown.
  • 30. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 30 Reference  Han, Jiawei Kamber, Micheline Pei and Jian, “Data Mining: Concepts and Techniques”,Elsevier Publishers, ISBN:9780123814791, 9780123814807.  https://onlinecourses.nptel.ac.in/noc24_cs22