SlideShare a Scribd company logo
LECTURE 1:
INTRODUCTION TO DATA
MINING FOR DATA
SCIENCE APPLICATIONS
Acknowledgment: Compiled Slides for Academic use
What is data mining?
 Data mining is also called knowledge
discovery and data mining (KDD)
 Data mining is
 extraction of useful patterns from data sources,
e.g., databases, texts, web, image.
 Patterns must be:
 valid, novel, potentially useful, understandable
Data
Knowledge
Patterns
Data Mining
Knowledge Discovery in Data:
Process
Interpretation/
Evaluation
Knowledge Discovery in Data:
Process
Data
Volume
- Big Data
- Small Data
Variety
- Transaction
- Temporal
- Spatial
…
Velocity
- Data Stream
- Static
Knowledge Discovery in Data:
Challenges
5
Outline (Part 1)
 Introduction to Data
 Transactional Data
 Temporal Data
 Spatial & Spatial-Temporal Data
 Data Preprocessing
 Missing Values
 Summarization
INTRODUCTION TO DATA
Data Come from Everywhere
8
Hospital
Stock Exchange
Weather Station
Grocery Markets E-Commerce
Social Media
But, they have different form
What is Data?
 Collection of records and their
attributes
 An attribute is a characteristic
of an object
 A collection of attributes
describe an object
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Attributes
Objects
Types of Data
 Record Data
 Transactional Data
 Temporal Data
 Time Series Data
 Sequence Data
 Spatial & Spatial-
Temporal Data
 Spatial Data
 Spatial-Temporal Data
 Graph Data
 Transactional Data
 UnStructured Data
 Twitter Status Message
 Review, news article
 Semi-Structured Data
 Paper Publications Data
 XML format
Record Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Market-Basket Dataset
• Transaction Data
Data Matrix
 If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
 Such data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Data Matrix Example for
Documents
 Each document becomes a `term' vector,
 each term is a component (attribute) of the vector,
 the value of each component is the number of
times the corresponding term occurs in the
document.
Document 1
season
timeout
lost
wi
n
game
score
ball
pla
y
coach
team
Document 2
Document 3
3 0 5 0 2 6 0 2 0 2
0
0
7 0 2 1 0 0 3 0 0
1 0 0 1 2 2 0 3 0
Distance Matrix
0
1
2
3
0 1 2 3 4 5 6
p1
p2
p3 p4
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
Distance Matrix
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Temporal Data
 Sequences Data
(Patient Data obtained from Zhang’s KDD 06 Paper)
Temporal Data
 Time Series Data
Yahoo Finance Website
Biological Sequence Data
Interval Data
A
EL= { (A, 1, 5),( C, 3, 12), ( B, 4, 9), ( D, 9, 15) }
C
1 5
3 12
(A overlaps C )
B
4 9
contains B )
(
D
15
overlaps D )
(
time
(Interval Patient Data obtained from Amit’s M.Tech. Thesis
Work)
Spatial & Spatial-Temporal Data
(Spatial Distribution of Objects of Various Types : Prof. Shashi
Shekhar)
• Spatial Data
Spatial & Spatial-Temporal Data
Average Monthly Temperature of land and ocean
 Spatial Data
Spatial & Spatial-Temporal Data
 Spatial Data
Dengue Disease Dataset (Singapore)
Spatial & Spatial-Temporal Data
 Trajectory Data: Set of Harricans
http://csc.noaa.gov/hurricanes
Spatial & Spatial-Temporal Data
 Trajectory Data: (of 87 users obtained
using RFID)
Vast 2008 Challenge – RFID Dataset
User Movement Data
 Trajectory
 Movement trail of a user
 Sampling Points: <latitude, longitude, time>
P1 on weekends
Home
Swimming Pool
Movie Complex
Stadium
Thanks to Shreyash and Sahoishnu (M.Tech.
Graph Data
Semi-structured Data
Unstructured Data
Data can help us solve specific problems.
How should these pictures be
placed into 3 groups?
How should these pictures be placed into
groups? How many groups should there
be?
Which genes are associated with a disease? How
can expression values be used to predict survival?
What items should Amazon display
for me?
Is it likely that this stock was traded
based on illegal insider
information?
Where are the faces in this
picture?
Is this spam?
Will I like 300?
What techniques people apply on
data?
 They apply data mining algorithms and discover
useful knowledge
 So, what are the some of the well-known Data
mining Tasks?
 Clustering,
 Classification,
 Frequent Patterns,
 Association Rules,
 ….
What people do with the time
series data?
Clustering
Classification
Query by
Content
Rule
Discovery
10

s = 0.5
c = 0.3
Motif Discovery
Novelty Detection
Visualization Motif Association
What people do with the
trajectory data?
Clusterin
g
Motif Discovery
Refinery
Fishery
Port A Port B
Container Port
Container Ships Tankers Fishing Boats
Visualization
Frequent Travel Patterns
Classification
Prediction
Types of Data
 Transactional Data
 Sequence Data
 Interval Data
 Time Series Data
 Spatial Data
 Spatio-Temporal Data
 Data Set with Multiple
Kinds of Data
 ….
In, Summary
Data Mining
Methods
 Frequent Pattern
Discovery
 Classification
 Clustering
 Outlier Detection
 Statistical Analysis
 …
Algorith
ms
Activity 1
 Find top 3 recent research activities around the
world that are analyzing data. You need to write
short summary for each research activities. First
three line must follow following format:
 Line 1: Problem they are trying to sole along with
dataset they are using
 Line 2: How they are solving the problem
 Line 3: Justify yourself why you rate this work as a
top 5 activities
 Remaining lines… you can think yourself ….
BigN’Smart Research group at IIT-Roorkee is analyzing
“YelpReview” Dataset for learning Location-to-activity
Tagging. They are applying … . I feel this is an interesting
research because …
Activity 2: Why Data Mining
???
 Google
 Facebook
 Netflix
 eHarmony
 FICO
 FlightCaster
 IBM’s Watson
Read
About
Their
Story
Related Field
43
Statistics
Machine
Learning
Databases
Visualization
Data Mining and
Knowledge Discovery
Related Field
 Statistics:
 more theory-based
 more focused on testing hypotheses
 Machine learning
 more heuristic
 focused on improving performance of a learning agent
 also looks at real-time learning and robotics – areas not part of data
mining
 Data Mining and Knowledge Discovery
 integrates theory and heuristics
 focus on the entire process of knowledge discovery, including data
cleaning, learning, and integration and visualization of results
 Distinctions are fuzzy
45
Classification
Learn a method for predicting the instance class from pre-
labeled (classified) instances
Many approaches:
Statistics,
Decision Trees, Neural
Networks,
...
46
Clustering
Find “natural” grouping of instances
given un-labeled data
47
Association Rules & Frequent
Itemsets
TID Produce
1 MILK, BREAD, EGGS
2 BREAD, SUGAR
3 BREAD, CEREAL
4 MILK, BREAD, SUGAR
5 MILK, CEREAL
6 BREAD, CEREAL
7 MILK, CEREAL
8 MILK, BREAD, CEREAL, EGGS
9 MILK, BREAD, CEREAL
Transactions Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (3)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
48
Visualization & Data Mining
 Visualizing the data to
facilitate human
discovery
 Presenting the
discovered results in a
visually "nice" way
49
Summarization
 Describe features of the
selected group
 Use natural language and
graphics
 Usually in Combination with
Deviation detection or other
methods
Average length of stay in this study area rose 45.7 percent,
from 4.3 days to 6.2 days, because ...
Data Mining Models and Tasks
Obtained from Prof. Srini’s Lecture notes

More Related Content

Similar to KDD, Data Mining, Data Science_I.pptx

data.2.pptx
data.2.pptxdata.2.pptx
data.2.pptx
VaishnavGhadge1
 
G045033841
G045033841G045033841
G045033841
IJERA Editor
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
Dr. Radhey Shyam
 
Data-Mining-ppt (1).pptx
Data-Mining-ppt (1).pptxData-Mining-ppt (1).pptx
Data-Mining-ppt (1).pptx
Parvathyparu25
 
Data-Mining-ppt.pptx
Data-Mining-ppt.pptxData-Mining-ppt.pptx
Data-Mining-ppt.pptx
ayush309565
 
Data Mining
Data MiningData Mining
Data Miningshrapb
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
SangrangBargayary3
 
data mining
data miningdata mining
data mining
manasa polu
 
Data mining
Data miningData mining
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Review
ijdpsjournal
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
IJSRD
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
IJDKP
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data Mining
BRNSSPublicationHubI
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)
Zenodia Charpy
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Miningdataminers.ir
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
IJERA Editor
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
Dr. Radhey Shyam
 

Similar to KDD, Data Mining, Data Science_I.pptx (20)

data.2.pptx
data.2.pptxdata.2.pptx
data.2.pptx
 
G045033841
G045033841G045033841
G045033841
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
Data-Mining-ppt (1).pptx
Data-Mining-ppt (1).pptxData-Mining-ppt (1).pptx
Data-Mining-ppt (1).pptx
 
Data-Mining-ppt.pptx
Data-Mining-ppt.pptxData-Mining-ppt.pptx
Data-Mining-ppt.pptx
 
Data Mining
Data MiningData Mining
Data Mining
 
Lecture1
Lecture1Lecture1
Lecture1
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
data mining
data miningdata mining
data mining
 
Data mining
Data miningData mining
Data mining
 
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Review
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data Mining
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
 

Recently uploaded

Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 

Recently uploaded (20)

Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 

KDD, Data Mining, Data Science_I.pptx

  • 1. LECTURE 1: INTRODUCTION TO DATA MINING FOR DATA SCIENCE APPLICATIONS Acknowledgment: Compiled Slides for Academic use
  • 2. What is data mining?  Data mining is also called knowledge discovery and data mining (KDD)  Data mining is  extraction of useful patterns from data sources, e.g., databases, texts, web, image.  Patterns must be:  valid, novel, potentially useful, understandable
  • 3. Data Knowledge Patterns Data Mining Knowledge Discovery in Data: Process Interpretation/ Evaluation
  • 4. Knowledge Discovery in Data: Process
  • 5. Data Volume - Big Data - Small Data Variety - Transaction - Temporal - Spatial … Velocity - Data Stream - Static Knowledge Discovery in Data: Challenges 5
  • 6. Outline (Part 1)  Introduction to Data  Transactional Data  Temporal Data  Spatial & Spatial-Temporal Data  Data Preprocessing  Missing Values  Summarization
  • 8. Data Come from Everywhere 8 Hospital Stock Exchange Weather Station Grocery Markets E-Commerce Social Media But, they have different form
  • 9. What is Data?  Collection of records and their attributes  An attribute is a characteristic of an object  A collection of attributes describe an object Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Attributes Objects
  • 10. Types of Data  Record Data  Transactional Data  Temporal Data  Time Series Data  Sequence Data  Spatial & Spatial- Temporal Data  Spatial Data  Spatial-Temporal Data  Graph Data  Transactional Data  UnStructured Data  Twitter Status Message  Review, news article  Semi-Structured Data  Paper Publications Data  XML format
  • 11. Record Data TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk Market-Basket Dataset • Transaction Data
  • 12. Data Matrix  If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute  Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
  • 13. Data Matrix Example for Documents  Each document becomes a `term' vector,  each term is a component (attribute) of the vector,  the value of each component is the number of times the corresponding term occurs in the document. Document 1 season timeout lost wi n game score ball pla y coach team Document 2 Document 3 3 0 5 0 2 6 0 2 0 2 0 0 7 0 2 1 0 0 3 0 0 1 0 0 1 2 2 0 3 0
  • 14. Distance Matrix 0 1 2 3 0 1 2 3 4 5 6 p1 p2 p3 p4 point x y p1 0 2 p2 2 0 p3 3 1 p4 5 1 Distance Matrix p1 p2 p3 p4 p1 0 2.828 3.162 5.099 p2 2.828 0 1.414 3.162 p3 3.162 1.414 0 2 p4 5.099 3.162 2 0
  • 15. Temporal Data  Sequences Data (Patient Data obtained from Zhang’s KDD 06 Paper)
  • 16. Temporal Data  Time Series Data Yahoo Finance Website
  • 18. Interval Data A EL= { (A, 1, 5),( C, 3, 12), ( B, 4, 9), ( D, 9, 15) } C 1 5 3 12 (A overlaps C ) B 4 9 contains B ) ( D 15 overlaps D ) ( time (Interval Patient Data obtained from Amit’s M.Tech. Thesis Work)
  • 19. Spatial & Spatial-Temporal Data (Spatial Distribution of Objects of Various Types : Prof. Shashi Shekhar) • Spatial Data
  • 20. Spatial & Spatial-Temporal Data Average Monthly Temperature of land and ocean  Spatial Data
  • 21. Spatial & Spatial-Temporal Data  Spatial Data Dengue Disease Dataset (Singapore)
  • 22. Spatial & Spatial-Temporal Data  Trajectory Data: Set of Harricans http://csc.noaa.gov/hurricanes
  • 23. Spatial & Spatial-Temporal Data  Trajectory Data: (of 87 users obtained using RFID) Vast 2008 Challenge – RFID Dataset
  • 24. User Movement Data  Trajectory  Movement trail of a user  Sampling Points: <latitude, longitude, time> P1 on weekends Home Swimming Pool Movie Complex Stadium Thanks to Shreyash and Sahoishnu (M.Tech.
  • 28. Data can help us solve specific problems.
  • 29. How should these pictures be placed into 3 groups?
  • 30. How should these pictures be placed into groups? How many groups should there be?
  • 31. Which genes are associated with a disease? How can expression values be used to predict survival?
  • 32. What items should Amazon display for me?
  • 33. Is it likely that this stock was traded based on illegal insider information?
  • 34. Where are the faces in this picture?
  • 36. Will I like 300?
  • 37. What techniques people apply on data?  They apply data mining algorithms and discover useful knowledge  So, what are the some of the well-known Data mining Tasks?  Clustering,  Classification,  Frequent Patterns,  Association Rules,  ….
  • 38. What people do with the time series data? Clustering Classification Query by Content Rule Discovery 10  s = 0.5 c = 0.3 Motif Discovery Novelty Detection Visualization Motif Association
  • 39. What people do with the trajectory data? Clusterin g Motif Discovery Refinery Fishery Port A Port B Container Port Container Ships Tankers Fishing Boats Visualization Frequent Travel Patterns Classification Prediction
  • 40. Types of Data  Transactional Data  Sequence Data  Interval Data  Time Series Data  Spatial Data  Spatio-Temporal Data  Data Set with Multiple Kinds of Data  …. In, Summary Data Mining Methods  Frequent Pattern Discovery  Classification  Clustering  Outlier Detection  Statistical Analysis  … Algorith ms
  • 41. Activity 1  Find top 3 recent research activities around the world that are analyzing data. You need to write short summary for each research activities. First three line must follow following format:  Line 1: Problem they are trying to sole along with dataset they are using  Line 2: How they are solving the problem  Line 3: Justify yourself why you rate this work as a top 5 activities  Remaining lines… you can think yourself …. BigN’Smart Research group at IIT-Roorkee is analyzing “YelpReview” Dataset for learning Location-to-activity Tagging. They are applying … . I feel this is an interesting research because …
  • 42. Activity 2: Why Data Mining ???  Google  Facebook  Netflix  eHarmony  FICO  FlightCaster  IBM’s Watson Read About Their Story
  • 44. Related Field  Statistics:  more theory-based  more focused on testing hypotheses  Machine learning  more heuristic  focused on improving performance of a learning agent  also looks at real-time learning and robotics – areas not part of data mining  Data Mining and Knowledge Discovery  integrates theory and heuristics  focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results  Distinctions are fuzzy
  • 45. 45 Classification Learn a method for predicting the instance class from pre- labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • 46. 46 Clustering Find “natural” grouping of instances given un-labeled data
  • 47. 47 Association Rules & Frequent Itemsets TID Produce 1 MILK, BREAD, EGGS 2 BREAD, SUGAR 3 BREAD, CEREAL 4 MILK, BREAD, SUGAR 5 MILK, CEREAL 6 BREAD, CEREAL 7 MILK, CEREAL 8 MILK, BREAD, CEREAL, EGGS 9 MILK, BREAD, CEREAL Transactions Frequent Itemsets: Milk, Bread (4) Bread, Cereal (3) Milk, Bread, Cereal (2) … Rules: Milk => Bread (66%)
  • 48. 48 Visualization & Data Mining  Visualizing the data to facilitate human discovery  Presenting the discovered results in a visually "nice" way
  • 49. 49 Summarization  Describe features of the selected group  Use natural language and graphics  Usually in Combination with Deviation detection or other methods Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
  • 50. Data Mining Models and Tasks Obtained from Prof. Srini’s Lecture notes