SlideShare a Scribd company logo
1 of 56
Presented By:
Mr. Sameer Kumar Das
Asst. Professor In CSE (GATE)
B.TECH,M.TECH,PH.D.(ENGINEERING)/C...
1
1. Data mining techniques
2. Demo
3. Overview of KDD and data mining
4. Summary
Asst. Professor in CSE
3
Knowledge-Based System
User
Shell
KnowledgeBase
UserInterface
Knowledge
Extraction
Explication
Inference
Data/Expert
System-
Engineer
Knowledge
Engineer
Data Mining
Asst.Professor In CSE
 What is KDD?
 Why is KDD necessary
 The KDD process
 KDD operations and methods
4Asst. Professor In CSE
The iterative and interactive process of
discovering valid, novel, useful, and
understandable knowledge ( patterns,
models, rules etc.) in Massive
databases
5
What is data mining?
Asst. Professor In CSE
 Valid: generalize to the future
 Novel: what we don't know
 Useful: be able to take some action
 Understandable: leading to insight
 Iterative: takes multiple passes
 Interactive: human in the loop
6Asst. Professor In CSE
 Data volume too large for classical analysis
 Number of records too large (millions or billions)
 High dimensional (attributes/features/ fields) data
(thousands)
 Increased opportunity for access
 Web navigation, on-line collections
7Asst. Professor In CSE
 Prediction
 What? Opaque
 Description
 Why?Transparent
8Asst. Professor In CSE
 Verification driven
 Validating hypothesis
 Querying and reporting (spreadsheets, pivot
tables)
 Multidimensional analysis (dimensional
summaries); On Line Analytical Processing
 Statistical analysis
9Asst. Professor In CSE
 Discovery driven
 Exploratory data analysis
 Predictive modeling
 Database segmentation
 Link analysis
 Deviation detection
10Asst. Professor In CSE
11
Original
Data
Target
Data
Preprocessed
Data
Transformed
Data
Patterns
Knowledge
Selection
Preprocessing
Transformation
Data Mining
Interpretation
Asst. Professor In CSE
 Understand application domain
 Prior knowledge, user goals
 Create target dataset
 Select data, focus on subsets
 Data cleaning and transformation
 Remove noise, outliers, missing values
 Select features, reduce dimensions
12Asst. Professor In CSE
 Apply data mining algorithm
 Associations, sequences, classification, clustering,
etc.
 Interpret, evaluate and visualize patterns
 What's new and interesting?
 Iterate if needed
 Manage discovered knowledge
 Close the loop
13Asst. Professor In CSE
14
Original DB
Create/select
target database Select sample
Supply missing
values
Eliminate
noisy data
Normalize
values
Transform
values
Create derived
attributes
Relevant
attributes
Select DM
task (s)
Select DM
method (s)
Extract
knowledge
Test
knowledge
Refine
knowledge
Presentation,
visualization
Asst. Professor In CSE
 AI
 Machine learning
 Statistics
 Databases and data warehousing
 High performance computing
 Visualization
15Asst. Professor In CSE
 Human analysis breaks down with volume
and dimensionality
 How quickly can one digest 1 million records, with
100 attributes
 High rate of growth, changing sources
 What is done by non-statisticians?
 Select a few fields and fit simple models or
attempt to visualize
16Asst. Professor In CSE
1. Overview of KDD and data mining
2. Data mining techniques
3. Demo
4. Summary
5. KDD resources pointers
17Asst. professor In CSE
 Predictive modeling (classification,
regression)
 Segmentation (clustering)
 Dependency modeling (graphical models,
density estimation)
 Summarization (associations)
 Change and deviation detection
18Asst. Professor In CSE
 Association rules: detect sets of attributes that
frequently co-occur, and rules among them, e.g. 90%
of the people who buy cookies, also buy milk (60% of
all grocery shoppers buy both)
 Sequence mining (categorical): discover sequences of
events that commonly occur together, .e.g. In a set of
DNA sequences ACGTC is followed by GTCA after a
gap of 9, with 30% probability
19Asst. Professor In CSE
 CBR or Similarity search: given a database of objects,
and a “query” object, find the object(s) that are within
a user-defined distance of the queried object, or find
all pairs within some distance of each other.
 Deviation detection: find the record(s) that is (are) the
most different from the other records, i.e., find all
outliers.These may be thrown away as noise or may
be the “interesting” ones.
20Asst. Professor In CSE
 Many other methods, such as
 Decision trees
 Neural networks
 Genetic algorithms
 Hidden markov models
 Time series
 Bayesian networks
 Soft computing: rough and fuzzy sets
21Asst. Professor In CSE
 Classification and regression: assign a new data record
to one of several predefined categories or classes.
Regression deals with predicting real-valued fields.
Also called supervised learning.
 Clustering: partition the dataset into subsets or
groups such that elements of a group share a common
set of properties, with high within group similarity and
small inter-group similarity. Also called unsupervised
learning.
22Asst. Professor In CSE
 Scalability
 Efficient and sufficient sampling
 In-memory vs. disk-based processing
 High performance computing
 Automation
 Ease of use
 Using prior knowledge
23Asst. Professor In CSE
 General descriptive knowledge
 Summarizations
 symbolic descriptions of subsets
 Discriminative knowledge
 Distinguish between K classes
 Accurate classification (also black box)
 Separate spaces
24Asst. Professor In CSE
 Representation: language for
patterns/models, expressive power
 Evaluation: scoring methods for deciding
what is a good fit of model to data
 Search: method for enumerating
patterns/models
25Asst. Professor In CSE
 Association rules
 Sequence mining
 Classification(decision tree etc.)
 Clustering
 Deviation detection
 K-nearest neighbors
26Asst. Professor In CSE
 Given a set of items/attributes, and a set of
objects containing a subset of the items
 Find rules: if I1 then I2 (sup, conf)
 I1, I2 are sets of items
 I1, I2 have sufficient support: P(I1+I2)
 Rule has sufficient confidence: P(I2|I1)
27Asst. Professor In CSE
28
What is association mining?
Ex:
If A and B then C
If A and not B then C
If A and B and C then D etc.
Asst. Professor In CSE
29
Support is defined as the minimum percentage of transactions in the
DB containing A and B.
Confidence is defined as the minimum percentage of those
transactions containing A that also contain B.
Ex. Suppose the DB contains 1 million transactions andt that 10‘000
of those transactions contain both A and B.
We can then say that the support of the association if A then B is:
S= 10‘000/1‘000‘000 = 1%.
Likewise, if 50‘000 of the transactions contain A and 10‘000 out of
those 50‘000 also contain B then the association rule if A then B has a
confidence 10‘000/50‘000 = 20%.
Confidence is just the conditional probability of B given A.
Asst. Professor In CSE
30
R: LS  RS
Supp(R)= supp(LS RS)
= #Transaction verifying R / (Total # of Transaction)
Conf(R) = supp(LS RS)/supp(LS)
Ex:
R: Milk=> cookies,
A support(R) of 0.8 means in 80% of transaktion Milk and
cookies are together.
The confidence means the correlation, the relation between
the LS and the RS.
Asst. Professor In CSE
Ticket 1 Ticket 2 Ticket 3 Ticket 4
Farine Oeufs Farine Oeufs
Sucre Sucre Oeufs Chocolat
Lait Chocolat Sucre Thé
Chocolat
31
Farine  Sucre has a confidence of 100%, this is the force
of the association and a support of 2/3.  number of
association farine => Sucre divided by number of ticket where
sucre or farine exist.
Asst. Professor In CSE
32
TransaktionID PassagierID Ziel
431 102 New York
431 102 London
431 102 Cairo
431 102 Paris
701 38 New York
701 38 London
701 38 Cairo
11 531 New York
11 531 Cairo
301 102 New York
301 102 London
301 102 Paris
Asst. Professor In CSE
33
What is Support and what is Confidence?
Having the following Rule:
Rule: Who visit New York, visit London too.<==>
New York=>London.
Calculate the support the Support and the Confidence of this
Rule.
Asst. Professor In CSE
 Given a set of items, list of events per
sequence ordered in time
 Find rules: if S1 then S2 (sup, conf)
 S1, S2 are sequences of items
 S1, S2 have sufficient support: P(S1+S2)
 Rule has sufficient confidence: P(S2|S1)
34Asst. Professor In CSE
 User specifies “interestingness”
 Minimum support (minsup)
 Minimum confidence (minconf)
 Find all frequent sequences (> minsup)
 Exponential Search Space
 Computation and I/O Intensive
 Generate strong rules (> minconf)
 Relatively cheap
35Asst. Professor In CSE
 A “black box” that makes predictions about
the future based on information from the past
and present
 Large number of input available
36
Model
Age
Salary
CarType
High/Low Risk
Asst. Professor In CSE
 Some models are better than others
 Accuracy
 Understandability
 Models range from easy to understand to
incomprehensible
 Decision trees
 Rule induction
 Regression models
 Neural networks
37
Easier
Harder
Asst. Professor In CSE
Classification is the process of assigning new
objects to predefined categories or classes
 Given a set of labeled records
 Build a model (decision tree)
 Predict labels for future unlabeled records
38Asst. Professor In CSE
 Supervised learning (labels known)
 Example described in terms of attributes
 Categorical (unordered symbolic values)
 Numeric (integers, reals)
 Class (output/predicted attribute): categorical
for classification, numeric for regression
39Asst. Professor In CSE
40
Tid Age Car Type Class
0 23 Family High
1 17 Sports High
2 43 Sports High
3 68 Family Low
4 32 Truck Low
5 20 Family High
Age < 27.5
High Low
CarType  {Sports}
High
Age=40, CarType=Family  Class=Low
Numeric Categorical
Asst. Professor In CSE
41
Age < 27.5
High Low
CarType  {Sports}
High
1) Age < 27.5  High
2) Age >= 27.5 and
CarType = Sports  High
3) Age >= 27.5 and
CarType  Sports  Low
Asst. Professor In CSE
Given N k-dimensional feature vectors , find a
“meaningful” partition of the N examples into
c subsets or groups
 Discover the “labels” automatically
 c may be given, or “discovered”
 much more difficult than classification, since
in the latter the groups are given, and we seek
a compact description
42Asst. Professor In CSE
 Have to define some notion of “similarity”
between examples
 Goal: maximize intra-cluster similarity and
minimize inter-cluster similarity
 Feature vector be
 All numeric (well defined distances)
 All categorical or mixed (harder to define
similarity; geometric notions don’t work)
43Asst. Professor In CSE
 Distance-based
 Numeric
▪ Euclidean distance (root of sum of squared differences
along each dimension)
▪ Angle between two vectors
 Categorical
▪ Number of common features (categorical)
 Partition-based
 Enumerate partitions and score each
44Asst. Professor In CSE
45
Initial seeds
Asst. Professor In CSE
46
New centers
Asst. Professor In CSE
47
Final centers
Asst. Professor In CSE
48
outlier
Asst. Professor In CSE
 Classification technique to assign a class to a
new example
 Find k-nearest neighbors, i.e., most similar
points in the dataset (compare against all
points!)
 Assign the new case to the same class to
which most of its neighbors belong
49Asst. Professor In CSE
50
Neighborhood
5 of class
3 of class
=
Asst. Professor In CSE
1. Overview of KDD and data mining
2. Data mining techniques
3. Demo
4. ResearchTrends
5. Summary
6. KDD resources pointers
51Asst. Professor In CSE
1. Overview of KDD and data mining
2. Data mining techniques
3. Demo
4. Summary
5. KDD resources pointers
52Asst. Professor In CSE
 Scientific and economic need for KDD
 Made possible by recent advances in data
collection, processing power, and
sophisticated techniques fromAI, databases
and visualization
 KDD is a complex process
 Several techniques need to be used
53Asst. Professor In CSE
 Need for rich knowledge representation
 Need to integrate specific domain
knowledge.
 KDD using Fuzzy-categorical and Uncertainty
Techniques
 Web Mining and User profile
 KDD for Bio-Informatique
54Asst. Professor In CSE
 ACM SIGKDD: www.acm.org/sigkdd
 KDD Nuggets: www.kdnuggets.com
 Book: Advances in KDD, MIT Press, ’96
 Journal: Data Mining and KDD,
 research.microsoft.com/datamine
55Asst. Professor In CSE
56
Question ?
Asst. Professor In CSE

More Related Content

What's hot

Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Edureka!
 
The Geometry of Learning
The Geometry of LearningThe Geometry of Learning
The Geometry of Learningfridolin.wild
 
Qda ces 2013 toronto workshop
Qda ces 2013 toronto workshopQda ces 2013 toronto workshop
Qda ces 2013 toronto workshopCesToronto
 
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new pptSalford Systems
 
Search Engines
Search EnginesSearch Engines
Search Enginesbutest
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 

What's hot (18)

sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
sigir2020
sigir2020sigir2020
sigir2020
 
1st sem
1st sem1st sem
1st sem
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
 
587_EswarPrasadReddyMachireddy_CEE
587_EswarPrasadReddyMachireddy_CEE587_EswarPrasadReddyMachireddy_CEE
587_EswarPrasadReddyMachireddy_CEE
 
The Geometry of Learning
The Geometry of LearningThe Geometry of Learning
The Geometry of Learning
 
Qda ces 2013 toronto workshop
Qda ces 2013 toronto workshopQda ces 2013 toronto workshop
Qda ces 2013 toronto workshop
 
4 module 3 --
4 module 3 --4 module 3 --
4 module 3 --
 
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new ppt
 
7 decision tree
7 decision tree7 decision tree
7 decision tree
 
603_SaiKiranPutta_CEE
603_SaiKiranPutta_CEE603_SaiKiranPutta_CEE
603_SaiKiranPutta_CEE
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
598_RamaSrikanthJakkam_CEE
598_RamaSrikanthJakkam_CEE598_RamaSrikanthJakkam_CEE
598_RamaSrikanthJakkam_CEE
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
662_AravindKumarN_CEE
662_AravindKumarN_CEE662_AravindKumarN_CEE
662_AravindKumarN_CEE
 
Techniques Machine Learning
Techniques Machine LearningTechniques Machine Learning
Techniques Machine Learning
 

Similar to Kdd by Mr.Sameer Kumar Das

kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Data mining-2
Data mining-2Data mining-2
Data mining-2Nit Hik
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsSteven Gustafson
 
Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023SakshiTiwari490123
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Intobutest
 
(Talk in Powerpoint Format)
(Talk in Powerpoint Format)(Talk in Powerpoint Format)
(Talk in Powerpoint Format)butest
 
Lexicon base approch
Lexicon base approchLexicon base approch
Lexicon base approchanil maurya
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns associationDeepaR42
 
On Machine Learning and Data Mining
On Machine Learning and Data MiningOn Machine Learning and Data Mining
On Machine Learning and Data Miningbutest
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyPaolo Missier
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchjim
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Researchbutest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchkevinlan
 
Mining Frequent Patterns, Associations, and.pptx
 Mining Frequent Patterns, Associations, and.pptx Mining Frequent Patterns, Associations, and.pptx
Mining Frequent Patterns, Associations, and.pptxRushikeshChikane2
 
EE-232-LEC-01 Data_structures.pptx
EE-232-LEC-01 Data_structures.pptxEE-232-LEC-01 Data_structures.pptx
EE-232-LEC-01 Data_structures.pptxiamultapromax
 

Similar to Kdd by Mr.Sameer Kumar Das (20)

kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Data-Mining-2.ppt
Data-Mining-2.pptData-Mining-2.ppt
Data-Mining-2.ppt
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital Decisions
 
Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Lec1-Into
Lec1-IntoLec1-Into
Lec1-Into
 
(Talk in Powerpoint Format)
(Talk in Powerpoint Format)(Talk in Powerpoint Format)
(Talk in Powerpoint Format)
 
Lexicon base approch
Lexicon base approchLexicon base approch
Lexicon base approch
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns association
 
On Machine Learning and Data Mining
On Machine Learning and Data MiningOn Machine Learning and Data Mining
On Machine Learning and Data Mining
 
Data mining
Data mining Data mining
Data mining
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparency
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Mining Frequent Patterns, Associations, and.pptx
 Mining Frequent Patterns, Associations, and.pptx Mining Frequent Patterns, Associations, and.pptx
Mining Frequent Patterns, Associations, and.pptx
 
EE-232-LEC-01 Data_structures.pptx
EE-232-LEC-01 Data_structures.pptxEE-232-LEC-01 Data_structures.pptx
EE-232-LEC-01 Data_structures.pptx
 

Recently uploaded

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 

Recently uploaded (20)

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 

Kdd by Mr.Sameer Kumar Das

  • 1. Presented By: Mr. Sameer Kumar Das Asst. Professor In CSE (GATE) B.TECH,M.TECH,PH.D.(ENGINEERING)/C... 1
  • 2. 1. Data mining techniques 2. Demo 3. Overview of KDD and data mining 4. Summary Asst. Professor in CSE
  • 4.  What is KDD?  Why is KDD necessary  The KDD process  KDD operations and methods 4Asst. Professor In CSE
  • 5. The iterative and interactive process of discovering valid, novel, useful, and understandable knowledge ( patterns, models, rules etc.) in Massive databases 5 What is data mining? Asst. Professor In CSE
  • 6.  Valid: generalize to the future  Novel: what we don't know  Useful: be able to take some action  Understandable: leading to insight  Iterative: takes multiple passes  Interactive: human in the loop 6Asst. Professor In CSE
  • 7.  Data volume too large for classical analysis  Number of records too large (millions or billions)  High dimensional (attributes/features/ fields) data (thousands)  Increased opportunity for access  Web navigation, on-line collections 7Asst. Professor In CSE
  • 8.  Prediction  What? Opaque  Description  Why?Transparent 8Asst. Professor In CSE
  • 9.  Verification driven  Validating hypothesis  Querying and reporting (spreadsheets, pivot tables)  Multidimensional analysis (dimensional summaries); On Line Analytical Processing  Statistical analysis 9Asst. Professor In CSE
  • 10.  Discovery driven  Exploratory data analysis  Predictive modeling  Database segmentation  Link analysis  Deviation detection 10Asst. Professor In CSE
  • 12.  Understand application domain  Prior knowledge, user goals  Create target dataset  Select data, focus on subsets  Data cleaning and transformation  Remove noise, outliers, missing values  Select features, reduce dimensions 12Asst. Professor In CSE
  • 13.  Apply data mining algorithm  Associations, sequences, classification, clustering, etc.  Interpret, evaluate and visualize patterns  What's new and interesting?  Iterate if needed  Manage discovered knowledge  Close the loop 13Asst. Professor In CSE
  • 14. 14 Original DB Create/select target database Select sample Supply missing values Eliminate noisy data Normalize values Transform values Create derived attributes Relevant attributes Select DM task (s) Select DM method (s) Extract knowledge Test knowledge Refine knowledge Presentation, visualization Asst. Professor In CSE
  • 15.  AI  Machine learning  Statistics  Databases and data warehousing  High performance computing  Visualization 15Asst. Professor In CSE
  • 16.  Human analysis breaks down with volume and dimensionality  How quickly can one digest 1 million records, with 100 attributes  High rate of growth, changing sources  What is done by non-statisticians?  Select a few fields and fit simple models or attempt to visualize 16Asst. Professor In CSE
  • 17. 1. Overview of KDD and data mining 2. Data mining techniques 3. Demo 4. Summary 5. KDD resources pointers 17Asst. professor In CSE
  • 18.  Predictive modeling (classification, regression)  Segmentation (clustering)  Dependency modeling (graphical models, density estimation)  Summarization (associations)  Change and deviation detection 18Asst. Professor In CSE
  • 19.  Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g. 90% of the people who buy cookies, also buy milk (60% of all grocery shoppers buy both)  Sequence mining (categorical): discover sequences of events that commonly occur together, .e.g. In a set of DNA sequences ACGTC is followed by GTCA after a gap of 9, with 30% probability 19Asst. Professor In CSE
  • 20.  CBR or Similarity search: given a database of objects, and a “query” object, find the object(s) that are within a user-defined distance of the queried object, or find all pairs within some distance of each other.  Deviation detection: find the record(s) that is (are) the most different from the other records, i.e., find all outliers.These may be thrown away as noise or may be the “interesting” ones. 20Asst. Professor In CSE
  • 21.  Many other methods, such as  Decision trees  Neural networks  Genetic algorithms  Hidden markov models  Time series  Bayesian networks  Soft computing: rough and fuzzy sets 21Asst. Professor In CSE
  • 22.  Classification and regression: assign a new data record to one of several predefined categories or classes. Regression deals with predicting real-valued fields. Also called supervised learning.  Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within group similarity and small inter-group similarity. Also called unsupervised learning. 22Asst. Professor In CSE
  • 23.  Scalability  Efficient and sufficient sampling  In-memory vs. disk-based processing  High performance computing  Automation  Ease of use  Using prior knowledge 23Asst. Professor In CSE
  • 24.  General descriptive knowledge  Summarizations  symbolic descriptions of subsets  Discriminative knowledge  Distinguish between K classes  Accurate classification (also black box)  Separate spaces 24Asst. Professor In CSE
  • 25.  Representation: language for patterns/models, expressive power  Evaluation: scoring methods for deciding what is a good fit of model to data  Search: method for enumerating patterns/models 25Asst. Professor In CSE
  • 26.  Association rules  Sequence mining  Classification(decision tree etc.)  Clustering  Deviation detection  K-nearest neighbors 26Asst. Professor In CSE
  • 27.  Given a set of items/attributes, and a set of objects containing a subset of the items  Find rules: if I1 then I2 (sup, conf)  I1, I2 are sets of items  I1, I2 have sufficient support: P(I1+I2)  Rule has sufficient confidence: P(I2|I1) 27Asst. Professor In CSE
  • 28. 28 What is association mining? Ex: If A and B then C If A and not B then C If A and B and C then D etc. Asst. Professor In CSE
  • 29. 29 Support is defined as the minimum percentage of transactions in the DB containing A and B. Confidence is defined as the minimum percentage of those transactions containing A that also contain B. Ex. Suppose the DB contains 1 million transactions andt that 10‘000 of those transactions contain both A and B. We can then say that the support of the association if A then B is: S= 10‘000/1‘000‘000 = 1%. Likewise, if 50‘000 of the transactions contain A and 10‘000 out of those 50‘000 also contain B then the association rule if A then B has a confidence 10‘000/50‘000 = 20%. Confidence is just the conditional probability of B given A. Asst. Professor In CSE
  • 30. 30 R: LS  RS Supp(R)= supp(LS RS) = #Transaction verifying R / (Total # of Transaction) Conf(R) = supp(LS RS)/supp(LS) Ex: R: Milk=> cookies, A support(R) of 0.8 means in 80% of transaktion Milk and cookies are together. The confidence means the correlation, the relation between the LS and the RS. Asst. Professor In CSE
  • 31. Ticket 1 Ticket 2 Ticket 3 Ticket 4 Farine Oeufs Farine Oeufs Sucre Sucre Oeufs Chocolat Lait Chocolat Sucre Thé Chocolat 31 Farine  Sucre has a confidence of 100%, this is the force of the association and a support of 2/3.  number of association farine => Sucre divided by number of ticket where sucre or farine exist. Asst. Professor In CSE
  • 32. 32 TransaktionID PassagierID Ziel 431 102 New York 431 102 London 431 102 Cairo 431 102 Paris 701 38 New York 701 38 London 701 38 Cairo 11 531 New York 11 531 Cairo 301 102 New York 301 102 London 301 102 Paris Asst. Professor In CSE
  • 33. 33 What is Support and what is Confidence? Having the following Rule: Rule: Who visit New York, visit London too.<==> New York=>London. Calculate the support the Support and the Confidence of this Rule. Asst. Professor In CSE
  • 34.  Given a set of items, list of events per sequence ordered in time  Find rules: if S1 then S2 (sup, conf)  S1, S2 are sequences of items  S1, S2 have sufficient support: P(S1+S2)  Rule has sufficient confidence: P(S2|S1) 34Asst. Professor In CSE
  • 35.  User specifies “interestingness”  Minimum support (minsup)  Minimum confidence (minconf)  Find all frequent sequences (> minsup)  Exponential Search Space  Computation and I/O Intensive  Generate strong rules (> minconf)  Relatively cheap 35Asst. Professor In CSE
  • 36.  A “black box” that makes predictions about the future based on information from the past and present  Large number of input available 36 Model Age Salary CarType High/Low Risk Asst. Professor In CSE
  • 37.  Some models are better than others  Accuracy  Understandability  Models range from easy to understand to incomprehensible  Decision trees  Rule induction  Regression models  Neural networks 37 Easier Harder Asst. Professor In CSE
  • 38. Classification is the process of assigning new objects to predefined categories or classes  Given a set of labeled records  Build a model (decision tree)  Predict labels for future unlabeled records 38Asst. Professor In CSE
  • 39.  Supervised learning (labels known)  Example described in terms of attributes  Categorical (unordered symbolic values)  Numeric (integers, reals)  Class (output/predicted attribute): categorical for classification, numeric for regression 39Asst. Professor In CSE
  • 40. 40 Tid Age Car Type Class 0 23 Family High 1 17 Sports High 2 43 Sports High 3 68 Family Low 4 32 Truck Low 5 20 Family High Age < 27.5 High Low CarType  {Sports} High Age=40, CarType=Family  Class=Low Numeric Categorical Asst. Professor In CSE
  • 41. 41 Age < 27.5 High Low CarType  {Sports} High 1) Age < 27.5  High 2) Age >= 27.5 and CarType = Sports  High 3) Age >= 27.5 and CarType  Sports  Low Asst. Professor In CSE
  • 42. Given N k-dimensional feature vectors , find a “meaningful” partition of the N examples into c subsets or groups  Discover the “labels” automatically  c may be given, or “discovered”  much more difficult than classification, since in the latter the groups are given, and we seek a compact description 42Asst. Professor In CSE
  • 43.  Have to define some notion of “similarity” between examples  Goal: maximize intra-cluster similarity and minimize inter-cluster similarity  Feature vector be  All numeric (well defined distances)  All categorical or mixed (harder to define similarity; geometric notions don’t work) 43Asst. Professor In CSE
  • 44.  Distance-based  Numeric ▪ Euclidean distance (root of sum of squared differences along each dimension) ▪ Angle between two vectors  Categorical ▪ Number of common features (categorical)  Partition-based  Enumerate partitions and score each 44Asst. Professor In CSE
  • 49.  Classification technique to assign a class to a new example  Find k-nearest neighbors, i.e., most similar points in the dataset (compare against all points!)  Assign the new case to the same class to which most of its neighbors belong 49Asst. Professor In CSE
  • 50. 50 Neighborhood 5 of class 3 of class = Asst. Professor In CSE
  • 51. 1. Overview of KDD and data mining 2. Data mining techniques 3. Demo 4. ResearchTrends 5. Summary 6. KDD resources pointers 51Asst. Professor In CSE
  • 52. 1. Overview of KDD and data mining 2. Data mining techniques 3. Demo 4. Summary 5. KDD resources pointers 52Asst. Professor In CSE
  • 53.  Scientific and economic need for KDD  Made possible by recent advances in data collection, processing power, and sophisticated techniques fromAI, databases and visualization  KDD is a complex process  Several techniques need to be used 53Asst. Professor In CSE
  • 54.  Need for rich knowledge representation  Need to integrate specific domain knowledge.  KDD using Fuzzy-categorical and Uncertainty Techniques  Web Mining and User profile  KDD for Bio-Informatique 54Asst. Professor In CSE
  • 55.  ACM SIGKDD: www.acm.org/sigkdd  KDD Nuggets: www.kdnuggets.com  Book: Advances in KDD, MIT Press, ’96  Journal: Data Mining and KDD,  research.microsoft.com/datamine 55Asst. Professor In CSE