SlideShare a Scribd company logo
1 of 38
1
Mrs. Dipali Meher
Modern College of Arts, Science and Commerce,
Ganeshkhind, Pune 411016
Data Mining : An Introduction
2
Bayes Thm(1763)
Regression(1805)
KDD(1989)
Support Vector Machine(1992)
Data Science(2001)
Moneyball(2003)
Turing(1963)
Neural Networks(1943)
Evolutionary Computation(1965)
Databases(1970)
Genetic Algorithms(1975)
Big Data
From Then till Now…..
3
DBMS
RDBMS
Distributed DBMS
Data Mining
4
Data Mining deals with the discovery of
hidden Knowledge , unexpected pattern
and new rules from large data sets
5
Examples of Information extracted using query
language
 List customers who use credit card to purchase
more than Rs. 10000 worth groceries
 List patients who had at least one heart attack
 List students who had at least one backlog
 List employees who have taken home loans
6
Examples of what data mining is used for
 Develop a general profile of credit card customers
 Determine patients whose lifestyle is prone to getting a
heart attack in near future
 Differentiate poor credit risk customers from good
credit card customers
 Differentiate students who had one backlogs in their
academic
 Determine employees who have taken loan for any
purpose
Data Mining differs from usual query processing in
many ways
Query Processing Data Mining
Query Wel formed as
Select…
From…
Where……
Query is not well formed.
What is found out that is
usually hidden
Data Data from online
transaction processing
systems generally in table
formats
Data is integrated from
various sources. Huge
amount of data
Output Subset of databases Not only subset but also
in analyzed and in terms
of patterns
7
8
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or
knowledge from huge amount of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
•Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or
knowledge from huge amount of data
Data mining: a misnomer?
•Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
9
Knowledge discovery in databases (KDD)-is a multistep
process of finding useful information and patterns in
data while Data Mining is one of the steps in KDD of
using algorithms for extraction of patterns
Steps Of KDD
1. Selection-
Data Extraction -Obtaining Data from heterogeneous data sources -
Databases, Data warehouses, World wide web or other information
repositories
2. Preprocessing-
Data Cleaning- Incomplete , noisy, inconsistent data to be cleaned-
Missing data may be ignored or predicted, erroneous data may be deleted
or corrected
10
3. Transformation-
Data Integration- Combines data from multiple sources
into a coherent store -Data can be encoded in common
formats, normalized, reduced
4. Data mining –
Apply algorithms to transformed data an extract patterns
5. Pattern Interpretation/evaluation -
Pattern Evaluation- Evaluate the interestingness of resulting
patterns or apply interestingness measures to filter out
discovered patterns
Knowledge presentation- present the mined knowledge-
visualization techniques can be used
11
Transformation
KDD is the nontrivial extraction of
implicit previously unknown and
potentially useful knowledge from
data
Knowledge Discovery Process
Preprocessing
Data Mining
Pattern Interpretation and
evaluation
Selection
12
13
27
40
34
54
24 25
29
0
10
20
30
40
50
60
a b c d e f g
Graph 1
a
b
c
d
e
f
g
Graphical-bar charts, pie
charts histograms
Icon-based- using colors figures as
icons
14
Hierarchical- Hierarchically dividing
display area
Geometric-boxplot, scatter plot
15
Pixel-based- data as colored pixels
Hybrid- combination of above
approaches
Why Data Mining?—Potential Applications
 Data analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management (CRM), market basket
analysis, cross selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other Applications
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis
16
Data Mining algorithms-All algorithms attempt to fit a model
closest to the data being examined.
Model is based on the analysis of attributes of a training data
set
The Model is than evaluated using a test data set
Data Model can be
 Predictive model makes predictions regarding data values
using the results found from available data. Thus it makes use
of historical data to make predictions
 Descriptive model identifies patterns or relationships in data. It
finds out the properties of existing data and does not predict
the new properties.
17
Data Mining
Predictive Descriptive
Classification
Regression
Time series Analysis
Prediction
Clustering
Summarization
Association rules
Sequence Discovery
18
Classification- maps data into predefined groups or classes
It uses supervised learning .
The algorithm uses learning phase to build a classifier using training
data set containing data attributes and associated class labels
Example : result of a student. In which class students result will be…
Pattern recognition is type of classification where input patter is
classified into several classes based on its similarity to predefined
classes.
Example: to identify terrorists from passengers. They are identified with
their basic pattern as distance between eyes, size and shape. Then
these patterns are compared with entries into data to see whether
any match were found.
19
20
21
Grade Useful Heat Value(kcal/kg)
A >6200
B 5601 - 6200
C 4941 - 5600
D 4201 - 4940
E 3361 - 4200
F 2401 - 3360
G 1301 - 2400
22
Regression-maps data into real-valued prediction variable.
Algorithm tries to find best function (linear, Non-linear that fits the
training data). Assumes that target data always fits into some
function.
Example . College professor determines his retirement plan based on
current savings and income. If professor want to do more savings
then he must alter his experiences by using simple linear regression
formula.
23
Time Series Analysis- the value of an attribute is examined as it varies over
time
It can be used to determine similarities, classify the behavior or predict future
values
Example
Share market
Prediction – predicts future values using regression, time series analysis or other
approaches
Example
To find out flood prediction of river depending on water level, rain amount time,
humidity. Sensors at different locations are placed in the river area which will
monitor flood condition and flood prediction can be done.
Whether analysis
Pollution analysis
24
25
Clustering -Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes
Interpretability and usability-results should be comprehensible
and usable-domain expert is required
Example
Students are clustered among various attributes like good
academics, area in which they live, age, height, weight, body
mass index, extra curricular activities.
Clusters do not have specific size and shape.
26
Outlier
27
Summarization - maps data into subsets with simple descriptions- It extracts or
derives representative summary type of information
Example
Summary of student result whish give you number of students appeared for the
exam passed, failed and according to classes
Association rules–discovers relationship among data – used in
Market basket analysis to find item frequently purchased together
Example: person buying a sugar in the mall also buys milk. The thing
which person buy together will always kept together.
28
Sequence Discovery- discovers sequential patterns in
data-order in which items are purchased or data is
accessed
Example:
When TV set will be purchased by customer , sales
manager assumes that customer also buys some cds and
music system.
29
Influence from many disciplines
Data Mining
Artificial
IntelligenceInformation
Technology Database
Technology
Machine
Learning
Pattern
Recognition
Statistics
Algorithm
Visualization
Mathematical
Modeling
30
Depending on data mining approach, techniques from
other disciplines may be applied such as
•Information Retrieval
•Artificial Intelligence
•Neural networks
•Fuzzy set theory
•Knowledge representation
•Logic programming
•High performance computing
31
Data Mining issues
 Human interaction- to analyze the output and find the
correct inference after data mining step interfaces required
with both domain and technical experts
 Over fitting – It occurs when the model fits for the current
data exactly but does not fit for future data-if training
dataset will be wrong then over fitting occurs
 Outliers – The model may get distorted because of the
presence of outliers
 Interpretation of results- experts are required due to
interpretability problems
 Visualization of results- visualization helps to display
analyzed data – but for multi-dimensional data visualization
becomes problematic
32
Data Mining issues continued…
 Large datasets- scalability may arise – as algorithms do not
scale well with massive real-world datasets- sampling and
parallelization are effective tools are used to solve this problem
 High dimensionality -Conventional database may contain
many different attributes out of them all are not relevant. Some
may increases complexity and reduces efficiency. This is known
as dimensionality curse -data reduction can be done so that
dimensionality reduction will also be there.
 Multimedia data - found in GIS databases proves
conventional data mining algorithms ineffective
 Missing data -It is not always possible to ignore missing data
but in preprocessing data mining algorithms can be used to
replace missing data with estimates
33
Data Mining issues continued…
 Irrelevant data – data reduced by removing irrelevant data
 Noisy data –Invalid , incorrect data will lead to poor quality
data mining
 Changing data- Data warehouses contain non-volatile data-
Dynamic data is uploaded and then algorithms are reapplied to
check their correct working.
 Integration- KDD requests are one time needs-data mining
functions are now integrated into traditional database systems
 Applications – Effective use of output of mining algorithm is
a challenge rather than the complexity of the mining algorithm
34
Data Mining Metrics
How to measure the effectiveness of data mining process?
-KDD process is expensive- Return on investment will be the
saving due to decision process using the results
-Difficult to measure and quantify
Social Implications of Data mining
It is two sides of the coin
Data mining can be used to improve customer service and
satisfaction
Data mining can be used to confront one’s right to privacy
Omnipresent Invisible Data mining affecting everyone
35
Data mining should follow certain Guidelines
Purpose specification and use limitation
Openness
Security safeguards
Individual participation
Privacy Preserving data mining
- secure Multiparty computation
- data obscuration
36
Applications of Data Mining
Security-To find out terrorists using classification
technique
Whether- To predict whether, pollution
Finance-Share market
Ecommerce-Market basket analysis
Education-Student result preparation
Bank- Analysis of customer for buying loan
Research- Data Analysis
Fraud detection
Marketing-targeting customers
Molecular biology
Astronomy
Health- to find out disease in peoples
37
Books for Reference
Data Mining, Introduction and Advanced Topics by
Margaret H. Dunham and Sridhar
Pearson Education
ISBN 81-7758-785-4
Data Mining Concepts and Techniques by Jiawei Han
and Micheline Kamber
Morgan Kaufmann Publishers
ISBN 81-312-0535-5
.
38

More Related Content

What's hot

Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with PythonDavis David
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1malathieswaran29
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Seerat Malik
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Salah Amean
 
Data warehousing and online analytical processing
Data warehousing and online analytical processingData warehousing and online analytical processing
Data warehousing and online analytical processingVijayasankariS
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Salah Amean
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 

What's hot (20)

Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Data mining
Data miningData mining
Data mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
Data warehousing and online analytical processing
Data warehousing and online analytical processingData warehousing and online analytical processing
Data warehousing and online analytical processing
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
data mining
data miningdata mining
data mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 

Similar to Data mining an introduction

Data mining introduction
Data mining introductionData mining introduction
Data mining introductionBasma Gamal
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptxHarsha Patel
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CSThanveen
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfssuserb933d8
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 
Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Bikramjit Sarkar, Ph.D.
 

Similar to Data mining an introduction (20)

Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
Unit 1.pptx
Unit 1.pptxUnit 1.pptx
Unit 1.pptx
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Data mining
Data miningData mining
Data mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Unit i
Unit iUnit i
Unit i
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdf
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Talk
TalkTalk
Talk
 
Data Mining and Knowledge
Data Mining and KnowledgeData Mining and Knowledge
Data Mining and Knowledge
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)
 
Data mining
Data miningData mining
Data mining
 

More from Dr-Dipali Meher

More from Dr-Dipali Meher (16)

Database Security Methods, DAC, MAC,View
Database Security Methods, DAC, MAC,ViewDatabase Security Methods, DAC, MAC,View
Database Security Methods, DAC, MAC,View
 
Version Stamps in NOSQL Databases
Version Stamps in NOSQL DatabasesVersion Stamps in NOSQL Databases
Version Stamps in NOSQL Databases
 
Literature Review
Literature ReviewLiterature Review
Literature Review
 
Research Problem
Research ProblemResearch Problem
Research Problem
 
Formulation of Research Design
Formulation of Research DesignFormulation of Research Design
Formulation of Research Design
 
Types of Research
Types of ResearchTypes of Research
Types of Research
 
Research Methodology-Intorduction
Research Methodology-IntorductionResearch Methodology-Intorduction
Research Methodology-Intorduction
 
Introduction to Research
Introduction to ResearchIntroduction to Research
Introduction to Research
 
Neo4j session
Neo4j sessionNeo4j session
Neo4j session
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Consistency in NoSQL
Consistency in NoSQLConsistency in NoSQL
Consistency in NoSQL
 
Data models in NoSQL
Data models in NoSQLData models in NoSQL
Data models in NoSQL
 
Schema migrations in no sql
Schema migrations in no sqlSchema migrations in no sql
Schema migrations in no sql
 
Polyglot Persistence
Polyglot Persistence Polyglot Persistence
Polyglot Persistence
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classification
 
Function Pointer
Function PointerFunction Pointer
Function Pointer
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 

Recently uploaded (20)

Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 

Data mining an introduction

  • 1. 1 Mrs. Dipali Meher Modern College of Arts, Science and Commerce, Ganeshkhind, Pune 411016 Data Mining : An Introduction
  • 2. 2 Bayes Thm(1763) Regression(1805) KDD(1989) Support Vector Machine(1992) Data Science(2001) Moneyball(2003) Turing(1963) Neural Networks(1943) Evolutionary Computation(1965) Databases(1970) Genetic Algorithms(1975) Big Data From Then till Now…..
  • 4. 4 Data Mining deals with the discovery of hidden Knowledge , unexpected pattern and new rules from large data sets
  • 5. 5 Examples of Information extracted using query language  List customers who use credit card to purchase more than Rs. 10000 worth groceries  List patients who had at least one heart attack  List students who had at least one backlog  List employees who have taken home loans
  • 6. 6 Examples of what data mining is used for  Develop a general profile of credit card customers  Determine patients whose lifestyle is prone to getting a heart attack in near future  Differentiate poor credit risk customers from good credit card customers  Differentiate students who had one backlogs in their academic  Determine employees who have taken loan for any purpose
  • 7. Data Mining differs from usual query processing in many ways Query Processing Data Mining Query Wel formed as Select… From… Where…… Query is not well formed. What is found out that is usually hidden Data Data from online transaction processing systems generally in table formats Data is integrated from various sources. Huge amount of data Output Subset of databases Not only subset but also in analyzed and in terms of patterns 7
  • 8. 8 Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 9. •Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? •Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. 9
  • 10. Knowledge discovery in databases (KDD)-is a multistep process of finding useful information and patterns in data while Data Mining is one of the steps in KDD of using algorithms for extraction of patterns Steps Of KDD 1. Selection- Data Extraction -Obtaining Data from heterogeneous data sources - Databases, Data warehouses, World wide web or other information repositories 2. Preprocessing- Data Cleaning- Incomplete , noisy, inconsistent data to be cleaned- Missing data may be ignored or predicted, erroneous data may be deleted or corrected 10
  • 11. 3. Transformation- Data Integration- Combines data from multiple sources into a coherent store -Data can be encoded in common formats, normalized, reduced 4. Data mining – Apply algorithms to transformed data an extract patterns 5. Pattern Interpretation/evaluation - Pattern Evaluation- Evaluate the interestingness of resulting patterns or apply interestingness measures to filter out discovered patterns Knowledge presentation- present the mined knowledge- visualization techniques can be used 11
  • 12. Transformation KDD is the nontrivial extraction of implicit previously unknown and potentially useful knowledge from data Knowledge Discovery Process Preprocessing Data Mining Pattern Interpretation and evaluation Selection 12
  • 13. 13 27 40 34 54 24 25 29 0 10 20 30 40 50 60 a b c d e f g Graph 1 a b c d e f g Graphical-bar charts, pie charts histograms Icon-based- using colors figures as icons
  • 14. 14 Hierarchical- Hierarchically dividing display area Geometric-boxplot, scatter plot
  • 15. 15 Pixel-based- data as colored pixels Hybrid- combination of above approaches
  • 16. Why Data Mining?—Potential Applications  Data analysis and decision support  Market analysis and management  Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation  Risk analysis and management  Forecasting, customer retention, improved underwriting, quality control, competitive analysis  Fraud detection and detection of unusual patterns (outliers)  Other Applications  Text mining (news group, email, documents) and Web mining  Stream data mining  Bioinformatics and bio-data analysis 16
  • 17. Data Mining algorithms-All algorithms attempt to fit a model closest to the data being examined. Model is based on the analysis of attributes of a training data set The Model is than evaluated using a test data set Data Model can be  Predictive model makes predictions regarding data values using the results found from available data. Thus it makes use of historical data to make predictions  Descriptive model identifies patterns or relationships in data. It finds out the properties of existing data and does not predict the new properties. 17
  • 18. Data Mining Predictive Descriptive Classification Regression Time series Analysis Prediction Clustering Summarization Association rules Sequence Discovery 18
  • 19. Classification- maps data into predefined groups or classes It uses supervised learning . The algorithm uses learning phase to build a classifier using training data set containing data attributes and associated class labels Example : result of a student. In which class students result will be… Pattern recognition is type of classification where input patter is classified into several classes based on its similarity to predefined classes. Example: to identify terrorists from passengers. They are identified with their basic pattern as distance between eyes, size and shape. Then these patterns are compared with entries into data to see whether any match were found. 19
  • 20. 20
  • 21. 21 Grade Useful Heat Value(kcal/kg) A >6200 B 5601 - 6200 C 4941 - 5600 D 4201 - 4940 E 3361 - 4200 F 2401 - 3360 G 1301 - 2400
  • 22. 22 Regression-maps data into real-valued prediction variable. Algorithm tries to find best function (linear, Non-linear that fits the training data). Assumes that target data always fits into some function. Example . College professor determines his retirement plan based on current savings and income. If professor want to do more savings then he must alter his experiences by using simple linear regression formula.
  • 23. 23 Time Series Analysis- the value of an attribute is examined as it varies over time It can be used to determine similarities, classify the behavior or predict future values Example Share market
  • 24. Prediction – predicts future values using regression, time series analysis or other approaches Example To find out flood prediction of river depending on water level, rain amount time, humidity. Sensors at different locations are placed in the river area which will monitor flood condition and flood prediction can be done. Whether analysis Pollution analysis 24
  • 25. 25 Clustering -Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters Unsupervised learning: no predefined classes Interpretability and usability-results should be comprehensible and usable-domain expert is required Example Students are clustered among various attributes like good academics, area in which they live, age, height, weight, body mass index, extra curricular activities. Clusters do not have specific size and shape.
  • 27. 27 Summarization - maps data into subsets with simple descriptions- It extracts or derives representative summary type of information Example Summary of student result whish give you number of students appeared for the exam passed, failed and according to classes
  • 28. Association rules–discovers relationship among data – used in Market basket analysis to find item frequently purchased together Example: person buying a sugar in the mall also buys milk. The thing which person buy together will always kept together. 28
  • 29. Sequence Discovery- discovers sequential patterns in data-order in which items are purchased or data is accessed Example: When TV set will be purchased by customer , sales manager assumes that customer also buys some cds and music system. 29
  • 30. Influence from many disciplines Data Mining Artificial IntelligenceInformation Technology Database Technology Machine Learning Pattern Recognition Statistics Algorithm Visualization Mathematical Modeling 30
  • 31. Depending on data mining approach, techniques from other disciplines may be applied such as •Information Retrieval •Artificial Intelligence •Neural networks •Fuzzy set theory •Knowledge representation •Logic programming •High performance computing 31
  • 32. Data Mining issues  Human interaction- to analyze the output and find the correct inference after data mining step interfaces required with both domain and technical experts  Over fitting – It occurs when the model fits for the current data exactly but does not fit for future data-if training dataset will be wrong then over fitting occurs  Outliers – The model may get distorted because of the presence of outliers  Interpretation of results- experts are required due to interpretability problems  Visualization of results- visualization helps to display analyzed data – but for multi-dimensional data visualization becomes problematic 32
  • 33. Data Mining issues continued…  Large datasets- scalability may arise – as algorithms do not scale well with massive real-world datasets- sampling and parallelization are effective tools are used to solve this problem  High dimensionality -Conventional database may contain many different attributes out of them all are not relevant. Some may increases complexity and reduces efficiency. This is known as dimensionality curse -data reduction can be done so that dimensionality reduction will also be there.  Multimedia data - found in GIS databases proves conventional data mining algorithms ineffective  Missing data -It is not always possible to ignore missing data but in preprocessing data mining algorithms can be used to replace missing data with estimates 33
  • 34. Data Mining issues continued…  Irrelevant data – data reduced by removing irrelevant data  Noisy data –Invalid , incorrect data will lead to poor quality data mining  Changing data- Data warehouses contain non-volatile data- Dynamic data is uploaded and then algorithms are reapplied to check their correct working.  Integration- KDD requests are one time needs-data mining functions are now integrated into traditional database systems  Applications – Effective use of output of mining algorithm is a challenge rather than the complexity of the mining algorithm 34
  • 35. Data Mining Metrics How to measure the effectiveness of data mining process? -KDD process is expensive- Return on investment will be the saving due to decision process using the results -Difficult to measure and quantify Social Implications of Data mining It is two sides of the coin Data mining can be used to improve customer service and satisfaction Data mining can be used to confront one’s right to privacy Omnipresent Invisible Data mining affecting everyone 35
  • 36. Data mining should follow certain Guidelines Purpose specification and use limitation Openness Security safeguards Individual participation Privacy Preserving data mining - secure Multiparty computation - data obscuration 36
  • 37. Applications of Data Mining Security-To find out terrorists using classification technique Whether- To predict whether, pollution Finance-Share market Ecommerce-Market basket analysis Education-Student result preparation Bank- Analysis of customer for buying loan Research- Data Analysis Fraud detection Marketing-targeting customers Molecular biology Astronomy Health- to find out disease in peoples 37
  • 38. Books for Reference Data Mining, Introduction and Advanced Topics by Margaret H. Dunham and Sridhar Pearson Education ISBN 81-7758-785-4 Data Mining Concepts and Techniques by Jiawei Han and Micheline Kamber Morgan Kaufmann Publishers ISBN 81-312-0535-5 . 38