SlideShare a Scribd company logo
We live in a world where vast amounts of data are collected daily.
Analyzing such data is an important need.
One emerging data repository architecture is the data warehouse .
This is a repository of multiple heterogeneous data sources.
Data mining turns a large collection of data into knowledge.
A search engine (e.g.,Google) receives hundreds of millions of queries
every day.
Each query can be viewed as a transaction where the user describes her
or his information need.
Data Mining
11/9/20191
Data Mining and Data Preprocessing
We live in a world where vast amounts of data are collected daily.
Analyzing such data is an important need.
One emerging data repository architecture is the data warehouse .
This is a repository of multiple heterogeneous data sources.
Data mining turns a large collection of data into knowledge.
A search engine (e.g.,Google) receives hundreds of millions of queries
every day.
Each query can be viewed as a transaction where the user describes her
or his information need.
Data Mining Introduction
11/9/20192
Data Mining and Data Preprocessing
Data Collection and Database Creation
(1960s and earlier)
Primitive file processing
Data Mining Introduction
11/9/20193
Data Mining and Data Preprocessing
Database Management Systems
(1970s to early 1980s)
Hierarchical and network database systems
Relational database systems
Data modeling: entity-relationship models, etc.
Indexing and accessing methods
Data Mining Introduction
11/9/20194
Data Mining and Data Preprocessing
Query languages: SQL, etc.
User interfaces, forms, and reports
Query processing and optimization
Transactions, concurrency control, and recovery
Online transaction processing (OLTP)
Data Mining Introduction
11/9/20195
Data Mining and Data Preprocessing
Advanced Database Systems
(mid-1980s to present)
Advanced data models: extended-relational, object relational, deductive,
etc.
Managing complex data: spatial, temporal, multimedia, sequence and
structured,
scientific, engineering, moving objects, etc.
Data Mining Introduction
11/9/20196
Data Mining and Data Preprocessing
Data streams and cyber-physical data systems
Web-based databases (XML, semantic web)
Managing uncertain data and data cleaning
Integration of heterogeneous sources
Text database systems and integration with information retrieval
Data Mining Introduction
11/9/20197
Data Mining and Data Preprocessing
Extremely large data management
Database system tuning and adaptive systems
Advanced queries: ranking, skyline, etc.
Cloud computing and parallel data processing
Issues of data privacy and security
Data Mining Introduction
11/9/20198
Data Mining and Data Preprocessing
Advanced Data Analysis
(late- 1980s to present)
Data warehouse and OLAP
Data mining and knowledge discovery: classification, clustering, outlier
analysis, association and correlation, comparative summary,
discrimination analysis, pattern
discovery, trend and deviation analysis, etc.
Mining complex types of data: streams, sequence, text, spatial, temporal,
multimedia, Web, networks, etc.
Data Mining Introduction
11/9/20199
Data Mining and Data Preprocessing
Data mining applications: business, society, retail, banking,
telecommunications, science and engineering, blogs, daily life, etc.
Data mining and society: invisible data mining, privacy-preserving data
mining,
mining social and information networks, recommender systems, etc.
One emerging data repository architecture is the data warehouse.
This is a repository of multiple heterogeneous data sources
Data Mining Introduction
11/9/201910
Data Mining and Data Preprocessing
Data Mining
The process that finds a small set of precious nuggets from a great deal
of raw material.
Knowledge mining from data.
Also known as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging.
The process of knowledge discovery.
Knowledge Discovery from Data,
or KDD
11/9/201911
Data Mining and Data Preprocessing
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved
from the
database)
4. Data transformation (where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations)
Knowledge Discovery from Data,
or KDD
11/9/201912
Data Mining and Data Preprocessing
5. Data mining (an essential process where intelligent methods are applied
to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined knowledge to users)
Knowledge Discovery from Data,
or KDD
11/9/201913
Data Mining and Data Preprocessing
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system
dynamically.
Knowledge Discovery from Data,
or KDD
11/9/201914
Data Mining and Data Preprocessing
Knowledge Discovery from Data, or
KDD
11/9/201915
Data Mining and Data Preprocessing
Figure 1.1 Data mining as
a step in the process of
knowledge discovery.
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system
dynamically.
Knowledge Discovery from Data,
or KDD
11/9/201916
Data Mining and Data Preprocessing
Today’s real-world databases are highly susceptible to noisy, missing,
and inconsistent data due to their typically huge size.
They origin from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
The data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results.
To improve the efficiency and ease of the mining process.
Data Preprocessing
11/9/201917
Data Mining and Data Preprocessing
There are several data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in
data.
Data integration merges data from multiple sources into a coherent data store
such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are
scaled to fall within a smaller range like 0.0 to 1.0. This can improve the
accuracy and efficiency of mining algorithms.
Data Preprocessing
11/9/201918
Data Mining and Data Preprocessing
Data Quality
Data have quality if they satisfy the requirements of the intended use.
There are many factors comprising data quality
1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Believability and Interpretability.
Data Preprocessing
11/9/201919
Data Mining and Data Preprocessing
Inaccurate Data:
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data
entry.
Users may purposely submit incorrect data values for mandatory fields
when they do not wish to submit personal information.
Data Preprocessing
11/9/201920
Data Mining and Data Preprocessing
Incomplete Data:
Incomplete data can occur for a number of reasons.
 Attributes of interest may not always be available, such as customer
information for sales transaction data.
Other data may not be included simply because they were not
considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding or
because of equipment malfunctions.
Data Preprocessing
11/9/201921
Data Mining and Data Preprocessing
Inconsistent Data:
Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input fields (e.g.,
date).
Timeliness:
The fact that the month-end data are not updated in a timely fashion
has a negative impact on the data quality.
Data Preprocessing
11/9/201922
Data Mining and Data Preprocessing
Believability :
It reflects how much the data are trusted by users.
Interpretability:
It reflects how easy the data are understood. Suppose that a database, at
one point, had several errors, all of which have since been corrected.
Even though the database is now accurate, complete, consistent, and
timely, sales department users may regard it as of low quality due to poor
believability and interpretability.
Data Preprocessing
11/9/201923
Data Mining and Data Preprocessing
Fill in missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
Missing Values
Ignore the tuple: This is usually done when the class label is
missing.
This method is not very effective, unless the tuple contains
several attributes with missing values.
Data Cleaning
11/9/201924
Data Mining and Data Preprocessing
Fill in the missing value manually: Time consuming and may not be
feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant such as a label like “Unknown”.
 Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value.
Use the attribute mean or median for all samples belonging
to the same class as the given tuple.
Use the most probable value to fill in the missing value.
Data Cleaning
11/9/201925
Data Mining and Data Preprocessing
Noisy Data:
Random error or variance in a measured variable.
Data smoothing techniques:
Binning:
These methods smooth a sorted data value by consulting its “neighbor-
hood,” that is, the values around it.
The sorted values are distributed into a number of “buckets,” or bins.
Binning methods consult the neighborhood of values which perform
local smoothing.
Data Cleaning
11/9/201926
Data Mining and Data Preprocessing
Binning methods for data smoothing:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins: Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means: Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries: Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
Data Cleaning
11/9/201927 Data Mining and Data Preprocessing
In smoothing by bin medians , the each bin value is replaced by the
bin median.
In smoothing by bin boundaries, the minimum and maximum values
in a given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
Data Cleaning
11/9/201928
Data Mining and Data Preprocessing
Regression:
A technique that con- forms data values to a function.
Linear regression involves finding the “best” line to fit two attributes .
One attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where
more than two attributes are involved .
The data are fit to a multidimensional surface.
Data Cleaning
11/9/201929
Data Mining and Data Preprocessing
Outlier analysis:
Outliers may be detected by clustering.
The similar values are organized into groups, or “clusters.”.
The values that fall outside of the set of clusters may be considered
outliers.
Data Cleaning
11/9/201930
Data Mining and Data Preprocessing
Data Cleaning
11/9/201931
Data Mining and Data Preprocessing
Figure 1.2 A 2-D customer data plot with respect to customer locations in a city,
showing three data clusters. Outliers may be detected as values that fall outside of the
cluster sets.
Data Cleaning as a process:
Discrepancies can be caused by several factors, including poorly
designed data entry forms that have many optional fields, human
error in data entry, deliberate errors and data decay.
Discrepancies may also arise from inconsistent data
representations and inconsistent use of codes.
Other sources of discrepancies include errors in instrumentation
devices that record data and system errors.
 There may also be inconsistencies due to data integration.
Data Cleaning
11/9/201932
Data Mining and Data Preprocessing
Discrepancy Detection Methods:
Field overloading is another error source that typically results when
developers squeeze new attribute definitions into unused (bit) portions of
already defined attributes.
A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
A consecutive rule says that there can be no missing values between the
lowest and highest values for the attribute, and that all values must also be
unique (e.g., as in check numbers).
Data Cleaning
11/9/201933
Data Mining and Data Preprocessing
Discrepancy Detection Methods:
A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition.
Data scrubbing tools use simple domain knowledge (e.g., knowledge
of postal addresses and spell-checking) to detect errors and make
corrections in the data.. Data auditing tools find discrepancies by
analyzing the data to discover rules and relationships, and detecting data
that violate such conditions.
They are variants of data mining tools. For example, they may employ
statistical analysis to find correlations, or clustering to identify outliers.
Data Cleaning
11/9/201934
Data Mining and Data Preprocessing
Commercial tools can assist in the data transformation step.
Data migration tools allow simple transformations to be specified such
as to replace the string “gender” by “sex.”
ETL (extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI).
Data Cleaning
11/9/201935
Data Mining and Data Preprocessing
Thank You
11/9/201936
Data Mining and Data Preprocessing

More Related Content

What's hot

Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
TanujaSomvanshi1
 
Data Mining
Data MiningData Mining
Data Mining
SHIKHA GAUTAM
 
Data science life cycle
Data science life cycleData science life cycle
Data science life cycle
Manoj Mishra
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
Srinimf-Slides
 
Data visualisation & analytics with Tableau
Data visualisation & analytics with Tableau Data visualisation & analytics with Tableau
Data visualisation & analytics with Tableau
Outreach Digital
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
SUJIT SHIBAPRASAD MAITY
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
Seerat Malik
 
Machine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life ScienceMachine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life Science
IDEAS - Int'l Data Engineering and Science Association
 
Predicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine LearningPredicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine Learning
John Alex
 
Ppt for Application of big data
Ppt for Application of big dataPpt for Application of big data
Ppt for Application of big data
Prashant Sharma
 
Predictive analytics
Predictive analytics Predictive analytics
Predictive analytics
SAS Singapore Institute Pte Ltd
 
Data mining
Data miningData mining
Data mining
Birju Tank
 
Business analytics and data visualisation
Business analytics and data visualisationBusiness analytics and data visualisation
Business analytics and data visualisation
Shwetabh Jaiswal
 
Machine learning
Machine learningMachine learning
Machine learning
Navdeep Asteya
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaging
geetachauhan
 
Data Science
Data ScienceData Science
Data Science
Prakhyath Rai
 
Computer vision
Computer visionComputer vision
Computer vision
AnkitKamal6
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
amiteshg
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
Vikram Nandini
 

What's hot (20)

Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Data Mining
Data MiningData Mining
Data Mining
 
Data science life cycle
Data science life cycleData science life cycle
Data science life cycle
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Data visualisation & analytics with Tableau
Data visualisation & analytics with Tableau Data visualisation & analytics with Tableau
Data visualisation & analytics with Tableau
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Machine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life ScienceMachine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life Science
 
Predicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine LearningPredicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine Learning
 
Ppt for Application of big data
Ppt for Application of big dataPpt for Application of big data
Ppt for Application of big data
 
Predictive analytics
Predictive analytics Predictive analytics
Predictive analytics
 
Data mining
Data miningData mining
Data mining
 
Business analytics and data visualisation
Business analytics and data visualisationBusiness analytics and data visualisation
Business analytics and data visualisation
 
Machine learning
Machine learningMachine learning
Machine learning
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaging
 
Data Science
Data ScienceData Science
Data Science
 
Computer vision
Computer visionComputer vision
Computer vision
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 

Similar to Data mining

Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.
DurgaDeviP2
 
Lec 1 introduction
Lec 1 introductionLec 1 introduction
Lec 1 introduction
Shimul Ahmmed
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
Golda Margret Sheeba J
 
Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
Kishore Kumar
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
Subrata Kumer Paul
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
DanWooster1
 
DM Lecture 3
DM Lecture 3DM Lecture 3
DM Lecture 3
asad199
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
ShubhamSamrat5
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
AidaMustapha6
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
admsoyadm4
 
data mining
data miningdata mining
data mining
AMITKUMAR202236
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
VaibhavGupta447155
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
asnaparveen414
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data Science
Piyumi Sendanayaka
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
MulukenTamrat2
 
Data Mining - Presentation.pptx
Data Mining - Presentation.pptxData Mining - Presentation.pptx
Data Mining - Presentation.pptx
fahadusman23
 

Similar to Data mining (20)

Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.
 
Lec 1 introduction
Lec 1 introductionLec 1 introduction
Lec 1 introduction
 
Data mining
Data miningData mining
Data mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Data Mining
Data MiningData Mining
Data Mining
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
My3prep
My3prepMy3prep
My3prep
 
DM Lecture 3
DM Lecture 3DM Lecture 3
DM Lecture 3
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
 
data mining
data miningdata mining
data mining
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data Science
 
Abstract
AbstractAbstract
Abstract
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Data Mining - Presentation.pptx
Data Mining - Presentation.pptxData Mining - Presentation.pptx
Data Mining - Presentation.pptx
 

More from Saraswathi Murugan

Data mining in e commerce
Data mining in e commerceData mining in e commerce
Data mining in e commerce
Saraswathi Murugan
 
PHP Basics
PHP BasicsPHP Basics
PHP Basics
Saraswathi Murugan
 
Python modulesfinal
Python modulesfinalPython modulesfinal
Python modulesfinal
Saraswathi Murugan
 
National Identity Elements of India
National Identity Elements of IndiaNational Identity Elements of India
National Identity Elements of India
Saraswathi Murugan
 
Advanced Concepts in Python
Advanced Concepts in PythonAdvanced Concepts in Python
Advanced Concepts in Python
Saraswathi Murugan
 
Pythonppt28 11-18
Pythonppt28 11-18Pythonppt28 11-18
Pythonppt28 11-18
Saraswathi Murugan
 
FUNDAMENTALS OF PYTHON LANGUAGE
 FUNDAMENTALS OF PYTHON LANGUAGE  FUNDAMENTALS OF PYTHON LANGUAGE
FUNDAMENTALS OF PYTHON LANGUAGE
Saraswathi Murugan
 

More from Saraswathi Murugan (7)

Data mining in e commerce
Data mining in e commerceData mining in e commerce
Data mining in e commerce
 
PHP Basics
PHP BasicsPHP Basics
PHP Basics
 
Python modulesfinal
Python modulesfinalPython modulesfinal
Python modulesfinal
 
National Identity Elements of India
National Identity Elements of IndiaNational Identity Elements of India
National Identity Elements of India
 
Advanced Concepts in Python
Advanced Concepts in PythonAdvanced Concepts in Python
Advanced Concepts in Python
 
Pythonppt28 11-18
Pythonppt28 11-18Pythonppt28 11-18
Pythonppt28 11-18
 
FUNDAMENTALS OF PYTHON LANGUAGE
 FUNDAMENTALS OF PYTHON LANGUAGE  FUNDAMENTALS OF PYTHON LANGUAGE
FUNDAMENTALS OF PYTHON LANGUAGE
 

Recently uploaded

Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 

Recently uploaded (20)

Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 

Data mining

  • 1. We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need. One emerging data repository architecture is the data warehouse . This is a repository of multiple heterogeneous data sources. Data mining turns a large collection of data into knowledge. A search engine (e.g.,Google) receives hundreds of millions of queries every day. Each query can be viewed as a transaction where the user describes her or his information need. Data Mining 11/9/20191 Data Mining and Data Preprocessing
  • 2. We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need. One emerging data repository architecture is the data warehouse . This is a repository of multiple heterogeneous data sources. Data mining turns a large collection of data into knowledge. A search engine (e.g.,Google) receives hundreds of millions of queries every day. Each query can be viewed as a transaction where the user describes her or his information need. Data Mining Introduction 11/9/20192 Data Mining and Data Preprocessing
  • 3. Data Collection and Database Creation (1960s and earlier) Primitive file processing Data Mining Introduction 11/9/20193 Data Mining and Data Preprocessing
  • 4. Database Management Systems (1970s to early 1980s) Hierarchical and network database systems Relational database systems Data modeling: entity-relationship models, etc. Indexing and accessing methods Data Mining Introduction 11/9/20194 Data Mining and Data Preprocessing
  • 5. Query languages: SQL, etc. User interfaces, forms, and reports Query processing and optimization Transactions, concurrency control, and recovery Online transaction processing (OLTP) Data Mining Introduction 11/9/20195 Data Mining and Data Preprocessing
  • 6. Advanced Database Systems (mid-1980s to present) Advanced data models: extended-relational, object relational, deductive, etc. Managing complex data: spatial, temporal, multimedia, sequence and structured, scientific, engineering, moving objects, etc. Data Mining Introduction 11/9/20196 Data Mining and Data Preprocessing
  • 7. Data streams and cyber-physical data systems Web-based databases (XML, semantic web) Managing uncertain data and data cleaning Integration of heterogeneous sources Text database systems and integration with information retrieval Data Mining Introduction 11/9/20197 Data Mining and Data Preprocessing
  • 8. Extremely large data management Database system tuning and adaptive systems Advanced queries: ranking, skyline, etc. Cloud computing and parallel data processing Issues of data privacy and security Data Mining Introduction 11/9/20198 Data Mining and Data Preprocessing
  • 9. Advanced Data Analysis (late- 1980s to present) Data warehouse and OLAP Data mining and knowledge discovery: classification, clustering, outlier analysis, association and correlation, comparative summary, discrimination analysis, pattern discovery, trend and deviation analysis, etc. Mining complex types of data: streams, sequence, text, spatial, temporal, multimedia, Web, networks, etc. Data Mining Introduction 11/9/20199 Data Mining and Data Preprocessing
  • 10. Data mining applications: business, society, retail, banking, telecommunications, science and engineering, blogs, daily life, etc. Data mining and society: invisible data mining, privacy-preserving data mining, mining social and information networks, recommender systems, etc. One emerging data repository architecture is the data warehouse. This is a repository of multiple heterogeneous data sources Data Mining Introduction 11/9/201910 Data Mining and Data Preprocessing
  • 11. Data Mining The process that finds a small set of precious nuggets from a great deal of raw material. Knowledge mining from data. Also known as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. The process of knowledge discovery. Knowledge Discovery from Data, or KDD 11/9/201911 Data Mining and Data Preprocessing
  • 12. 1. Data cleaning (to remove noise and inconsistent data) 2. Data integration (where multiple data sources may be combined) 3. Data selection (where data relevant to the analysis task are retrieved from the database) 4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations) Knowledge Discovery from Data, or KDD 11/9/201912 Data Mining and Data Preprocessing
  • 13. 5. Data mining (an essential process where intelligent methods are applied to extract data patterns) 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users) Knowledge Discovery from Data, or KDD 11/9/201913 Data Mining and Data Preprocessing
  • 14. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically. Knowledge Discovery from Data, or KDD 11/9/201914 Data Mining and Data Preprocessing
  • 15. Knowledge Discovery from Data, or KDD 11/9/201915 Data Mining and Data Preprocessing Figure 1.1 Data mining as a step in the process of knowledge discovery.
  • 16. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically. Knowledge Discovery from Data, or KDD 11/9/201916 Data Mining and Data Preprocessing
  • 17. Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size. They origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. The data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results. To improve the efficiency and ease of the mining process. Data Preprocessing 11/9/201917 Data Mining and Data Preprocessing
  • 18. There are several data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in data. Data integration merges data from multiple sources into a coherent data store such as a data warehouse. Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms. Data Preprocessing 11/9/201918 Data Mining and Data Preprocessing
  • 19. Data Quality Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality 1. Accuracy 2. Completeness 3. Consistency 4. Timeliness 5. Believability and Interpretability. Data Preprocessing 11/9/201919 Data Mining and Data Preprocessing
  • 20. Inaccurate Data: The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information. Data Preprocessing 11/9/201920 Data Mining and Data Preprocessing
  • 21. Incomplete Data: Incomplete data can occur for a number of reasons.  Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data Preprocessing 11/9/201921 Data Mining and Data Preprocessing
  • 22. Inconsistent Data: Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Timeliness: The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality. Data Preprocessing 11/9/201922 Data Mining and Data Preprocessing
  • 23. Believability : It reflects how much the data are trusted by users. Interpretability: It reflects how easy the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability. Data Preprocessing 11/9/201923 Data Mining and Data Preprocessing
  • 24. Fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Missing Values Ignore the tuple: This is usually done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values. Data Cleaning 11/9/201924 Data Mining and Data Preprocessing
  • 25. Fill in the missing value manually: Time consuming and may not be feasible given a large data set with many missing values. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant such as a label like “Unknown”.  Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value. Use the attribute mean or median for all samples belonging to the same class as the given tuple. Use the most probable value to fill in the missing value. Data Cleaning 11/9/201925 Data Mining and Data Preprocessing
  • 26. Noisy Data: Random error or variance in a measured variable. Data smoothing techniques: Binning: These methods smooth a sorted data value by consulting its “neighbor- hood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Binning methods consult the neighborhood of values which perform local smoothing. Data Cleaning 11/9/201926 Data Mining and Data Preprocessing
  • 27. Binning methods for data smoothing: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34 Partition into (equal-frequency) bins: Bin 1: 4, 8, 15 Bin 2: 21, 21, 24 Bin 3: 25, 28, 34 Smoothing by bin means: Bin 1: 9, 9, 9 Bin 2: 22, 22, 22 Bin 3: 29, 29, 29 Smoothing by bin boundaries: Bin 1: 4, 4, 15 Bin 2: 21, 21, 24 Bin 3: 25, 25, 34 Data Cleaning 11/9/201927 Data Mining and Data Preprocessing
  • 28. In smoothing by bin medians , the each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. Data Cleaning 11/9/201928 Data Mining and Data Preprocessing
  • 29. Regression: A technique that con- forms data values to a function. Linear regression involves finding the “best” line to fit two attributes . One attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved . The data are fit to a multidimensional surface. Data Cleaning 11/9/201929 Data Mining and Data Preprocessing
  • 30. Outlier analysis: Outliers may be detected by clustering. The similar values are organized into groups, or “clusters.”. The values that fall outside of the set of clusters may be considered outliers. Data Cleaning 11/9/201930 Data Mining and Data Preprocessing
  • 31. Data Cleaning 11/9/201931 Data Mining and Data Preprocessing Figure 1.2 A 2-D customer data plot with respect to customer locations in a city, showing three data clusters. Outliers may be detected as values that fall outside of the cluster sets.
  • 32. Data Cleaning as a process: Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors and data decay. Discrepancies may also arise from inconsistent data representations and inconsistent use of codes. Other sources of discrepancies include errors in instrumentation devices that record data and system errors.  There may also be inconsistencies due to data integration. Data Cleaning 11/9/201932 Data Mining and Data Preprocessing
  • 33. Discrepancy Detection Methods: Field overloading is another error source that typically results when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes. A unique rule says that each value of the given attribute must be different from all other values for that attribute. A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers). Data Cleaning 11/9/201933 Data Mining and Data Preprocessing
  • 34. Discrepancy Detection Methods: A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition. Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data.. Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools. For example, they may employ statistical analysis to find correlations, or clustering to identify outliers. Data Cleaning 11/9/201934 Data Mining and Data Preprocessing
  • 35. Commercial tools can assist in the data transformation step. Data migration tools allow simple transformations to be specified such as to replace the string “gender” by “sex.” ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI). Data Cleaning 11/9/201935 Data Mining and Data Preprocessing
  • 36. Thank You 11/9/201936 Data Mining and Data Preprocessing