Module - 3
By
Dr.Ramkumar.T
ramkumar.thirunavukarasu@vit.ac.in
Data mining knowledge representation
What Defines a Data Mining Task?
• Task relevant data: where and how to retrieve the data
to be used for mining
• Background knowledge: Concept hierarchies
• Interestingness measures: informal and formal
selection techniques to be applied to the output
knowledge
• Representing input data and output knowledge: the
structures used to represent the input and the output of
the data mining techniques
• Visualization techniques: needed to best view and
document the results of the whole process
Task relevant data
• Database or data warehouse name: where to find the
data
• Database tables or data warehouse cubes
• Condition for data selection, relevant attributes or
dimensions and data grouping criteria: all this is used in
the SQL query to retrieve the data
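As a minimal sketch of how these elements come together in one SQL query, the example below uses Python's built-in sqlite3 module. The table name sales, its columns and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# Hypothetical in-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "laptop", 2022, 1200.0),
        ("North", "phone", 2023, 800.0),
        ("South", "laptop", 2023, 1500.0),
        ("South", "laptop", 2023, 700.0),
        ("North", "laptop", 2023, 900.0),
    ],
)

# Table: sales. Relevant attributes: region, amount.
# Selection condition: year = 2023. Grouping criterion: region.
query = """
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    WHERE year = ?
    GROUP BY region
    ORDER BY region
"""
task_relevant = conn.execute(query, (2023,)).fetchall()
print(task_relevant)
conn.close()
```

The database name maps to the connection, the table to the FROM clause, the selection condition to WHERE, and the grouping criterion to GROUP BY.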
Background knowledge: Concept hierarchies
• The concept hierarchies are induced by a partial order
over the values of a given attribute. Depending on the
type of the ordering relation, we distinguish several
types of concept hierarchies.
– Schema hierarchy - Relating concept generality. The
ordering reflects the generality of the attribute values,
Example : street < city < state < country.
– Set-grouping hierarchy - The ordering relation is the subset
relation (⊆). Applies to set values. Example: {13, ..., 39} =
young; {13, ..., 19} = teenage; {13, ..., 19} ⊆ {13, ..., 39} ⇒
teenage < young.
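The two hierarchies above can be sketched in Python as follows. The function names below and is_more_specific are hypothetical helpers; the age ranges and the street < city < state < country ordering come from the examples above.

```python
# Schema hierarchy: levels listed from most specific to most general.
SCHEMA_LEVELS = ["street", "city", "state", "country"]


def below(lower, upper, levels=SCHEMA_LEVELS):
    """Return True when `lower` is strictly more specific than `upper`."""
    return levels.index(lower) < levels.index(upper)


# Set-grouping hierarchy: concepts are sets of attribute values.
young = set(range(13, 40))    # {13, ..., 39}
teenage = set(range(13, 20))  # {13, ..., 19}


def is_more_specific(a, b):
    """Return True when set `a` is a proper subset of set `b`."""
    return a < b  # for Python sets, `<` is the proper-subset test


print(below("street", "country"))        # schema ordering
print(teenage <= young)                  # subset relation
print(is_more_specific(teenage, young))  # teenage < young
```

Python's set comparison operators implement the subset relation directly, so the ordering does not have to be stored separately.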
Background knowledge: Concept hierarchies
– Operation-derived hierarchy - Produced by applying an
operation (encoding, decoding, information extraction)
Example : markovz@cs.ccsu.edu
instantiates the hierarchy user-name < department < university
– Rule-based hierarchy - Using rules to define the partial
order.
Example : if antecedent → consequent
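The operation-derived hierarchy from the e-mail example can be sketched as a small extraction function. The name email_hierarchy is hypothetical, and the code assumes every address has the form user@department.university.tld, which does not hold for e-mail addresses in general.

```python
def email_hierarchy(address):
    """Extract [user-name, department, university] from an e-mail address.

    Assumption: the domain has exactly three labels,
    department.university.tld, as in the example address.
    """
    user, _, domain = address.partition("@")
    labels = domain.split(".")
    if not user or len(labels) != 3:
        raise ValueError(f"unexpected address format: {address!r}")
    department, university, _tld = labels
    # Ordered from most specific to most general.
    return [user, department, university]


levels = email_hierarchy("markovz@cs.ccsu.edu")
print(" < ".join(levels))
```

Here the information-extraction operation (splitting the address) produces the ordering itself; no separate table of concepts is needed.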
Interestingness measures
• Criteria to evaluate hypotheses (knowledge extracted from data
when applying data mining techniques).
Example :
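One widely used pair of formal measures, for association rules of the form antecedent → consequent, is support and confidence. The sketch below is an illustration under that assumption; these specific measures, the function names and the sample transactions are not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

A rule would then be reported only if both values exceed user-chosen thresholds, which is one way of applying a formal selection technique to the output knowledge.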
Representing input data and output
knowledge
• Concepts (classes, categories, hypotheses): things to be
mined/learned
• Instances (examples, tuples, transactions)
• Attributes (Features)
• Output Knowledge Representations
Visualization techniques
• Visualization techniques enable us to visually identify
trends, ranges, frequency distributions, relationships,
outliers and make comparisons
• Some of the common graphs used in exploratory data
analysis and data mining are
• Frequency Polygrams and Histograms
• Scatterplots
• Box Plots
• Multiple Graphs
Frequency polygrams
• Frequency polygrams - Plot information according to
the number of observations reported for each value (or
ranges of values) for a particular variable (usually for
continuous variables)
• The shape of the plot reveals trends
Frequency polygram displaying a count for cars per year
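The counts behind such a plot can be computed as in the sketch below. The list of model years is hypothetical data, and the text rendering stands in for a real line plot.

```python
from collections import Counter

# Hypothetical observations: the model year of each car in a data set.
years = [2019, 2020, 2020, 2021, 2021, 2021, 2022]

# Number of observations reported for each value of the variable.
counts = Counter(years)

# Sorted (value, count) pairs are the points that the plot connects.
points = sorted(counts.items())

for year, n in points:
    print(year, "*" * n)
```

Connecting the (value, count) points with line segments gives the shape from which trends are read.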
Histograms
• Histograms provide a clear way of viewing the
frequency distribution for a single variable.
• Variables that are not continuous can also be shown as a
histogram
• The length of the bar is proportional to the size of the
group
• For continuous variables, a histogram can be very
useful in displaying the frequency distribution.
• The central values, the shape, the range of values as
well as any outliers can be identified through
Histograms
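For a continuous variable, the bar lengths come from counting observations per range of values. The sketch below assumes equal-width bins; the function name histogram_counts, the bin width and the sample values are illustrative choices, not taken from the slides.

```python
def histogram_counts(values, bin_width, start=0.0):
    """Count observations per bin; keys are the lower edges of the bins."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        lower = start + index * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical measurements of a variable called Length.
lengths = [1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 9.7]

bins = histogram_counts(lengths, bin_width=1.0)
for lower, n in bins.items():
    print(f"[{lower:.1f}, {lower + 1.0:.1f}): {'#' * n}")
```

The single observation in the bin starting at 9.0, far from the rest, is the kind of outlier a histogram makes visible.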
Various Histogram representations
Histogram showing categorical variable Diabetes
Histogram representing counts for ranges in the variable
Length
Histogram showing an outlier
Scatterplots
• Scatterplots can be used to identify whether any
relationship exists between two continuous variables
based on the ratio or interval scales
• The two variables are plotted on the x- and y-axes. Each
point displayed on the scatterplot is a single observation
• The position of the point is determined by the value of
the two variables.
• Scatterplots allow you to see the type of relationship
that may exist between the two variables
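A scatterplot is read visually, but a numeric summary can complement it. The sketch below computes the Pearson correlation coefficient, which is a technique introduced here for illustration and not part of the slides; it measures only linear association, so a clear nonlinear pattern can still give a value near zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations: each (x, y) pair is one point on the plot.
negative = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
nonlinear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(negative, nonlinear)
```

The second data set follows y = x squared exactly, yet its coefficient is zero, which is why the plot itself should still be inspected.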
Various Scatter Plot representations
Scatterplot showing an outlier (X)
Scatterplot showing a nonlinear relationship
Scatterplot showing no discernible relationship
Scatterplot showing a negative relationship
Box Plots
• Box plots (also called box-and-whisker plots) provide a succinct
summary of the overall distribution for a variable
• Six values are displayed: the lower extreme value, the lower
quartile, the median, the upper quartile, the upper extreme and
the mean
• The values on the box plot are defined as follows:
– Lower extreme: The lowest value for the variable.
– Lower quartile: The point below which 25% of all observations fall.
– Median: The point below which 50% of all observations fall.
– Upper quartile: The point below which 75% of all observations fall.
– Upper extreme: The highest value for the variable.
– Mean: The average value for the variable.
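The values defined above can be computed with Python's statistics module, as in this minimal sketch. The function name box_plot_values and the sample data are hypothetical, and method="inclusive" is one of several quartile conventions; other tools may give slightly different quartiles for the same data.

```python
import statistics


def box_plot_values(values):
    """Return the six values shown on the box plot described above."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "lower_extreme": data[0],
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "upper_extreme": data[-1],
        "mean": statistics.mean(data),
    }


# Hypothetical observations for a variable such as MPG.
mpg = [10, 15, 20, 25, 30]
summary = box_plot_values(mpg)
for name, value in summary.items():
    print(f"{name}: {value}")
```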
Box Plot representation
A standard Box Plot
Box Plot for the Variable MPG
