Storing and Managing Information
Table of data
Database management Systems (DBMS)
Storage and retrieval of properties of objects
Spreadsheets
Manipulations of and calculations with the data in the table
Each row is a particular object
Each column is a property associated with that object
Two examples/paradigms of management systems
Database Management System (DBMS)
Organizes data
in sets of
tables
Relational Database Management System
(RDBMS)
Table A
Name        Address         Parcel #
John Smith  18 Lawyers Dr.  756554
T. Brown    14 Summers Tr.  887419

Table B
Parcel #  Assessed Value
887419    152,000
446397    100,000
Provides relationships between data in the tables
Using SQL- Structured Query Language
• SQL is a standard database query language, adopted by most
‘relational’ databases
• Provides syntax for data:
– Definition
– Retrieval
– Functions (COUNT, SUM, MIN, MAX, etc.)
– Updates and Deletes
• SELECT list FROM table WHERE condition
• list - a list of items or * for all items
o WHERE - a logical expression limiting the number of records selected
o can be combined with Boolean logic: AND, OR, NOT
o ORDER BY may be used to sort the results
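As a concrete illustration of the syntax above, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The table and column names simply mirror the parcel example from the previous slide and are illustrative, not a fixed schema.

```python
import sqlite3

# In-memory database mirroring the parcel example above (names are illustrative).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE owners (name TEXT, address TEXT, parcel INTEGER)")
cur.execute("CREATE TABLE assessments (parcel INTEGER, assessed_value INTEGER)")
cur.executemany("INSERT INTO owners VALUES (?, ?, ?)",
                [("John Smith", "18 Lawyers Dr.", 756554),
                 ("T. Brown", "14 Summers Tr.", 887419)])
cur.executemany("INSERT INTO assessments VALUES (?, ?)",
                [(887419, 152000), (446397, 100000)])

# SELECT list FROM table WHERE condition, with Boolean logic and ORDER BY.
# The JOIN uses the Parcel # relationship between Table A and Table B.
cur.execute("""
    SELECT o.name, a.assessed_value
    FROM owners o JOIN assessments a ON o.parcel = a.parcel
    WHERE a.assessed_value > 100000 AND o.name NOT LIKE 'Z%'
    ORDER BY a.assessed_value DESC
""")
print(cur.fetchall())          # [('T. Brown', 152000)]

# Aggregate functions: COUNT, SUM, MIN, MAX.
cur.execute("SELECT COUNT(*), SUM(assessed_value) FROM assessments")
print(cur.fetchone())          # (2, 252000)
```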
Spreadsheets
Every row is
a different “object”
with a set of properties
Every column is
a different property
of the row object
Spreadsheet
Organization of elements
Column A Column B Column C
Row 1
Row 2
Row 3
Row and column
indices
Cells with addresses
A7 B4 C10 D5
Accessing each cell
Spreadsheet Formulas
Formula: Combination of
values or cell references
and mathematical
operators such as +, -, /, *
The formula displays in
the entry bar. This
formula is used to add the
values in the four cells.
The sum is displayed in
cell B7.
The results of a formula
display in the cell.
With cell, row and column functions
e.g., AVERAGE, SUM, MIN, MAX
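A minimal sketch of how such formulas behave, with cells modeled as a Python dictionary. The specific addresses (B3:B6 feeding B7) are assumptions for illustration, since the slide only says that four cells are summed into B7.

```python
# Illustrative cell values; the slide does not give the actual numbers.
cells = {"B3": 10, "B4": 20, "B5": 30, "B6": 40}

def cell_range(start, end):
    """Expand a single-column range like B3:B6 into a list of cell addresses."""
    col = start[0]
    return [f"{col}{row}" for row in range(int(start[1:]), int(end[1:]) + 1)]

def spreadsheet_sum(start, end):
    return sum(cells[addr] for addr in cell_range(start, end))

def spreadsheet_average(start, end):
    addrs = cell_range(start, end)
    return sum(cells[a] for a in addrs) / len(addrs)

cells["B7"] = spreadsheet_sum("B3", "B6")   # the formula's result "displays" in B7
print(cells["B7"])                          # 100
print(spreadsheet_average("B3", "B6"))      # 25.0
```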
Visualizing data:
Charts
Making Sense of Knowledge
Time flies like an arrow (proverb)
Fruit flies like a banana (Groucho Marx)
There are semantics and context behind all words
Flies:
1. The act of flying
2. The insect
Like:
1. Similar to
2. Are fond of
There is also the elusive “Common Sense”
1. One type of fly, the fruit fly, is fond of bananas
2. Fruit, in general, flies through the air just like a banana
3. One type of fly, the fruit fly, is just like a banana
A bit complicated because we are speaking metaphorically:
time is not really an object, like a bird, that flies
Translation is not just a one-to-one search in the dictionary
Complex searches are not just searches for individual words
Google Translate
Adding Semantics:
Ontologies
Concept
conceptual entity of the domain
Attribute
property of a concept
Relation
relationship between concepts
or properties
Axiom
coherent description relating concepts, properties, and
relations via logical expressions
[Diagram] isA hierarchy (taxonomy): Student isA Person, Professor isA Person
Attributes: name, email (Person); student nr. (Student); research field; topic, lecture nr. (Lecture)
Relations: attends (Student → Lecture), holds (Professor → Lecture)
Structuring of:
• Background Knowledge
• “Common Sense” knowledge
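A minimal sketch of this taxonomy as Python classes, under the usual reading of the diagram: the isA hierarchy becomes inheritance, attributes become fields (placing research field on Professor is an assumption), and attends/holds become references to Lecture. The closing axiom is my own illustrative example, not one from the slide.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Person:                      # concept
    name: str                      # attributes
    email: str

@dataclass
class Lecture:                     # concept
    topic: str
    lecture_nr: int

@dataclass
class Student(Person):             # Student isA Person
    student_nr: int = 0
    attends: List[Lecture] = field(default_factory=list)   # relation: attends

@dataclass
class Professor(Person):           # Professor isA Person
    research_field: str = ""
    holds: List[Lecture] = field(default_factory=list)     # relation: holds

# Axioms can be expressed as checks over instances, e.g. (illustrative):
# "every lecture a student attends is held by some professor".
def axiom_lectures_are_held(students, professors):
    held = {id(lec) for p in professors for lec in p.holds}
    return all(id(lec) in held for s in students for lec in s.attends)

ml = Lecture(topic="Machine Learning", lecture_nr=1)                       # made-up instances
alice = Student(name="Alice", email="alice@example.org", student_nr=42, attends=[ml])
bob = Professor(name="Bob", email="bob@example.org", research_field="AI", holds=[ml])
print(axiom_lectures_are_held([alice], [bob]))   # True
```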
Structure of an Ontology
Ontologies typically have two distinct components:
Names for important concepts in the domain
– Elephant is a concept whose members are a kind of animal
– Herbivore is a concept whose members are exactly those animals who eat
only plants or parts of plants
– Adult_Elephant is a concept whose members are exactly those elephants
whose age is greater than 20 years
Background knowledge/constraints on the domain
– Adult_Elephants weigh at least 2,000 kg
– All Elephants are either African_Elephants or Indian_Elephants
– No individual can be both a Herbivore and a Carnivore
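A minimal sketch of the background-knowledge constraints above written as runnable checks over instance data; the individual, its class memberships, and the numbers are made up for illustration.

```python
# Each individual carries a set of concept memberships plus data values.
individuals = {
    "Dumbo": {"classes": {"Elephant", "Adult_Elephant", "Herbivore", "African_Elephant"},
              "weight_kg": 4000, "age": 35},
}

def satisfies_constraints(ind):
    c = ind["classes"]
    if "Adult_Elephant" in c and ind["weight_kg"] < 2000:
        return False                                    # Adult_Elephants weigh at least 2,000 kg
    if "Elephant" in c and not ({"African_Elephant", "Indian_Elephant"} & c):
        return False                                    # all Elephants are African or Indian
    if {"Herbivore", "Carnivore"} <= c:
        return False                                    # nothing is both Herbivore and Carnivore
    return True

print(all(satisfies_constraints(i) for i in individuals.values()))   # True
```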
Ontology Definition
Formal, explicit specification of a shared conceptualization [Gruber93]
– commonly accepted understanding
– conceptual model of a domain (ontological theory)
– unambiguous terminology definitions
– machine-readability with computational semantics
The Semantic Web
Ontology implementation
"The Semantic Web is an extension of the current web in which information is given
well-defined meaning, better enabling computers and people to work in
cooperation." -- Tim Berners-Lee
“the wedding cake” of Semantic Web layers
Abstracting
Knowledge
Several levels and reasons to abstract knowledge
Feature abstraction
Simplifying “reality” so the knowledge can be used in
computer data structures and algorithms
Concept Abstraction
Organizing and making sense of the immense amount of
data/knowledge we have
Modeling abstraction
Making usable and predictive models of reality
Prediction Based on Bayes’ Theorem
• Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes’ theorem
• Informally, this can be viewed as
posterior = likelihood x prior / evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
P(H|X) = P(X|H) P(H) / P(X)
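A one-line numeric check of the formula; the probabilities below are made up purely to illustrate posterior = likelihood x prior / evidence.

```python
# Bayes' theorem with illustrative (made-up) numbers.
prior      = 0.30          # P(H)
likelihood = 0.80          # P(X|H)
evidence   = 0.50          # P(X)

posterior = likelihood * prior / evidence   # P(H|X)
print(posterior)                            # 0.48
```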
Naïve Bayes Classifier
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
P(buys_computer = “yes”)
= 9/14 = 0.643
P(buys_computer = “no”)
= 5/14= 0.357
X = (age <= 30 , income = medium, student =
yes, credit_rating = fair)
Naïve Bayes Classifier
[Same 14-row training table as on the previous slide]
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Want to classify
X =
(age <= 30 ,
income = medium,
student = yes,
credit_rating = fair)
Will X buy a computer?
Naïve Bayes Classifier
Key: Conditional probability
P(X|Y): the probability that X is true, given Y
P(not rain | sunny) > P(rain | sunny)
P(not rain | not sunny) < P(rain | not sunny)
Classifier: Have to include the probability of the condition
P(not rain | sunny)*P(sunny)
How often did it really not rain, given that it was actually sunny
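A tiny numeric illustration of why the probability of the condition matters; all the numbers below are made up for the example.

```python
# Even if "not rain" is very likely given sunny weather, the joint score
# P(not rain | sunny) * P(sunny) also depends on how often it is sunny at all.
p_sunny = 0.30
p_not_rain_given_sunny = 0.90
p_not_sunny = 1.0 - p_sunny
p_not_rain_given_not_sunny = 0.40

print(p_not_rain_given_sunny * p_sunny)          # 0.27
print(p_not_rain_given_not_sunny * p_not_sunny)  # 0.28 -> larger, despite the weaker conditional
```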
Naïve Bayes Classifier
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Want to classify
X =
(age <= 30 ,
income = medium,
student = yes,
credit_rating = fair)
Will X buy a computer?
Which “conditional probability” is greater?
P(X|C1)*P(C1) > P(X|C2)*P(C2) → X will buy a computer
P(X|C1)*P(C1) < P(X|C2)*P(C2) → X will not buy a computer
Naïve Bayes Classifier
[Same 14-row training table as above]
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
X =
(age <= 30 ,
income = medium,
student = yes,
credit_rating = fair)
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
Naïve Bayes Classifier
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
Naïve Bayes Classifier
P(X|Ci) :
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, since 0.028 is the bigger value, X belongs to the class buys_computer = “yes”
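The same conclusion can be reproduced directly from the 14-row training table. This is a minimal sketch of the naïve Bayes scoring step (product of per-attribute conditionals times the class prior), not a general-purpose implementation.

```python
# Rows of the training table: (age, income, student, credit_rating, buys_computer).
rows = [
    ("<=30", "high",   "no",  "fair",      "no"),
    ("<=30", "high",   "no",  "excellent", "no"),
    ("31…40", "high",  "no",  "fair",      "yes"),
    (">40",  "medium", "no",  "fair",      "yes"),
    (">40",  "low",    "yes", "fair",      "yes"),
    (">40",  "low",    "yes", "excellent", "no"),
    ("31…40", "low",   "yes", "excellent", "yes"),
    ("<=30", "medium", "no",  "fair",      "no"),
    ("<=30", "low",    "yes", "fair",      "yes"),
    (">40",  "medium", "yes", "fair",      "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high",  "yes", "fair",      "yes"),
    (">40",  "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")       # the sample X to classify

def score(cls):
    subset = [r for r in rows if r[4] == cls]
    prior = len(subset) / len(rows)                   # P(Ci)
    likelihood = 1.0
    for i, value in enumerate(x):                     # P(X|Ci): product of per-attribute conditionals
        likelihood *= sum(r[i] == value for r in subset) / len(subset)
    return likelihood * prior                         # P(X|Ci) * P(Ci)

print(round(score("yes"), 3), round(score("no"), 3))  # 0.028 0.007 -> predict "yes"
```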
Decision Tree Classifier
Ross Quinlan
[Scatter plot of the insect training examples: Antenna Length vs. Abdomen Length]
Abdomen Length > 7.1?
no: Antenna Length > 6.0?
  no: Grasshopper
  yes: Katydid
yes: Katydid
[Dichotomous key: questions such as “Antennae shorter than body?”, “3 Tarsi?”, and “Foretibia has ears?” lead to the classes Grasshopper, Cricket, Katydids, and Camel Cricket]
Decision trees predate computers
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree Classification
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized
in advance)
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
How do we construct the decision tree?
Information Gain as A Splitting Criteria
• Select the attribute with the highest information gain (information gain is the
expected reduction in entropy).
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of
class N
– The amount of information needed to decide if an arbitrary example in S
belongs to P or N is defined as
E(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
0 log(0) is defined as 0
Information Gain in Decision Tree Induction
• Assume that using attribute A, a current set will be
partitioned into some number of child sets
• The encoding information that would be gained by
branching on A
Gain(A) = E(current set) - E(all child sets)
Note: entropy is at its minimum (zero) if the collection of objects is completely uniform (all one class)
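A small Python sketch of these two definitions (the function names are my own). entropy(p, n) takes the two class counts of a set; info_gain subtracts the size-weighted child entropies from the parent entropy. The example call uses the class counts of the hair-length split worked out on the slides that follow.

```python
from math import log2

# Entropy of a set holding p examples of class P and n of class N,
# with 0*log2(0) taken as 0 (as noted above).
def entropy(p, n):
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c > 0)

# Gain(A) = E(current set) minus the size-weighted entropies of the child sets
# produced by splitting on A. `parent` and each child are (p, n) count pairs.
def info_gain(parent, children):
    total = sum(p + n for p, n in children)
    weighted = sum(((p + n) / total) * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted

print(round(entropy(4, 5), 4))                        # 0.9911, the starting entropy below
print(round(info_gain((4, 5), [(1, 3), (3, 2)]), 4))  # 0.0911, the Hair Length <= 5 split
```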
Person  Hair Length  Weight  Age  Class
Homer   0”           250     36   M
Marge   10”          150     34   F
Bart    2”           90      10   M
Lisa    6”           78      8    F
Maggie  4”           20      1    F
Abe     1”           170     70   M
Selma   8”           160     41   F
Otto    10”          180     38   M
Krusty  6”           200     45   M
Comic   8”           290     38   ?
Hair Length <= 5?
yes no
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9)
= 0.9911
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710 ) = 0.0911
Gain(A) = E(current set) - E(all child sets)
Let us try splitting on
Hair length
Weight <= 160?
yes no
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9)
= 0.9911
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0 ) = 0.5900
Gain(A) = E(current set) - E(all child sets)
Let us try splitting on
Weight
age <= 40?
yes no
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9)
= 0.9911
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183 ) = 0.0183
Gain(A) = E(current set) - E(all child sets)
Let us try splitting on
Age
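For completeness, a self-contained check that applies the same definitions to the nine labeled people in the table above and reproduces all three gains (0.0911, 0.5900, 0.0183). The helper names are my own; the thresholds are the ones tried on the slides.

```python
from math import log2

# The nine labeled people from the table: (hair length, weight, age, class).
people = [
    (0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
    (6, 78, 8, "F"),   (4, 20, 1, "F"),    (1, 170, 70, "M"),
    (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M"),
]

def entropy(group):
    f = sum(1 for *_, c in group if c == "F") / len(group)
    return sum(-q * log2(q) for q in (f, 1 - f) if q > 0)

def gain(attr_index, threshold):
    left = [p for p in people if p[attr_index] <= threshold]
    right = [p for p in people if p[attr_index] > threshold]
    children = sum(len(g) / len(people) * entropy(g) for g in (left, right) if g)
    return entropy(people) - children

print(round(gain(0, 5), 4))     # Hair Length <= 5  -> 0.0911
print(round(gain(1, 160), 4))   # Weight <= 160     -> 0.59
print(round(gain(2, 40), 4))    # Age <= 40         -> 0.0183
```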
Weight <= 160?
yes: Hair Length <= 2?
no: Male
Of the 3 features we had, Weight was best.
But while people who weigh over 160 are
perfectly classified (as males), the under 160
people are not perfectly classified… So we
simply recurse!
This time we find that we can split on
Hair length, and we are done!
Weight <= 160?
yes: Hair Length <= 2?
  yes: Male
  no: Female
no: Male
We don't need to keep the data around, just the test conditions.
How would these people be classified?
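One way to answer: encode the finished tree as a plain function and run cases through it. The inputs below are illustrative, except for the unlabeled Comic row (hair 8”, weight 290) from the earlier table, which this tree classifies as Male.

```python
# The finished tree from the last slide, written as a plain function:
# anyone over 160 lb is classified Male; otherwise the hair-length test decides.
def classify(weight, hair_length):
    if weight <= 160:
        return "Male" if hair_length <= 2 else "Female"
    return "Male"

print(classify(290, 8))    # Male   (the Comic row from the table)
print(classify(150, 1))    # Male   (illustrative: light, short hair)
print(classify(150, 10))   # Female (illustrative: light, long hair)
```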