Deriving Knowledge from Data at Scale
Good data preparation is key to
producing valid and reliable models…
Lecture 7 Agenda
Heads Up, Lecture 8 Agenda
Course Project
https://www.kaggle.com/c/springleaf-marketing-response
not
Determine whether to send a direct mail piece to a customer
The Data
The Rules
what is the data telling you
The Tragedy of the Titanic
Does the new data point x* exactly match a previous point xi?
If so, assign it to the same class as xi
Otherwise, just guess.
This is the “rote” classifier
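A minimal sketch of the rote classifier described above. The passenger tuples and the fixed random seed are hypothetical, chosen only to make the example repeatable.

```python
import random

def rote_classify(x_new, training, rng=None):
    """Return the stored label on an exact match, otherwise guess at random."""
    rng = rng or random.Random(0)
    for x_i, label in training:
        if x_i == x_new:
            return label
    # No exact match: fall back to a random guess among the known labels.
    labels = sorted({label for _, label in training})
    return rng.choice(labels)

# Hypothetical passengers: (sex, pclass, age) -> survived
training = [
    (("female", 1, 29), "yes"),
    (("male", 3, 22), "no"),
]
seen = rote_classify(("female", 1, 29), training)    # exact match, so "yes"
unseen = rote_classify(("male", 2, 40), training)    # no match, so a guess
```

The classifier memorizes rather than generalizes: anything it has not seen before gets a coin flip.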
Does the new data point x* match a set of previous points xi on some specific attribute?
If so, take a vote to determine class.
Example: If most females survived, then assume every female survives
But there are lots of possible rules like this.
And an attribute can have more than two values.
If most people under 4 years old survive, then assume everyone under 4 survives
If most people with 1 sibling survive, then assume everyone with 1 sibling survives
How do we choose?
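One plausible answer, offered here as an assumption rather than the lecture's stated method: build the majority-vote rule for every attribute and keep the rule with the highest training accuracy. The ten rows below are made up for illustration; they are not the Titanic data.

```python
from collections import Counter, defaultdict

def majority_rule(rows, attribute, target):
    """Map each value of `attribute` to the most common target value among matching rows."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[attribute]][row[target]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def rule_accuracy(rows, attribute, target):
    """Training accuracy of the majority-vote rule built on `attribute`."""
    rule = majority_rule(rows, attribute, target)
    hits = sum(rule[row[attribute]] == row[target] for row in rows)
    return hits / len(rows)

# Hypothetical toy rows: (sex, pclass, survive)
raw = [
    ("female", "1", "yes"), ("female", "2", "yes"), ("female", "3", "yes"),
    ("female", "3", "no"), ("male", "1", "yes"), ("male", "1", "no"),
    ("male", "2", "no"), ("male", "2", "no"), ("male", "3", "no"),
    ("male", "3", "no"),
]
rows = [dict(zip(("sex", "pclass", "survive"), r)) for r in raw]

scores = {a: rule_accuracy(rows, a, "survive") for a in ("sex", "pclass")}
best = max(scores, key=scores.get)
print(best, scores)
```

Scoring on the training data alone favors attributes with many distinct values, so a held-out check is a sensible companion to this kind of selection.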
IF sex=‘female’ THEN survive=yes
ELSE IF sex=‘male’ THEN survive=no
Confusion matrix:
   no  yes   <-- classified as
  468  109 |  no
   81  233 |  yes
(468 + 233) / (468 + 109 + 81 + 233) = 701 / 891 ≈ 79% correct (and 21% incorrect)
Not bad!
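The accuracy arithmetic can be checked with a short sketch. The counts are copied from the confusion matrix above; the function name is illustrative.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Counts copied from the slide; the diagonal holds the correctly classified passengers.
sex_rule = [[468, 109],
            [81, 233]]
acc = accuracy(sex_rule)
print(f"{acc:.1%} correct, {1 - acc:.1%} incorrect")
```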
IF pclass=‘1’ THEN survive=yes
ELSE IF pclass=‘2’ THEN survive=yes
ELSE IF pclass=‘3’ THEN survive=no
Confusion matrix:
   no  yes   <-- classified as
  372  119 |  no
  177  223 |  yes
(372 + 223) / (372 + 119 + 177 + 223) = 595 / 891 ≈ 67% correct (and 33% incorrect)
A little worse.
Support Vector Machines
[Figure: scatter plot of two classes of points; one marker denotes +1, the other denotes -1]
Linear classifier: x → f → y_est
f(x, w, b) = sign(w · x - b)
Estimation:
w: weight vector
x: data vector
How would you classify this data?
[Figure: the same two-class data with several candidate separating lines]
f(x, w, b) = sign(w · x - b)
Any of these would be fine…
…but which is best?
[Figure: a linear boundary widened until it touches the nearest data point]
f(x, w, b) = sign(w · x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
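A small sketch of that definition, assuming the boundary w · x - b = 0 is widened symmetrically on both sides, so the margin is twice the distance to the nearest point. The four points and both candidate boundaries are hypothetical.

```python
import math

def margin_width(w, b, points):
    """Width of the widest band centered on w . x - b = 0 that contains no point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    nearest = min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) - b) / norm
        for x in points
    )
    return 2.0 * nearest

# Hypothetical data: two points on the left, two on the right.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
centered = margin_width((1.0, 0.0), 2.0, points)   # boundary x1 = 2
shifted = margin_width((1.0, 0.0), 1.5, points)    # boundary x1 = 1.5
print(centered, shifted)
```

Both boundaries separate the two groups, but the centered one leaves twice as much room, which is the property a maximum margin classifier optimizes.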
[Figure: two classes of points, O and X, with the support vectors labeled]
Linear SVM
[Figure: the maximum-margin boundary between the +1 and -1 points]
f(x, w, b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support vectors are those data points that the margin pushes up against.
[Figure: maximum-margin boundary with the support vectors highlighted]
f(x, w, b) = sign(w · x - b)
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. LOOCV (leave-one-out cross-validation) is easy since the model is immune to removal of any non-support-vector data points.
4. Empirically it works very well.
[Figure: Class 1 and Class 2 separated by a margin m]
This is going to be a problem!
What if the data looks like that below?...
kernel function
Everything you need to know in one slide…
Choosing a kernel is probably the trickiest part of using an SVM.
• RBF is a good first option…
• It depends on your data, so try several.
• Kernels have even been developed for nonnumeric data like sequences, structures, and trees/graphs.
• It may help to use a combination of several kernels.
• Don’t touch your evaluation data while you’re trying out different kernels and parameters.
  – Use cross-validation for this if you’re short on data.
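The advice about evaluation data can be sketched as an index split. This assumes the rows are addressed by position; the 25% evaluation fraction, the five folds, and the function names are illustrative choices, not values from the lecture.

```python
import random

def split_holdout(n_rows, eval_fraction, seed):
    """Shuffle row indices once and set aside an evaluation set that is never used for tuning."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_eval = int(round(n_rows * eval_fraction))
    return indices[n_eval:], indices[:n_eval]

def k_folds(indices, k):
    """Yield (train, validation) index lists; each index is used for validation exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

development, evaluation = split_holdout(20, 0.25, seed=0)
splits = list(k_folds(development, 5))
for train, validation in splits:
    # Fit each candidate kernel/parameter setting on `train`, score it on `validation`.
    pass
```

Only after a kernel and its parameters are fixed would the model be scored once on `evaluation`.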
The complexity of the optimization problem depends only on the dimensionality of the input space, not on that of the feature space!
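A worked example of why that holds, using a degree-2 polynomial kernel on 2-D inputs. The kernel and the two input vectors are chosen purely for illustration: the kernel is evaluated in the 2-D input space, yet it equals a dot product in a 3-D feature space that is never built during training.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs: (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit 3-D feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, z)      # computed in the input space
k_explicit = dot(phi(x), phi(z))     # computed in the feature space
print(k_implicit, k_explicit)
```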
• SVM 1 learns “Output==1” vs “Output != 1”
• SVM 2 learns “Output==2” vs “Output != 2”
….
• SVM N learns “Output==N” vs “Output != N”
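A sketch of this one-vs-rest scheme. To keep the block dependency-free, each binary SVM is replaced by a hypothetical stand-in scorer based on class centroids; it is not an SVM. Picking the class whose classifier gives the highest score is a common convention assumed here, not something the slide states. The six points are made up.

```python
def train_binary(points, is_positive):
    """Stand-in binary learner (NOT an SVM): centroids of the positive and negative points."""
    def centroid(group):
        return tuple(sum(c) / len(group) for c in zip(*group))
    pos = [p for p, flag in zip(points, is_positive) if flag]
    neg = [p for p, flag in zip(points, is_positive) if not flag]
    return centroid(pos), centroid(neg)

def score(model, x):
    """Higher when x is closer to the positive centroid than to the negative one."""
    pos, neg = model
    def dist2(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return dist2(neg) - dist2(pos)

def train_one_vs_rest(points, labels):
    """Train one 'Output == k' vs 'Output != k' model per class k."""
    return {k: train_binary(points, [lab == k for lab in labels])
            for k in sorted(set(labels))}

def predict(models, x):
    """Assign the class whose one-vs-rest model is most confident."""
    return max(models, key=lambda k: score(models[k], x))

points = [(0, 0), (0, 1), (5, 0), (5, 1), (0, 5), (1, 5)]
labels = [1, 1, 2, 2, 3, 3]
models = train_one_vs_rest(points, labels)
pred = predict(models, (4.5, 0.5))
print(pred)
```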
• weka.classifiers.functions.SMO
• weka.classifiers.functions.LibSVM
• weka.classifiers.functions.SMOreg
Support Vector Machine Model, Titanic Data, Linear Kernel
Support Vector Machine Model, RBF Kernel, Titanic Data
Overfitting?
Bill Howe, UW
Gamma: a parameter that balances model complexity against accuracy
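To see what gamma does, here is a sketch of the standard RBF kernel, exp(-gamma · ||x - z||²), evaluated for one pair of hypothetical points at three illustrative gamma values. A larger gamma makes similarity fall off faster, so each training point influences a smaller neighborhood and the boundary can become more complex.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (0.0, 0.0), (1.0, 1.0)
similarities = {gamma: rbf_kernel(x, z, gamma) for gamma in (0.01, 1.0, 100.0)}
for gamma, value in similarities.items():
    print(gamma, value)
```

With a very large gamma almost every point looks dissimilar to every other point, which is one way an RBF model ends up memorizing the training set.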
How They Won It!
Lessons from data mining
the past Kaggle contests…
10 Minute Break…
Data Wrangling
gold standard
data sets
one dominant attribute even
distribution
• Rule of thumb: 5,000 or more desired
• Rule of thumb: for each attribute, 10 or more instances
• Rule of thumb: >100 for each class
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
1. Missing values
2. Outliers
3. Coding
4. Constraints
• ReplaceMissingValues
• RemoveMisclassified
• MergeTwoValues
Missing values: in the UCI Machine Learning Repository, 31 of 68 data sets are reported to have missing values. “Missing” can mean many things…
• MAR, "Missing at Random":
  – usually the best case
  – usually not true
• Non-randomly missing
• Presumed normal, so not measured
• Causally missing
  – the attribute value is missing because of other attribute values (or because of the outcome value!)
Representation matters…
[Figure: Bank Data A and Bank Data B]
Data Transformation
Simple transformations can often have a large impact on performance.
Example transformations (not all for performance improvement):
• Difference of two date attributes, distance between coordinates, …
• Ratio of two numeric (ratio-scale) attributes, average for smoothing, …
• Concatenating the values of nominal attributes
• Encoding (probabilistic) cluster membership
• Adding noise to data (for robustness tests)
• Removing data randomly or selectively
• Obfuscating the data (for anonymity)
Intuition: add features that increase class discrimination (entropy E, information gain IG)…
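A few of these transformations, sketched on a single hypothetical record. All field names and values are invented for illustration.

```python
from datetime import date

# Hypothetical bank record; none of these fields come from the lecture data.
record = {
    "opened": date(2015, 1, 10),
    "closed": date(2015, 3, 1),
    "balance": 1200.0,
    "income": 4000.0,
    "sex": "female",
    "pclass": "1",
}

features = {
    # Difference of two date attributes
    "days_open": (record["closed"] - record["opened"]).days,
    # Ratio of two numeric (ratio-scale) attributes
    "balance_to_income": record["balance"] / record["income"],
    # Concatenation of two nominal attributes
    "sex_pclass": record["sex"] + "_" + record["pclass"],
}
print(features)
```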
• Combine attributes
• Normalizing data
• Simplifying data
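As one concrete reading of "normalizing data", here is a min-max rescaling sketch. The choice of min-max over other normalizations (for example z-scores) is an assumption, and the ages are made up.

```python
def min_max(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [2, 22, 42, 82]
scaled = min_max(ages)
print(scaled)
```

Rescaling matters for distance-based methods such as an SVM with an RBF kernel, where an attribute with a large numeric range would otherwise dominate.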
Change of scale
Aggregated data tends to have less variability and potentially more
information for better predictions…
fills in the missing values for an instance with the expected
values
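A minimal sketch of imputation, taking the column mean as the simplest stand-in for an "expected value". This is an assumption: the filter this slide describes may estimate expected values differently. The ages are hypothetical.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

ages = [22.0, None, 30.0, None, 38.0]
filled = impute_mean(ages)
print(filled)
```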
Specify the new ordering…
2-last, 1
1-5, 7-last, 6
Resample: a random subsample of the instances, with or without replacement.
• To replace or not…
• The same random seed will result in the same (repeatable) sample.
• Sample size, as a percentage of the original data set size.
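The three options can be sketched with the standard library. This illustrates the idea only and is not Weka's implementation; the function and parameter names are made up.

```python
import random

def resample(rows, percent, with_replacement, seed):
    """Draw a random subsample whose size is `percent` of the original data set."""
    rng = random.Random(seed)
    n = int(round(len(rows) * percent / 100.0))
    if with_replacement:
        return [rng.choice(rows) for _ in range(n)]
    # Without replacement the sample cannot be larger than the data set.
    return rng.sample(rows, n)

rows = list(range(100))
first = resample(rows, 10, with_replacement=False, seed=1)
second = resample(rows, 10, with_replacement=False, seed=1)
bootstrap = resample(rows, 150, with_replacement=True, seed=1)
```

Reusing the seed reproduces the sample exactly, which is what makes an experiment repeatable.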
Suggestions for you to try…
Curse of Dimensionality
exponentially
In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!
• Sampling
work almost as well as using the entire
data set
the same
property (of interest) as the original set of data
• Principal Component Analysis
Find k ≤ n new features (principal components) that can best represent the data
• Works for numeric data only…
Attribute Selection
(feature selection)
What is feature selection?
DIMENSIONALITY REDUCTION
Problem: Where to focus attention?
Feature Selection, starts with you…
smallest subset of attributes
What is evaluated?        Attributes    Subsets of attributes
Evaluation method:
  Independent             Filters       Filters
  Learning algorithm                    Wrappers
list of attributes
evaluated individually
select
subset
A correlation coefficient shows the degree of linear dependence of x and y. In other words, the coefficient shows
how close two variables lie along a line. If the coefficient is equal to 1 or -1, all the points lie along a line. If the
correlation coefficient is equal to zero, there is no linear relation between x and y. However, this does not
necessarily mean that there is no relation between the two variables. There could e.g. be a non-linear relation.
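Both halves of that statement can be checked numerically. The sketch below computes the Pearson coefficient from its definition for two made-up series: an exact line, and a parabola that is fully determined by x but has no linear trend.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-2, -1, 0, 1, 2]
r_line = pearson(xs, [3 * x + 1 for x in xs])    # all points on a line
r_parabola = pearson(xs, [x * x for x in xs])    # y depends on x, but not linearly
print(r_line, r_parabola)
```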
Tab for selecting attributes in a data set…
Interface for classes that evaluate attributes…
Interface for ranking or searching for a subset of attributes…
Select CorrelationAttributeEval for Pearson Correlation…
False, doesn’t return R score
True, returns R scores;
Ranks attributes by their individual evaluations, used in
conjunction with GainRatio, Entropy, Pearson, etc…
Number of attributes to return,
-1 returns all ranked attributes;
Attributes to ignore (skip) in the
evaluation, format: [1, 3-5, 10];
Cutoff at which attributes can
be discarded, -1 no cutoff;
Predicting Self-Reported Health Status
The Data Set, NHANES_data.csv (National Health and Nutrition Examination Survey)
How would you say your health in general is?
Excellent predictor of mortality, health care utilization & disability
How I processed it…
• 4000 variables;
• Attributes with > 30% missing values removed (dropped column);
• 105 variables remaining;
• Chi-square test, variable and target, remove variables with p-value > .20;
• Impute all missing values using expectation maximization;
• 85 variables remaining;
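The column-dropping step can be sketched as follows, assuming missing values are represented as None. The four rows and their column names are hypothetical and are not taken from NHANES_data.csv.

```python
def drop_sparse_columns(rows, max_missing=0.30):
    """Drop every column whose share of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]

rows = [
    {"age": 34, "bmi": None, "glucose": 90},
    {"age": 51, "bmi": None, "glucose": None},
    {"age": 47, "bmi": 27.1, "glucose": 101},
    {"age": 29, "bmi": 22.4, "glucose": 85},
]
reduced = drop_sparse_columns(rows)
print(list(reduced[0]))
```

Here "bmi" is 50% missing and is dropped, while "glucose" at 25% missing survives for later imputation.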
Pearson Correlation Exercise…
NHANES_data.csv
• Convert the last column from numeric to nominal
• Find the top 15 features using Pearson Correlation
• OneRAttributeEval
• GainRatioAttributeEval
• InfoGainAttributeEval
• ChiSquaredAttributeEval
• ReliefFAttributeEval
• Right Click on the new line in the Result list;
• From the pop-up menu, select the item
Save reduced data…
• Save the dataset with 15 selected attributes
to file NHanesPearson.arff
• Switch to the Preprocess mode in Explorer
• Click on Open file… and open the file
NHanesPearson.arff
• Switch to the Classify submode
• Click on Choose, select classifier and use this
feature set and data to build a predictive
model;
• Anything below 0.3 isn’t highly
correlated with the target…
given the already picked features
independent
all
makes the least contribution
independent
Forward, Backward, Bi-Directional
Attributes to “seed” the search,
listed individually or by range.
Cutoff for backtracking…
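A sketch of the forward direction of such a search, simplified to plain greedy selection with no backtracking (so it is not the Best First search itself). The scoring function, the relevance values, and the redundant pair are all hypothetical stand-ins for a real subset evaluator.

```python
def forward_select(features, score):
    """Greedy forward search: add the best single feature until the score stops improving."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Hypothetical per-feature relevance and one redundant pair, for illustration only.
relevance = {"age": 0.5, "bmi": 0.4, "weight": 0.39, "zip": 0.0}
redundant_pairs = {frozenset(("bmi", "weight"))}

def toy_score(subset):
    """Reward relevant features, penalize each redundant pair in the subset."""
    gain = sum(relevance[f] for f in subset)
    overlap = sum(1 for i, a in enumerate(subset) for b in subset[i + 1:]
                  if frozenset((a, b)) in redundant_pairs)
    return gain - 0.4 * overlap

chosen = forward_select(relevance, toy_score)
print(chosen)
```

The search stops once "weight" would only duplicate "bmi" and "zip" adds nothing; a backward search would instead start from all features and remove the least useful one at each step.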
True: Adds features that are correlated
with class and NOT intercorrelated with
other features already in selection.
False: Eliminates redundant features.
Precompute the correlation matrix in
advance, useful for fast backtracking, or
compute lazily. When given a large
number of attributes, compute lazily…
CfsSubsetEval
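The idea of rewarding correlation with the class while penalizing intercorrelation among features is usually written as the CFS merit heuristic, merit = k · r_cf / sqrt(k + k(k - 1) · r_ff). The sketch below assumes that published form; it is not taken from the Weka source, and the correlation values are invented.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a k-feature subset from its average correlations."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Three features, equally predictive of the class in both cases.
independent = cfs_merit(3, 0.5, 0.0)   # features uncorrelated with each other
redundant = cfs_merit(3, 0.5, 0.9)     # features highly intercorrelated
print(independent, redundant)
```

With the same feature-class correlation, the redundant subset scores lower, which is why such an evaluator prefers features that add new information.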
NHANES_data.csv
• Convert the last column from numeric to nominal
• Set the search method as Best First, Forward
• Set the attribute evaluator as CfsSubsetEval
• Run across all attributes in data set…
Important points 1/2
• Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time), but it is not easy!
• Relevance <-> Optimality
• Correlation and mutual information between single variables and the target are often used as ranking criteria for variables.
Important points 2/2
• One cannot automatically discard variables with small scores; they may still be useful together with other variables.
• Filters, wrappers, embedded methods
• How to search the space of all feature subsets?
• How to assess the performance of a learner that uses a particular feature subset?
not all about accuracy
• Filtering is fast, linear, and intuitive
• Filtering is model-oblivious and may not be optimal
• Wrappers are model-aware but slow and nonintuitive
• PCA and SVD are lossy
work on the entire data set
start with fast feature filtering first
NOT to use any feature selection
That’s all for tonight…

Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 

Barga Data Science lecture 7

  • 1. Deriving Knowledge from Data at Scale
  • 2. Deriving Knowledge from Data at Scale Good data preparation is key to producing valid and reliable models…
  • 3. Deriving Knowledge from Data at Scale Lecture 7 Agenda
  • 4. Deriving Knowledge from Data at Scale
  • 5. Deriving Knowledge from Data at Scale
  • 6. Deriving Knowledge from Data at Scale
  • 7. Deriving Knowledge from Data at Scale Heads Up, Lecture 8 Agenda
  • 8. Deriving Knowledge from Data at Scale Course Project
  • 9. Deriving Knowledge from Data at Scale
  • 10. Deriving Knowledge from Data at Scale https://www.kaggle.com/c/springleaf-marketing-response not Determine whether to send a direct mail piece to a customer
  • 11. Deriving Knowledge from Data at Scale The Data
  • 12. Deriving Knowledge from Data at Scale The Rules
  • 13. Deriving Knowledge from Data at Scale
  • 14. Deriving Knowledge from Data at Scale what is the data telling you
  • 15. Deriving Knowledge from Data at Scale
  • 16. Deriving Knowledge from Data at Scale The Tragedy of the Titanic
  • 17. Deriving Knowledge from Data at Scale
  • 18. Deriving Knowledge from Data at Scale
  • 19. Deriving Knowledge from Data at Scale
  • 20. Deriving Knowledge from Data at Scale Does the new data point x* exactly match a previous point xi? If so, assign it to the same class as xi Otherwise, just guess. This is the “rote” classifier
  • 21. Deriving Knowledge from Data at Scale Does the new data point x* match a set of previous points xi on some specific attribute? If so, take a vote to determine class. Example: if most females survived, then assume every female survives. But there are lots of possible rules like this, and an attribute can have more than two values. If most people under 4 years old survive, then assume everyone under 4 survives. If most people with 1 sibling survive, then assume everyone with 1 sibling survives. How do we choose?
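
One way to choose among such single-attribute rules is to build each one by majority vote and keep the rule with the best training accuracy. The sketch below is only an illustration of that idea: the function names and the toy rows are made up here and are not the Titanic data used in the lecture.

```python
from collections import Counter, defaultdict

def fit_one_attribute_rule(rows, attribute, target):
    """Map each value of `attribute` to the majority class among matching rows."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[attribute]][row[target]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def training_accuracy(rows, attribute, target):
    """Fraction of rows that the majority-vote rule on `attribute` gets right."""
    rule = fit_one_attribute_rule(rows, attribute, target)
    hits = sum(rule[row[attribute]] == row[target] for row in rows)
    return hits / len(rows)

# Toy rows, invented for illustration only.
toy_rows = [
    {"sex": "female", "pclass": "1", "survived": "yes"},
    {"sex": "female", "pclass": "3", "survived": "yes"},
    {"sex": "female", "pclass": "2", "survived": "yes"},
    {"sex": "male", "pclass": "1", "survived": "yes"},
    {"sex": "male", "pclass": "2", "survived": "no"},
    {"sex": "male", "pclass": "3", "survived": "no"},
]

# Keep the attribute whose rule scores best on the training rows.
best_attribute = max(
    ["sex", "pclass"],
    key=lambda a: training_accuracy(toy_rows, a, "survived"),
)
```

Training accuracy is an optimistic yardstick; a held-out set would give a fairer comparison.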
  • 22. Deriving Knowledge from Data at Scale IF sex=‘female’ THEN survive=yes ELSE IF sex=‘male’ THEN survive=no. Confusion matrix:
        no   yes   <-- classified as
        468  109 | no
        81   233 | yes
    (468 + 233) / (468 + 109 + 81 + 233) = 79% correct (and 21% incorrect). Not bad!
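
The accuracy arithmetic on the rule slides can be checked with a few lines. This is a minimal sketch; the function name is an assumption, and the matrices are copied as printed on slides 22 and 24.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Matrices as printed on the slides. Accuracy uses only the diagonal and the
# grand total, so it does not depend on which axis holds the actual class.
sex_rule = [[468, 109], [81, 233]]
pclass_rule = [[372, 119], [177, 223]]

sex_accuracy = accuracy(sex_rule)        # 701 / 891
pclass_accuracy = accuracy(pclass_rule)  # 595 / 891
```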
  • 23. Deriving Knowledge from Data at Scale
  • 24. Deriving Knowledge from Data at Scale IF pclass=‘1’ THEN survive=yes ELSE IF pclass=‘2’ THEN survive=yes ELSE IF pclass=‘3’ THEN survive=no. Confusion matrix:
        no   yes   <-- classified as
        372  119 | no
        177  223 | yes
    (372 + 223) / (372 + 119 + 223 + 177) = 67% correct (and 33% incorrect). A little worse.
  • 25. Deriving Knowledge from Data at Scale Support Vector Machines
  • 26. Deriving Knowledge from Data at Scale
  • 27. Deriving Knowledge from Data at Scale
  • 28. Deriving Knowledge from Data at Scale [Figure: x → f → y_est; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). How would you classify this data? Estimation: w: weight vector; x: data vector.
  • 29. Deriving Knowledge from Data at Scale [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). How would you classify this data?
  • 30. Deriving Knowledge from Data at Scale [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). How would you classify this data?
  • 31. Deriving Knowledge from Data at Scale [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). How would you classify this data?
  • 32. Deriving Knowledge from Data at Scale [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). Any of these would be fine… but which is best?
  • 33. Deriving Knowledge from Data at Scale [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
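
The classifier f(x, w, b) = sign(w · x − b) and the margin definition can be made concrete with a short sketch. The function names and the toy points are assumptions for illustration. The margin is computed as twice the distance from the boundary to the nearest point, which assumes the band is widened symmetrically on both sides of the boundary.

```python
import math

def decision_value(x, w, b):
    """Signed score w . x - b for one data vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def classify(x, w, b):
    """sign(w . x - b); a score of exactly 0 is assigned to +1 here by choice."""
    return 1 if decision_value(x, w, b) >= 0 else -1

def margin_width(points, w, b):
    """Width the boundary can grow to (symmetrically) before touching a point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    nearest = min(abs(decision_value(x, w, b)) for x in points) / norm
    return 2 * nearest

# Toy data: two points on each side of the vertical line x1 = 1.
points = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
w, b = (1.0, 0.0), 1.0
```

A maximum margin classifier would search over w and b for the largest such width; this sketch only evaluates one given boundary.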
  • 34. Deriving Knowledge from Data at Scale [Figure: O and X data points, with SUPPORT VECTORS labeled]
  • 35. Deriving Knowledge from Data at Scale [Figure: O and X data points, with SUPPORT VECTORS labeled]
  • 36. Deriving Knowledge from Data at Scale Linear SVM [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
  • 37. Deriving Knowledge from Data at Scale Linear SVM [Figure: x → f → y_est, parameter a; markers denote +1 and −1] f(x, w, b) = sign(w · x − b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM). Support vectors are those datapoints that the margin pushes up against.
  • 38. Deriving Knowledge from Data at Scale [Figure: markers denote +1 and −1] f(x, w, b) = sign(w · x − b). Support vectors are those datapoints that the margin pushes up against. 1. Intuitively this feels safest. 2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification. 3. LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector data points. 4. Empirically it works very well.
  • 39. Deriving Knowledge from Data at Scale Class 1 Class 2 m
  • 40. Deriving Knowledge from Data at Scale This is going to be a problem!
  • 41. Deriving Knowledge from Data at Scale
  • 42. Deriving Knowledge from Data at Scale What if the data looks like that below?... kernel function
  • 43. Deriving Knowledge from Data at Scale What if the data looks like that below?...
  • 44. Deriving Knowledge from Data at Scale Everything you need to know in one slide…
  • 45. Deriving Knowledge from Data at Scale
  • 46. Deriving Knowledge from Data at Scale
  • 47. Deriving Knowledge from Data at Scale Kernel choice: probably the trickiest part of using an SVM. • RBF is a good first option… • It depends on your data, so try several. • Kernels have even been developed for nonnumeric data like sequences, structures, and trees/graphs. • It may help to use a combination of several kernels. • Don’t touch your evaluation data while you’re trying out different kernels and parameters; use cross-validation for this if you’re short on data.
  • 48. Deriving Knowledge from Data at Scale Complexity of the optimization problem remains only dependent on the dimensionality of the input space and not of the feature space!
  • 49. Deriving Knowledge from Data at Scale
  • 50. Deriving Knowledge from Data at Scale • SVM 1 learns “Output==1” vs “Output != 1” • SVM 2 learns “Output==2” vs “Output != 2” …. • SVM N learns “Output==N” vs “Output != N”
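
The slide lists the N binary SVMs but not how their outputs are combined. A common convention, assumed here, is to pick the class whose "this class vs. the rest" scorer returns the highest score. The linear scorers below are hypothetical stand-ins for trained SVMs.

```python
def linear_scorer(w, b):
    """Hypothetical stand-in for a trained binary SVM: returns w . x - b."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) - b

def one_vs_rest_predict(x, scorers):
    """Return the label whose 'label vs. not label' scorer is most confident."""
    return max(scorers, key=lambda label: scorers[label](x))

# One scorer per class, with made-up weights.
scorers = {
    1: linear_scorer((1.0, 0.0), 0.0),    # "Output == 1" vs "Output != 1"
    2: linear_scorer((0.0, 1.0), 0.0),    # "Output == 2" vs "Output != 2"
    3: linear_scorer((-1.0, -1.0), 0.0),  # "Output == 3" vs "Output != 3"
}
```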
  • 51. Deriving Knowledge from Data at Scale
  • 52. Deriving Knowledge from Data at Scale
  • 53. Deriving Knowledge from Data at Scale
  • 54. Deriving Knowledge from Data at Scale
  • 55. Deriving Knowledge from Data at Scale • weka.classifiers.functions.SMO • weka.classifiers.functions.LibSVM • weka.classifiers.functions.SMOreg
  • 56. Deriving Knowledge from Data at Scale
  • 57. Deriving Knowledge from Data at Scale
  • 58. Deriving Knowledge from Data at Scale
  • 59. Deriving Knowledge from Data at Scale IF pclass=‘1’ THEN survive=yes ELSE IF pclass=‘2’ THEN survive=yes ELSE IF pclass=‘3’ THEN survive=no. Confusion matrix:
        no   yes   <-- classified as
        372  119 | no
        177  223 | yes
    (372 + 223) / (372 + 119 + 223 + 177) = 67% correct (and 33% incorrect). A little worse.
  • 60. Deriving Knowledge from Data at Scale Support Vector Machine Model, Titanic Data, Linear Kernel
  • 61. Deriving Knowledge from Data at Scale Support Vector Machine Model, RBF Kernel Titanic Data
  • 62. Deriving Knowledge from Data at Scale Support Vector Machine Model, RBF Kernel Titanic Data overfitting?
  • 63. Deriving Knowledge from Data at Scale (Bill Howe, UW) Support Vector Machine Model, RBF Kernel, Titanic Data. Gamma: a parameter that balances model complexity against accuracy.
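
To see why gamma acts as a complexity knob, it helps to look at the RBF kernel itself. The sketch below assumes the usual form k(x, z) = exp(−gamma · ||x − z||²); the slide does not spell the formula out, and the function name is chosen here.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity exp(-gamma * ||x - z||^2) between two vectors."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = (0.0, 0.0), (1.0, 1.0)   # squared distance is 2
smooth = rbf_kernel(x, z, 0.1)  # small gamma: distant points still look similar
spiky = rbf_kernel(x, z, 10.0)  # large gamma: similarity vanishes quickly
```

With a large gamma each training point influences only its immediate neighborhood, so the boundary can wrap tightly around individual points, which is where overfitting comes from.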
  • 64. Deriving Knowledge from Data at Scale How They Won It! Lessons from data mining the past Kaggle contests…
  • 65. Deriving Knowledge from Data at Scale
  • 66. Deriving Knowledge from Data at Scale 10 Minute Break…
  • 67. Deriving Knowledge from Data at Scale Data Wrangling
  • 68. Deriving Knowledge from Data at Scale gold standard data sets
  • 69. Deriving Knowledge from Data at Scale one dominant attribute even distribution
  • 70. Deriving Knowledge from Data at Scale • Rule of thumb: 5,000 or more desired • Rule of thumb: for each attribute, 10 or more instances • Rule of thumb: >100 for each class
  • 71. Deriving Knowledge from Data at Scale Data cleaning, data integration, data transformation, data reduction, data discretization
  • 72. Deriving Knowledge from Data at Scale
  • 73. Deriving Knowledge from Data at Scale 1. Missing values 2. Outliers 3. Coding 4. Constraints
  • 74. Deriving Knowledge from Data at Scale ReplaceMissingValues • RemoveMisclassified • MergeTwoValues
  • 75. Deriving Knowledge from Data at Scale Missing values: in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. “Missing” can mean many things… • MAR, “Missing at Random”: usually the best case, usually not true. • Non-randomly missing. • Presumed normal, so not measured. • Causally missing: the attribute value is missing because of other attribute values (or because of the outcome value!).
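
A minimal sketch of the simplest treatment, mean imputation, follows. It is only defensible when values are missing roughly at random, so the sketch also keeps a "was missing" flag in case the missingness itself carries signal. Using `None` for a missing value and the function names are assumptions made here.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def missing_flags(values):
    """Indicator column: True where the original value was missing."""
    return [v is None for v in values]

ages = [1.0, None, 3.0]
imputed_ages = impute_with_mean(ages)
age_was_missing = missing_flags(ages)
```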
  • 76. Deriving Knowledge from Data at Scale
  • 77. Deriving Knowledge from Data at Scale Representation matters…
  • 78. Deriving Knowledge from Data at Scale
  • 79. Deriving Knowledge from Data at Scale
  • 80. Deriving Knowledge from Data at Scale Bank Data A Bank Data B Q Q
  • 81. Deriving Knowledge from Data at Scale Data Transformation: simple transformations can often have a large impact on performance. Example transformations (not all for performance improvement): • difference of two date attributes, distance between coordinates, … • ratio of two numeric (ratio-scale) attributes, average for smoothing, … • concatenating the values of nominal attributes • encoding (probabilistic) cluster membership • adding noise to data (for robustness tests) • removing data randomly or selectively • obfuscating the data (for anonymity). Intuition: add features that increase class discrimination (E, IG)…
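
Two of the listed transformations, a date difference and a ratio of two numeric attributes, are sketched below. The helper names and the example values are hypothetical.

```python
from datetime import date

def days_between(start, end):
    """Difference of two date attributes, in whole days."""
    return (end - start).days

def safe_ratio(numerator, denominator):
    """Ratio of two numeric attributes; None when the denominator is zero."""
    return numerator / denominator if denominator else None

# Made-up example values.
account_age_days = days_between(date(2015, 1, 1), date(2015, 3, 1))
debt_to_income = safe_ratio(50, 200)
```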
  • 82. Deriving Knowledge from Data at Scale • Combine attributes • Normalizing data • Simplifying data
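
As one concrete form of normalizing, min-max scaling maps each attribute onto [0, 1]. This is a sketch under the assumption that min-max scaling is the intended kind of normalization; the slide does not say which one the lecture used.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30])
```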
  • 83. Deriving Knowledge from Data at Scale Change of scale
  • 84. Deriving Knowledge from Data at Scale
  • 85. Deriving Knowledge from Data at Scale
  • 86. Deriving Knowledge from Data at Scale Aggregated data tends to have less variability and potentially more information for better predictions…
  • 87. Deriving Knowledge from Data at Scale
  • 88. Deriving Knowledge from Data at Scale
  • 89. Deriving Knowledge from Data at Scale
  • 90. Deriving Knowledge from Data at Scale
  • 91. Deriving Knowledge from Data at Scale
  • 92. Deriving Knowledge from Data at Scale
  • 93. Deriving Knowledge from Data at Scale fills in the missing values for an instance with the expected values
  • 94. Deriving Knowledge from Data at Scale Specify the new ordering… Examples: “2-last,1” and “1-5,7-last,6”
  • 95. Deriving Knowledge from Data at Scale Resample (instance filter): a random subsample, with or without replacement (to replace or not…). The same random seed will result in the same (repeatable) sample. Sample size is given as a percentage of the original data set size.
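
The three options described for resampling (replacement, seed, and size as a percentage) can be sketched as follows. This mimics the described behavior only; it is not Weka's implementation, and the parameter names are chosen here.

```python
import random

def resample(instances, percent, with_replacement, seed):
    """Draw a random subsample whose size is `percent` of the original list."""
    rng = random.Random(seed)  # same seed, same sample
    size = round(len(instances) * percent / 100)
    if with_replacement:
        return [rng.choice(instances) for _ in range(size)]
    # Without replacement, size must not exceed len(instances).
    return rng.sample(instances, size)

data = list(range(10))
first = resample(data, 50, False, seed=1)
second = resample(data, 50, False, seed=1)
```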
  • 96. Deriving Knowledge from Data at Scale Suggestions for you to try…
  • 97. Deriving Knowledge from Data at Scale
  • 98. Deriving Knowledge from Data at Scale Curse of Dimensionality … exponentially … In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!
  • 99. Deriving Knowledge from Data at Scale • Sampling
  • 100. Deriving Knowledge from Data at Scale work almost as well as using the entire data set the same property (of interest) as the original set of data
  • 101. Deriving Knowledge from Data at Scale
  • 102. Deriving Knowledge from Data at Scale
  • 103. Deriving Knowledge from Data at Scale • Principal Component Analysis
  • 104. Deriving Knowledge from Data at Scale
  • 105. Deriving Knowledge from Data at Scale Find k ≤ n new features (principal components) that can best represent the data. Works for numeric data only…
  • 106. Deriving Knowledge from Data at Scale
  • 107. Deriving Knowledge from Data at Scale Attribute Selection (feature selection)
  • 108. Deriving Knowledge from Data at Scale What is Feature selection ? DIMENSIONALITY REDUCTION
  • 109. Deriving Knowledge from Data at Scale Problem: Where to focus attention?
  • 110. Deriving Knowledge from Data at Scale Feature Selection, starts with you… smallest subset of attributes
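  • 111. Deriving Knowledge from Data at Scale What is evaluated (attributes or subsets of attributes) versus evaluation method (independent or learning algorithm): independent evaluation of attributes = filters; independent evaluation of subsets of attributes = filters; evaluation of subsets by a learning algorithm = wrappers.
  • 112. Deriving Knowledge from Data at Scale What is evaluated (attributes or subsets of attributes) versus evaluation method (independent or learning algorithm): independent evaluation of attributes = filters; independent evaluation of subsets of attributes = filters; evaluation of subsets by a learning algorithm = wrappers.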
  • 113. Deriving Knowledge from Data at Scale list of attributes evaluated individually select subset
  • 114. Deriving Knowledge from Data at Scale
  • 115. Deriving Knowledge from Data at Scale A correlation coefficient shows the degree of linear dependence of x and y. In other words, the coefficient shows how close two variables lie along a line. If the coefficient is equal to 1 or -1, all the points lie along a line. If the correlation coefficient is equal to zero, there is no linear relation between x and y. However, this does not necessarily mean that there is no relation between the two variables. There could e.g. be a non-linear relation.
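
Both claims on this slide can be checked numerically: points on a line give a coefficient of 1, and a purely quadratic relation over a symmetric range gives 0 even though y is fully determined by x. The function below is a plain textbook implementation of Pearson's r; the data are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-2, -1, 0, 1, 2]
on_a_line = pearson(xs, [3 * x + 1 for x in xs])  # linear relation
parabola = pearson(xs, [x * x for x in xs])       # non-linear relation
```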
  • 116. Deriving Knowledge from Data at Scale Tab for selecting attributes in a data set…
  • 117. Deriving Knowledge from Data at Scale Interface for classes that evaluate attributes… Interface for ranking or searching for a subset of attributes…
  • 118. Deriving Knowledge from Data at Scale Select CorrelationAttributeEval for Pearson Correlation… False: doesn’t return R scores; True: returns R scores.
  • 119. Deriving Knowledge from Data at Scale Ranks attributes by their individual evaluations; used in conjunction with GainRatio, Entropy, Pearson, etc. Number of attributes to return: -1 returns all ranked attributes. Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10]. Cutoff at which attributes can be discarded: -1 means no cutoff.
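
The ranking behavior described above can be sketched in a few lines. This is an illustration, not Weka's code: the scores are invented, `num_to_select=-1` follows the slide's "return all" convention, and the cutoff is modeled as `None` for "no cutoff", which is a choice made here to keep negative scores usable.

```python
def rank_attributes(scores, num_to_select=-1, threshold=None):
    """Sort attributes by score (best first), then apply cutoff and count limit."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(name, s) for name, s in ranked if s >= threshold]
    if num_to_select >= 0:
        ranked = ranked[:num_to_select]
    return ranked

# Hypothetical per-attribute scores, e.g. correlations with the target.
scores = {"income": 0.18, "age": 0.42, "height": 0.05, "bmi": 0.35}
top_two = rank_attributes(scores, num_to_select=2)
above_cutoff = rank_attributes(scores, threshold=0.3)
```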
  • 120. Deriving Knowledge from Data at Scale Predicting Self-Reported Health Status. The data set: NHANES_data.csv (National Health and Nutrition Examination Survey). Survey question: “How would you say your health in general is?” (an excellent predictor of mortality, health care utilization & disability). How I processed it: • 4000 variables; • attributes with > 30% missing values removed (dropped column); • 105 variables remaining; • chi-square test of each variable against the target, remove variables with P value < .20; • impute all missing values using expectation maximization; • 85 variables remaining. Pearson Correlation Exercise…
  • 121. Deriving Knowledge from Data at Scale NHANES_data.csv
  • 122. Deriving Knowledge from Data at Scale NHANES_data.csv • Convert the last column from numeric to nominal • Find the top 15 features using Pearson Correlation
  • 123. Deriving Knowledge from Data at Scale • OneRAttributeEval • GainRatioAttributeEval • InfoGainAttributeEval • ChiSquaredAttributeEval • ReliefFAttributeEval
  • 124. Deriving Knowledge from Data at Scale • Right Click on the new line in the Result list; • From the pop-up menu, select the item Save reduced data…
  • 125. Deriving Knowledge from Data at Scale • Right Click on the new line in the Result list; • From the pop-up menu, select the item Save reduced data… • Save the dataset with 15 selected attributes to file NHanesPearson.arff
  • 126. Deriving Knowledge from Data at Scale • Right Click on the new line in the Result list; • From the pop-up menu, select the item Save reduced data… • Save the dataset with 15 selected attributes to file NHanesPearson.arff • Switch to the Preprocess mode in Explorer • Click on Open file… and open the file NHanesPearson.arff • Switch to the Classify submode • Click on Choose, select classifier and use this feature set and data to build a predictive model;
  • 127. Deriving Knowledge from Data at Scale • Anything below 0.3 isn’t highly correlated with the target…
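  • 128. Deriving Knowledge from Data at Scale What is evaluated (attributes or subsets of attributes) versus evaluation method (independent or learning algorithm): independent evaluation of attributes = filters; independent evaluation of subsets of attributes = filters; evaluation of subsets by a learning algorithm = wrappers.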
  • 129. Deriving Knowledge from Data at Scale given the already picked features independent
  • 130. Deriving Knowledge from Data at Scale all makes the least contribution independent
  • 131. Deriving Knowledge from Data at Scale
  • 132. Deriving Knowledge from Data at Scale
  • 133. Deriving Knowledge from Data at Scale
  • 134. Deriving Knowledge from Data at Scale
  • 135. Deriving Knowledge from Data at Scale Interface for classes that evaluate attributes… Interface for ranking or searching for a subset of attributes…
  • 136. Deriving Knowledge from Data at Scale
  • 137. Deriving Knowledge from Data at Scale Search direction: Forward, Backward, or Bi-Directional. Attributes to “seed” the search, listed individually or by range. Cutoff for backtracking…
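
A forward search can be sketched as a greedy loop: at each step, add the feature that most improves the subset's score, and stop when nothing improves. This is plain greedy forward selection, not Best First with backtracking, and the `evaluate` function and its gains are hypothetical stand-ins for a real subset evaluator.

```python
def forward_selection(features, evaluate):
    """Greedily grow a feature subset while the evaluation score improves."""
    selected, best_score = [], evaluate([])
    remaining = list(features)
    while remaining:
        candidate = max(remaining, key=lambda f: evaluate(selected + [f]))
        score = evaluate(selected + [candidate])
        if score <= best_score:
            break  # no remaining feature helps
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected, best_score

# Made-up evaluator: each feature has a gain, each extra feature costs 0.05.
gains = {"a": 0.5, "b": 0.3, "c": 0.0}

def evaluate(subset):
    return sum(gains[f] for f in subset) - 0.05 * len(subset)

chosen, chosen_score = forward_selection(["a", "b", "c"], evaluate)
```

A backward search would run the same loop in reverse: start from all features and repeatedly drop the one whose removal hurts the score least.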
  • 138. Deriving Knowledge from Data at Scale CfsSubsetEval. True: adds features that are correlated with the class and NOT intercorrelated with other features already in the selection. False: eliminates redundant features. The correlation matrix can be precomputed in advance (useful for fast backtracking) or computed lazily; when given a large number of attributes, compute lazily…
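
The idea of rewarding correlation with the class while penalizing intercorrelation among features is often written as the correlation-based feature selection merit, k · r_cf / sqrt(k + k(k − 1) · r_ff). The sketch below assumes that standard formula; the slide does not state the exact computation the evaluator uses, and the argument names are chosen here.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features from its two average correlations."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

# Four features, each moderately correlated with the class (0.5).
independent = cfs_merit(4, 0.5, 0.0)  # features unrelated to each other
redundant = cfs_merit(4, 0.5, 1.0)    # features that duplicate each other
```

Fully redundant features earn no more merit than a single one of them, which is why the subset search tends to drop them.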
  • 139. Deriving Knowledge from Data at Scale NHANES_data.csv • Convert the last column from numeric to nominal • Set the search method as Best First, Forward • Set the attribute evaluator as CfsSubsetEval • Run across all attributes in data set…
  • 140. Deriving Knowledge from Data at Scale Important points 1/2 • Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time), but it is not easy! • Relevance <-> Optimality • Correlation and mutual information between single variables and the target are often used as ranking criteria for variables.
  • 141. Deriving Knowledge from Data at Scale Important points 2/2 • One cannot automatically discard variables with small scores; they may still be useful together with other variables. • Filters, Wrappers, Embedded Methods • How to search the space of all feature subsets? • How to assess the performance of a learner that uses a particular feature subset?
  • 142. Deriving Knowledge from Data at Scale … not all about accuracy • Filtering is fast, linear and intuitive. • Filtering is model-oblivious, so it may not be optimal. • Wrappers are model-aware, but slow and nonintuitive. • PCA and SVD are lossy and work on the entire data set. … start with fast feature filtering first … NOT to use any feature selection
  • 143. Deriving Knowledge from Data at Scale That’s all for tonight…