Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Models in Production
Deriving Knowledge from Data at Scale
Putting an ML Model into Production
• A/B Testing
Deriving Knowledge from Data at Scale
Controlled Experiments in One Slide
Concept is Trivial
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are
caused by changes introduced in the treatment(s)
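A minimal sketch of the kind of statistical test this implies, assuming a simple two-variant experiment with binary conversions (the counts below are made-up illustrative numbers): a two-proportion z-test using scipy.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))                       # two-sided p-value

# Made-up counts: 1200/50000 conversions in control vs 1290/50000 in treatment.
z, p = two_proportion_z_test(1200, 50_000, 1290, 50_000)
print(f"z = {z:.2f}, p = {p:.4f}")   # call it a real difference only if p is small (e.g. < 0.05)
```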
Deriving Knowledge from Data at Scale
Best Practice: A/A Test
Run A/A tests before running A/B tests
Deriving Knowledge from Data at Scale
Best Practice: Ramp-up
Ramp-up
Deriving Knowledge from Data at Scale
Best Practice: Run Experiments at 50/50%
Deriving Knowledge from Data at Scale
Cost-Based Learning
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution & Error Costs
WEKA cost-sensitive learning (weighting method): when false negatives (FN) are the expensive
errors, weight the classes so the learner tries to avoid false negatives.
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution
WEKA cost-sensitive learning: in the Explorer, go from the Preprocess tab to the Classify tab and
choose meta.CostSensitiveClassifier; in its cost matrix, set the FN cost to 10.0 and the FP cost to 1.0.
A base learner that normally tries to optimize accuracy or error (e.g., a decision tree or rule learner)
then becomes cost-sensitive.
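The slides use WEKA's meta.CostSensitiveClassifier; as a rough analogue outside WEKA, here is a hedged scikit-learn sketch on synthetic data that encodes the same preference (FN about 10x as costly as FP) through class weights.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
costed = DecisionTreeClassifier(class_weight={0: 1.0, 1: 10.0},   # FN ~10x as costly as FP
                                random_state=0).fit(X_tr, y_tr)

for name, model in [("plain", plain), ("cost-sensitive", costed)]:
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    print(f"{name}: FN={fn}, FP={fp}")   # the weighted tree trades extra FPs for fewer FNs
```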
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution
WEKA cost sensitive learning
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gold sets: curated data sets that completely specify a problem and measure progress;
each is paired with a metric, target SLAs, and a scoreboard.
Deriving Knowledge from Data at Scale
This isn’t easy…
• Building high quality gold sets is a challenge.
• It is time consuming.
• It requires making difficult and long lasting
choices, and the rewards are delayed…
Deriving Knowledge from Data at Scale
enforce a few principles
1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold
Deriving Knowledge from Data at Scale
• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity
Deriving Knowledge from Data at Scale
Building Gold sets is hard work. Many common and avoidable mistakes are
made. This suggests having a checklist. Some questions will be trivial to
answer or not applicable, some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same
gold set is a problem (you can’t optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not
through sampling manipulation. Having the weighting in the metric has two
advantages: 1) it is explicitly documented and reproducible in the form of a metric
algorithm, and 2) production, train, and test set results remain directly comparable
(automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s).
There could be more than one yardstick. A simple yardstick is useful for ramping up.
Once one can reproduce/understand the simple yardstick’s result, it becomes easier
to improve on the latest “production” yardstick. Ideally yardsticks come with
downloadable code. The yardsticks provide a set of errors that suggests where
innovation should happen.
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation
velocity and a level of representativeness. A good rule of thumb is 5X size ratios
between gold sets drawn from the same distribution. Where should the data live? If
on a server, some services are needed for access and simple manipulations. There
should always be a size that is downloadable (< 1GB) to a desktop for high velocity
innovation.
5. Documentation and format: Create a format/API for the data. Is the data
compressed? Provide sample code to load the data. Document the format. Assign
someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results
to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets.
a. One set should have the deployed features (computed from the raw data). This provides the
production yardstick.
b. One set should be Raw (e.g. contains all information, possibly through tables). This allows
contributors to create features from the raw data to investigate its potential compared to existing
features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be “building
blocks”, features that are scheduled to be deployed next, or high potential features. Moving some
features to a gold set is convenient if multiple people are working on the next generation. Not all
features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance,
an IP address, a query, or a listing ID may be features. But only the most frequent 10M
instances are worth having specific trainable parameters. A pass over the data can
identify the top 10M instances. This is a form of feature optimization. Identifying these
features does not require labels. If a form of feature optimization is done, a separate
data set (disjoint from the training and test sets) must be provided.
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many
cases, we hide the fact that the problem is a time series even though the goal is to
predict the future and we know that the distribution is changing. We must quantify
how much a distribution changes over a fixed period of time. There are several ways
to mitigate the changing distribution problem:
a. Assume the distribution is I.I.D. Regularly re-compute training sets and Gold sets. Determine the
frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI
changes while the algorithm is kept constant).
b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters.
The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at
time t). In this space, the patterns are much larger, but the problem is closer to being I.I.D.
9. The gold sets should have information that reveals the stale rate and allows algorithms
to differentiate themselves based on how they degrade with time.
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are
grouped per writer. A set built by shuffling the words is misleading because training
and testing would have word examples from the same writer, which makes
generalization much easier. If the words are grouped per writer, then a writer is
unlikely to appear in both the training and test sets, which requires the system to
generalize to never-before-seen handwriting (as opposed to never-before-seen words).
Do we have these types of constraints? Should we group per advertiser, campaign, or
user to generalize across new instances of these entities (as opposed to generalizing to
new queries)? ML requires training and testing to be drawn from the same distribution.
Drawing duplicates is not a problem. Problems arise when one draws examples from
the same entity into both training and testing from a small set of entities. This breaks
the IID assumption and makes the generalization on the test set look much easier than
it actually is (a group-aware split helps; see the sketch after this list).
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of
the following filtered out: fraud, bad configurations, duplicates, non-billable, adult,
overwrites, etc? Guidance: use the production sameness principle.
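One way to implement the grouping advice in item 10: a group-aware split, sketched below with scikit-learn's GroupShuffleSplit on synthetic data. The group column (e.g. a writer or advertiser id) is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
groups = rng.integers(0, 50, size=1000)          # hypothetical writer_id / advertiser_id

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
# Every group ends up entirely in train or entirely in test: no entity leakage.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```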
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of
unlabeled data with the same distribution should be collected and be made a gold
set. This enables the discovery of new features using intermediate classifiers and
active labeling.
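A sketch of the "intermediate classifiers and active labeling" idea on synthetic data: train a provisional classifier on the few labels available, then rank the unlabeled pool by uncertainty to decide which examples to label next.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                               # only 50 labels to start with

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])   # intermediate classifier
pool_idx = np.where(~labeled)[0]
proba = clf.predict_proba(X[pool_idx])
uncertainty = 1.0 - proba.max(axis=1)             # low top-class probability = uncertain
ask_next = pool_idx[np.argsort(uncertainty)[-100:]]   # 100 most uncertain unlabeled examples
print("send these indices to the labelers:", ask_next[:10])
```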
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train → ML Model
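A minimal sketch of the "Train → ML Model" step on part of the toy table above (first six rows only, for brevity): one-hot encode the categorical attributes with pandas and fit a scikit-learn decision tree. Purely illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "gender":      ["male", "female", "male", "male", "female", "female"],
    "age":         [19, 44, 49, 12, 37, 60],
    "smoker":      ["yes", "yes", "yes", "no", "no", "no"],
    "eye_color":   ["green", "gray", "blue", "brown", "brown", "brown"],
    "lung_cancer": ["no", "yes", "yes", "no", "no", "yes"],
})  # first six rows of the table above

X = pd.get_dummies(df.drop(columns="lung_cancer"))   # one-hot encode the categoricals
y = (df["lung_cancer"] == "yes").astype(int)
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict(X))                              # sanity check on the training rows
```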
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning?
Lack of Labelled Training Data…
What to Do?
• Controlled Experiments – get feedback from users to serve as labels;
• Mechanical Turk – pay people to label data to build a training set;
• Ask Users to Label Data – report as spam, ‘hot or not?’, review a product,
observe their click behavior (ad retargeting, search results, etc.).
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books.
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (Attributes: Age, Income; Target Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K     ?
39   41K     ?

Prediction (output of the Model on the Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar
instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then
perform majority vote
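A minimal sketch of "cluster, then perform majority vote" on synthetic data with k-Means: cluster everything (labeled and unlabeled), then label each cluster by the majority vote of its few labeled members.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=500, centers=2, random_state=0)
y = np.full(len(X), -1)                 # -1 marks an unlabeled instance
y[:10] = y_true[:10]                    # only ten labeled points

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
y_pred = np.empty(len(X), dtype=int)
for c in np.unique(clusters):
    votes = y[(clusters == c) & (y != -1)]        # labels of the labeled members
    y_pred[clusters == c] = np.bincount(votes).argmax() if votes.size else 0
print("accuracy:", (y_pred == y_true).mean())
```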
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated
by a normal distribution
• Find most probable location and shape of clusters
given data
Expectation-Maximization
• Two step optimization procedure
• Keeps estimates of cluster assignment probabilities
for each instance
• Might converge to local optimum
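A sketch of the generative approach on synthetic data: scikit-learn's GaussianMixture runs EM over all points (labeled and unlabeled), and the handful of labels is used only to map mixture components to classes. This is an illustration, not a full semi-supervised EM.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, y_true = make_blobs(n_samples=500, centers=2, random_state=1)
n_labeled = 10                                     # only a handful of labels
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM under the hood
comp = gmm.predict(X)                              # hard component assignments

mapping = {}
for c in range(2):                                 # component -> majority class of its labeled points
    votes = y_true[:n_labeled][comp[:n_labeled] == c]
    mapping[c] = int(np.bincount(votes).argmax()) if votes.size else c
y_pred = np.array([mapping[c] for c in comp])
print("accuracy:", (y_pred == y_true).mean())
```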
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
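A sketch of the self-training loop just described, written out by hand on synthetic data rather than relying on any particular library wrapper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=2000, random_state=0)
y = np.full(len(X), -1)
y[:50] = y_true[:50]                                  # 50 labeled examples to start

for _ in range(10):                                   # a few self-training rounds
    clf = LogisticRegression(max_iter=1000).fit(X[y != -1], y[y != -1])
    unlabeled = np.where(y == -1)[0]
    if unlabeled.size == 0:
        break
    proba = clf.predict_proba(X[unlabeled])
    confident = proba.max(axis=1) > 0.95              # adopt only confident pseudo-labels
    if not confident.any():
        break
    y[unlabeled[confident]] = clf.classes_[proba[confident].argmax(axis=1)]

print("pseudo-label accuracy:", (y[y != -1] == y_true[y != -1]).mean())
```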
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Minimize distance to labeled and
unlabeled instances
• Parameter to fine-tune influence of
unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves
classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled
or misclassified instances
• Many random runs to find best one
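A rough sketch of the SGD idea above (without the Hadoop part) on synthetic data: hinge loss on labeled points, a "hat" loss max(0, 1 − |w·x|) that pushes unlabeled points out of the margin, and a step size that decays over a single randomized pass. Treat it as an illustration of the non-convex objective, not a production S3VM.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=400, centers=2, random_state=0)
y = np.where(y_true == 1, 1.0, -1.0)              # labels in {-1, +1}
labeled = np.zeros(len(X), dtype=bool)
labeled[:20] = True                               # 20 labeled points, rest unlabeled

w = np.zeros(X.shape[1])
lam, unlab_weight = 1e-3, 0.1
order = np.random.default_rng(0).permutation(len(X))   # one pass in random order
for t, i in enumerate(order, start=1):
    eta = 1.0 / (lam * t)                         # steps get smaller over time
    grad = lam * w                                # regularization term
    score = float(w @ X[i])
    if labeled[i] and y[i] * score < 1:           # misclassified / inside the margin
        grad -= y[i] * X[i]
    elif not labeled[i] and abs(score) < 1:       # unlabeled point inside the margin
        grad -= unlab_weight * (1.0 if score >= 0 else -1.0) * X[i]
    w -= eta * grad

acc = ((X @ w > 0).astype(int) == y_true).mean()
print("training accuracy:", acc)                  # toy check only
```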
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low
dimensional manifold
• One can perform learning in a more meaningful
low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
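A minimal sketch of building such a similarity graph with scikit-learn's kneighbors_graph on synthetic data; the neighbor count is arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# Sparse adjacency matrix: each instance is connected to its 10 nearest neighbors.
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
A = 0.5 * (A + A.T)                      # symmetrize so every edge goes both ways
print("graph over", A.shape[0], "instances,", A.nnz, "nonzero entries")
```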
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
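A minimal sketch using scikit-learn's LabelPropagation on synthetic two-moons data; unlabeled points are marked with -1, as that API expects.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
y = np.full(len(X), -1)
y[:5], y[-5:] = y_true[:5], y_true[-5:]            # only ten labeled points

lp = LabelPropagation(kernel="knn", n_neighbors=7, max_iter=1000).fit(X, y)
mask = y == -1
print("accuracy on unlabeled points:", (lp.transduction_[mask] == y_true[mask]).mean())
```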
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Lesson #2: GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Lesson #2: Get the data!
Deriving Knowledge from Data at Scale
Lesson #3: Prepare to be humbled
Left Elevator Right Elevator
Deriving Knowledge from Data at Scale
• Lesson #1
• Lesson #2
• Lesson #3
15% Bing
Deriving Knowledge from Data at Scale
• HiPPO stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are
caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same
A B
Deriving Knowledge from Data at Scale
• A was 8.5% better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon,
and “popular searches”;
B has a big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
Get the data. Prepare to be humbled.
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
• If something is “amazing,” find the flaw!
• Examples
• If you have a mandatory birth date field and people think it’s
unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first
alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue.
Seemed reasonable, but when the results look so extreme, find
the flaw (conversion rate is not the same; see why?)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it.
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2
Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an
important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million
women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the other ward at the hospital, attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2
Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and death rate fell significantly when
he was away. Could it be related to him?
• Insight:
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to
healthy patients on the hands of the physicians
• He experiments with cleansing agents
• Chlorinated lime was effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the
prime contributors to the 2 million health-care-associated infections and
90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid the Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement,
Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Real Data for the city of Oldenburg,
Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and
storks when you were three is still not right,
despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
But…don’t try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
—Lewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course Project
Due Oct. 25th
Deriving Knowledge from Data at Scale
Open Discussion on
Course Project…
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments: to help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment; for feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Getting data for the experiment → splitting into training and testing datasets → using classification algorithms → evaluating the model
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints,
if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data
acquisition (missing data, outliers due to errors in the data collection process,
more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification,
regression, ranking, clustering, forecasting, outlier detection etc. – some
combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights,
targets etc., which can be used for modeling. Feature construction can often
be improved with domain knowledge. The target must be identical to (or a very
good proxy for) the quantitative metric identified in step 1.
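A sketch of step 4 as code, assuming hypothetical raw columns: a scikit-learn preprocessing pipeline that turns raw data into a modeling dataset of features plus the target.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

raw = pd.DataFrame({
    "age": [24, 65, None, 35],                 # hypothetical raw columns
    "country": ["US", "DE", "US", None],
    "liked": [1, 0, 0, 1],                     # target chosen to match the step-1 metric
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocess = ColumnTransformer([("num", numeric, ["age"]),
                                ("cat", categorical, ["country"])])

X = preprocess.fit_transform(raw.drop(columns="liked"))   # the modeling dataset
y = raw["liked"]
print(X.shape, y.shape)
```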
Deriving Knowledge from Data at Scale
Train/Test split → Feature selection → Model training → Model scoring → Evaluation
5. Train, test, and evaluate, taking care to control
bias/variance and to ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here); be vigilant
against target leaks (which typically lead to
unbelievably good test metrics) – this is the
ML-heavy step.
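A sketch of step 5's "metrics with the right confidence intervals" on synthetic data: cross-validate and report the mean score with an approximate interval over folds, rather than a single number.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="roc_auc")
mean = scores.mean()
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"AUC = {mean:.3f} +/- {half_width:.3f} (rough 95% interval over folds)")
# If a score looks unbelievably good, suspect a target leak before celebrating.
```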
Deriving Knowledge from Data at Scale
Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction → Train/Test split → Feature selection → Model training → Model scoring → Evaluation
6. Iterate steps (2) – (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data → Pre-processing → Feature construction → Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That’s all for our course….