Data Science 101

David Gerster
Strategic Advisory Board

About me
• 10+ years of experience in data science at various consumer
web companies
• Worked on web search at Yahoo and Microsoft
• Led the Mobile data science team at Groupon
• Joined BigML as VP Data Science in July 2013
• Joined JLL Spark as VP Data in July 2017
• Advisor to High Fidelity Genetics

Finding meaningful patterns in data
• The famous “Iris” data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
[Photos: Iris setosa, Iris versicolor, Iris virginica]

[Figure: scatter plot of petal width (cm) against petal length (cm); Iris setosa as red dots, Iris versicolor as green dots, Iris virginica as blue dots]

[Figure: petal width (cm) against petal length (cm)]
Congratulations! You just trained a model.

[Figure: petal width (cm) against petal length (cm), with four new flowers and their predictions: Iris setosa, Iris versicolor, Iris virginica, Iris virginica]

[Figure: petal width (cm) against petal length (cm), with the four predictions shown]
Congratulations! You just scored four new flowers using your model,
and made a prediction about the species of each one.

Training versus Scoring
• This process had two steps: training and scoring
• When training on historical data, you’re using data gathered over
some length of time
• When scoring new data points, you want the answer immediately
(in “real time”)
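
The two steps can be sketched in a few lines of Python. This is only an illustration, not the tool used in the talk: the function names `train` and `score` and the petal widths are hypothetical, and the "model" is a single threshold on petal width.

```python
# Illustrative petal widths (cm); hypothetical values, not the actual Iris data.
TRAINING_DATA = [
    (0.2, "setosa"), (0.3, "setosa"), (0.4, "setosa"),
    (1.3, "versicolor"), (1.4, "versicolor"), (1.5, "versicolor"),
]


def train(rows):
    """Training: pick the petal-width threshold that misclassifies the fewest rows."""
    widths = sorted(width for width, _ in rows)
    # Candidate thresholds are midpoints between neighbouring widths.
    candidates = [(a + b) / 2 for a, b in zip(widths, widths[1:])]

    def errors(threshold):
        return sum(
            (label == "setosa") != (width <= threshold) for width, label in rows
        )

    return min(candidates, key=errors)


def score(threshold, width):
    """Scoring: apply the already-trained threshold to one new flower."""
    return "setosa" if width <= threshold else "versicolor"


model = train(TRAINING_DATA)   # done once, on historical data
print(score(model, 0.25))      # done per new data point, immediately
print(score(model, 1.6))
```

Training scans all the historical rows once; scoring is a single comparison, which is why it can run in real time.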

Figure annotations:
• Predicts “blue” with high confidence; explains a large chunk of the data (high support)
• Predicts “blue” with low confidence; explains a small chunk of the data (low support)

Support and Confidence
• A rectangle with a large number of data points has high “support”
• A rectangle that is purely one color has high “confidence”
• If there is a small number of data points, confidence is low even if
it’s purely one color
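
A rough sketch of both measures follows. The slides do not say how confidence is computed, so the Wilson score lower bound used here is an assumption, chosen because it drops for small rectangles even when they are pure. The function names are hypothetical.

```python
import math


def support(points_in_region, total_points):
    """Fraction of all data points that fall inside the region."""
    return points_in_region / total_points


def confidence(majority_count, points_in_region, z=1.96):
    """Wilson score lower bound on the majority-class share (an assumed choice).

    Unlike raw purity, this is lower when the region holds few points.
    """
    n = points_in_region
    if n == 0:
        return 0.0
    p = majority_count / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# A pure region with 45 of 150 points versus a pure region with only 3.
print(support(45, 150), confidence(45, 45))
print(support(3, 150), confidence(3, 3))
```

Both regions are purely one color, but the smaller one gets a much lower confidence.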

[Figure: petal width (cm) against petal length (cm), shown next to the “Decision Tree” below]
“Decision Tree”
All data: 50 red, 50 blue, 50 green
  Width <= 0.8? → 50 red (leaf node)
  Width > 0.8? → 50 blue, 50 green
    Width > 1.75? → 45 blue (leaf node)
    Width <= 1.75? → 5 blue, 50 green
      Length <= 5? → 1 blue, 48 green (leaf node)
      Length > 5? → 4 blue, 2 green (leaf node)
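
Scoring with this tree amounts to a chain of comparisons. The sketch below hard-codes the thresholds from the slide and returns each leaf's majority class, using the color coding from the earlier plot (red = setosa, green = versicolor, blue = virginica). The function name and the test flowers are hypothetical.

```python
def predict_species(petal_length, petal_width):
    """Walk the tree from the root to a leaf and return that leaf's majority class."""
    if petal_width <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green
    return "Iris virginica"         # leaf: 4 blue, 2 green


# Hypothetical measurements (cm) for three new flowers.
for length, width in [(1.4, 0.2), (4.5, 1.4), (5.8, 2.2)]:
    print(length, width, predict_species(length, width))
```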

• Data is just a table of values
• Each row is an instance, an example
of the concept to be learned
• Each column is an attribute or
feature of the instance
• The column we want to predict is the
label or output
• Because we have a label, this is
supervised learning
[Figure: example table with rows marked “instance” and columns marked “feature” and “label”]
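
As a small illustration of this vocabulary, here is one way such a table could be held in Python. The column names and values are hypothetical.

```python
# Each dict is one row (an instance); the values are hypothetical.
rows = [
    {"petal_length": 1.4, "petal_width": 0.2, "species": "Iris setosa"},
    {"petal_length": 4.7, "petal_width": 1.4, "species": "Iris versicolor"},
    {"petal_length": 6.0, "petal_width": 2.5, "species": "Iris virginica"},
]

LABEL = "species"  # the column we want to predict

# Split every instance into its features (inputs) and its label (output).
features = [{k: v for k, v in row.items() if k != LABEL} for row in rows]
labels = [row[LABEL] for row in rows]

print(features[0])
print(labels)
```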

Demo: The General Social Survey
• Sociology survey given in the United States since 1972
• Data is 39,000 responses, almost 400 questions each
• Demographic data like income, race, gender, education, marital status
• Many questions about personal beliefs
• “Should an atheist be allowed to teach college, or not?”
• “Are we spending the right amount of money on education?”
• Can we predict income from these responses?

How good is our model?
• The model looks good, but how do we quantify this?

Split 100% of the data into an 80% training set and a 20% holdout set:
1. Train a model using the 80% training set
2. Pretend the 20% holdout is new data, and feed it to the model
3. Check accuracy of predictions
Example: 3 out of 4 predictions are correct, so accuracy = 75%
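
A minimal sketch of the split and the accuracy check, assuming a simple random shuffle. The helper names `split_holdout` and `accuracy` are hypothetical, and the predictions are made up to match the 3-out-of-4 example.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (training, holdout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, actual_labels):
    """Fraction of predictions that match the known labels."""
    correct = sum(p == a for p, a in zip(predictions, actual_labels))
    return correct / len(actual_labels)


training, holdout = split_holdout(range(20))
print(len(training), len(holdout))   # 16 and 4 rows

# Three of four holdout predictions match, as in the example above.
print(accuracy(["a", "b", "a", "b"], ["a", "b", "a", "a"]))
```

The model never sees the holdout rows during training, so the accuracy on them estimates how it will do on genuinely new data.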

Predicting political views
• What happens if we predict political views instead of income?
• A different subset of variables becomes important!

Finding the important variables

The Value of Predictive Modeling
• Provides deep insight into your data
• Finds the small subset of important variables
• Extremely useful for business!

Demo: The StumbleUpon Dataset
• StumbleUpon is an app that recommends web pages
• Dataset of 7,400 web pages is provided, with each page labeled as
either “evergreen” or “ephemeral”
• We want to predict the page’s class using this historical data
While some pages we recommend, such as news
articles or seasonal recipes, are only relevant for a
short period of time, others maintain a timeless
quality and can be recommended to users long after
they are discovered. In other words, pages can
either be classified as "ephemeral" or "evergreen".

Training a model on StumbleUpon data
• Live demo: training a model on StumbleUpon data
• Key concepts:
• “Bag of words” text analysis
• Evaluating the model using a holdout set
• Combining multiple models to improve accuracy
• The “ensemble” of multiple models has better accuracy!
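
“Bag of words” means a page's text is reduced to word counts, ignoring word order. A rough sketch with a simple regular-expression tokenizer follows. The tokenizer and the sample text are assumptions, not what the demo tool does internally.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical page text; word order is discarded, only counts remain.
bag = bag_of_words("Easy chicken recipe: the best chicken recipe ever")
print(bag["chicken"], bag["recipe"], bag["news"])
```

Each distinct word then becomes a feature column whose value is its count on that page.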

“Ensembles” of Models
• Training multiple models on random subsets of the data gave us a
better result!
• Why?

Bias and Variance
• We train a model with the goal of fitting it correctly to the data
• When a model isn’t flexible enough, it may underfit the data, and we
say it has high bias
• When a model is too flexible, it may overfit the data, and we say it
has high variance
For a formal definition of bias and variance, see
Thomas Dietterich’s paper on the subject

Decision trees have high variance
• Decision trees can represent complex functions
• But they are prone to overfitting; they have high variance
• If you draw enough lines, you can create a “model” that just
memorizes the dataset!

Decision trees have high variance
• We can reduce this problem by:
• Taking several random samples from the original data set
• Training a decision tree on each sample
• Having these trees vote on the class
• Goal: Get the expressiveness of a decision tree, with less overfitting
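
The three steps can be sketched as follows. To keep the example short, a one-threshold “stump” stands in for a full decision tree. That substitution, the function names, and the data are all assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(rows):
    """Stand-in for a decision tree: one threshold on a single feature."""
    values = sorted({x for x, _ in rows})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    labels = sorted({label for _, label in rows})
    candidates = [(t, lo, hi) for t in thresholds for lo in labels for hi in labels]

    def errors(candidate):
        t, lo, hi = candidate
        return sum(label != (lo if x <= t else hi) for x, label in rows)

    threshold, low, high = min(candidates, key=errors)
    return lambda x: low if x <= threshold else high


def train_ensemble(rows, n_trees=5, seed=0):
    """Train one model per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(rows, k=len(rows))) for _ in range(n_trees)]


def vote(models, x):
    """Each model predicts a class; the most common prediction wins."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Hypothetical (petal width, color) rows, not the actual Iris data.
data = [(0.2, "red"), (0.3, "red"), (0.4, "red"), (0.5, "red"),
        (1.8, "blue"), (2.0, "blue"), (2.3, "blue"), (2.5, "blue")]
ensemble = train_ensemble(data)
print(vote(ensemble, 0.25), vote(ensemble, 2.1))
```

Because each model sees a different random sample, each one draws its boundary in a slightly different place; the vote averages out those differences.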

[Figure, “Single Tree”: 100% of data → one tree → prediction]

[Figure, “Ensemble of Trees”: 100% of data → five bootstrap samples → one tree per sample → vote on prediction]

[Figure: boundary between a blue side and a red side, with the trees' votes in three regions: 2-1 Blue, 2-1 Red, 2-1 Blue]

Benefits of a Decision Tree Ensemble
• Voted boundary is more accurate than for a single tree
• “Best of both worlds”: Get most of the expressiveness of decision
trees with lower variance
• We’re actually taking advantage of the variance by feeding a different
random sample to each tree and seeing what happens!

Why draw straight lines in decision trees?
• Imagine you have 400 variables in your dataset
• You only need to examine 400 variables, one at a time, to draw
the “best” straight (axis-parallel) line between the dots
• If you want a diagonal line in two dimensions,
there are (400 choose 2) or 79,800
combinations of variables to examine
• Some biology datasets have 100,000 variables!
• (100,000 choose 2) = 4,999,950,000
combinations of 2 variables!
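
The counts above can be checked with Python's built-in `math.comb`. The helper name `diagonal_candidates` is hypothetical.

```python
import math


def diagonal_candidates(n_variables):
    """Number of variable pairs to examine for a diagonal line in two dimensions."""
    return math.comb(n_variables, 2)


print(diagonal_candidates(400))       # 79800
print(diagonal_candidates(100_000))   # 4999950000
```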

Popular algorithms for supervised learning
• We got pretty deep into Decision Trees and ensembles of trees
• Other popular algorithms for supervised learning:
• Support Vector Machines
• Neural Nets (“Deep Learning”)
• Check out BigML’s automated deep learning!

Recap: Supervised Learning Topics
• Definition of supervised learning
• Training and scoring a model
• Support and confidence
• Model evaluation using a holdout set
• Bias and variance, underfitting and overfitting
• Using ensembles to improve models
• … And a whole lot about decision trees!

[Figure: petal width (cm) against petal length (cm)]

What if we don’t have labels?
• Can we still get insight into our data if we don’t know the
colors of the dots?
• Since we don’t have labels, this is unsupervised learning
• Clustering: Find “clumps” of unlabeled data that might be interesting
• Anomaly detection: Find outliers in unlabeled data
• Topic Modeling: Identify topics in free text

Clustering
• Concept: Find “lumps” of data that exist in distinct clusters
• K-means clustering:
1. Choose the number of clusters k that you are looking for
2. Choose initial “centroids” for the clusters
3. Assign each data point to its closest centroid
4. Move each centroid to the actual center of the data points assigned to it
5. Repeat steps 3 and 4 until the k centroids stop moving
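
A compact sketch of steps 3 to 5, assuming the caller supplies k initial centroids (steps 1 and 2). The points and the starting centroids are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Plain k-means on 2-D points; k is the number of initial centroids given."""
    for _ in range(max_iterations):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)

        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid   # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]

        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two obvious clumps; k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the initial centroids; real implementations usually try several starting points.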

Demo: The Whisky Dataset
• Data on the flavors of 86 single-malt Scotch whiskies
• No labels, just a bunch of taste information
• Can we get insight into this dataset?

Demo: Breast Cancer Dataset
• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• We can train a highly accurate predictive model with this
data

Demo: Breast Cancer Dataset
• What if we remove the labels of “benign” and “malignant”?

[Figure: 10 lines are needed to isolate this data point (not anomalous)]

[Figure: only 4 lines are needed to isolate this data point (highly anomalous)]
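
The "count the lines" idea can be sketched by drawing random axis-parallel cuts until a point is alone, then averaging over many attempts. This is the intuition behind isolation-based detectors such as isolation forests. The slides do not name the algorithm used in the demo, so treat this as an assumption; the function names and points are hypothetical.

```python
import random


def splits_to_isolate(target, points, rng):
    """Count random axis-parallel cuts until `target` is alone in its region."""
    region = list(points)
    cuts = 0
    while len(region) > 1:
        axis = rng.randrange(len(target))
        low = min(p[axis] for p in region)
        high = max(p[axis] for p in region)
        if low == high:
            continue  # this axis cannot separate the remaining points
        cut = rng.uniform(low, high)
        # Keep only the points on the same side of the cut as the target.
        region = [p for p in region if (p[axis] < cut) == (target[axis] < cut)]
        cuts += 1
    return cuts


def average_splits(target, points, trials=200, seed=0):
    """Average number of cuts over many random attempts; fewer = more anomalous."""
    rng = random.Random(seed)
    total = sum(splits_to_isolate(target, points, rng) for _ in range(trials))
    return total / trials


# Hypothetical data: a tight 3x3 grid of points plus one far-away outlier.
grid = [(x, y) for x in (1.0, 1.5, 2.0) for y in (1.0, 1.5, 2.0)]
outlier = (10.0, 10.0)
points = grid + [outlier]

print(average_splits((1.5, 1.5), points))   # centre of the clump: many cuts
print(average_splits(outlier, points))      # outlier: very few cuts
```

The sketch assumes all points are distinct; duplicates of the target could never be separated from it.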

Demo: Anomaly Detection
• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!

Minority Report
• Anomaly detection works great on large unlabeled datasets,
especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!

Topic Modeling using LDA
• Uncovers groups of related words (“topics”) in documents
• Does not require an external corpus (e.g. training on Wikipedia)
• No semantic parsing of text
• Unsupervised

Topic modeling on IMDB reviews
• 52,000 reviews
• 883 movies
