Data Science 101
David Gerster
Strategic Advisory Board
About me
• 10+ years of experience in data science at various consumer
web companies
• Worked on web search at Yahoo and Microsoft
• Led the Mobile data science team at Groupon
• Joined BigML as VP Data Science in July 2013
• Joined JLL Spark as VP Data in July 2017
• Advisor to High Fidelity Genetics
Finding meaningful patterns in data
• The famous “Iris” data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
[Images: Iris setosa, Iris versicolor, Iris virginica]
[Figure: scatter plot of petal width (cm) against petal length (cm). Iris setosa: red dots; Iris versicolor: green dots; Iris virginica: blue dots]
Congratulations! You just trained a model.
[Figure: the petal width (cm) against petal length (cm) plot with four new flowers placed on it. Predictions: Iris setosa, Iris versicolor, Iris virginica, Iris virginica]
Congratulations! You just scored
four new flowers using your model,
and made a prediction about the
species of each one.
Training versus Scoring
• This process had two steps: training and scoring
• When training on historical data, you’re using data gathered over
some length of time
• When scoring new data points, you want the answer immediately
(in “real time”)
[Figure: two rectangles drawn on the plot. One predicts “blue” with high confidence and explains a large chunk of the data (high support). The other predicts “blue” with low confidence and explains a small chunk of the data (low support).]
Support and Confidence
• A rectangle with a large number of data points has high “support”
• A rectangle that is purely one color has high “confidence”
• If there is a small number of data points, confidence is low even if
it’s purely one color
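The two measures can be sketched in a few lines of Python. The slides do not say how confidence is computed, so this sketch assumes one common choice, the lower bound of the Wilson score interval, which shrinks toward zero when a rectangle holds few points even if all of them share one color. The function names and the default `z=1.96` are illustrative, not taken from the deck.

```python
import math

def support(n_in_rectangle, n_total):
    """Fraction of all data points that fall inside the rectangle."""
    return n_in_rectangle / n_total

def wilson_lower_bound(n_majority, n_in_rectangle, z=1.96):
    """Pessimistic estimate of the majority color's share (assumed confidence measure)."""
    if n_in_rectangle == 0:
        return 0.0
    n = n_in_rectangle
    p = n_majority / n                      # observed purity of the rectangle
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# A pure rectangle with 45 points versus a pure rectangle with only 2 points
big = wilson_lower_bound(45, 45)
small = wilson_lower_bound(2, 2)
print(round(big, 3), round(small, 3))
```

Both rectangles are purely one color, yet the small one scores far lower, which matches the last bullet above.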
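[Figure: petal width (cm) against petal length (cm), followed by the “Decision Tree” built from the same data; the end points of the tree are its “Leaf Nodes”]
50 red, 50 blue, 50 green
• Width <= 0.8? 50 red (leaf node)
• Width > 0.8? 50 blue, 50 green
  • Width > 1.75? 45 blue (leaf node)
  • Width <= 1.75? 5 blue, 50 green
    • Length <= 5? 1 blue, 48 green (leaf node)
    • Length > 5? 4 blue, 2 green (leaf node)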
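The decision tree above is just a set of nested questions, so it can be written as a small function. This is a sketch under stated assumptions: the color-to-species mapping comes from the earlier plot legend (red = Iris setosa, green = Iris versicolor, blue = Iris virginica), "Width" and "Length" are taken to mean petal width and petal length in cm, and each leaf predicts its majority color. The function name is illustrative.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Walk the tree from the root to a leaf and return the leaf's majority class."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green
    return "Iris virginica"         # leaf: 4 blue, 2 green

print(predict_species(petal_length_cm=1.4, petal_width_cm=0.2))
print(predict_species(petal_length_cm=4.5, petal_width_cm=1.4))
```

Scoring a new flower is a handful of comparisons, which is why predictions can be returned in real time.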
• Data is just a table of values
• Each row is an instance, an example
of the concept to be learned
• Each column is an attribute or
feature of the instance
• The column we want to predict is the
label or output
• Because we have a label, this is
supervised learning
[Figure: a table of data with rows marked “instance” and columns marked “feature” and “label”]
Demo: The General Social Survey
• Sociology survey given in the United States since 1972
• Data is 39,000 responses, almost 400 questions each
• Demographic data like income, race, gender, education, marital status
• Many questions about personal beliefs
• “Should an atheist be allowed to teach college, or not?”
• “Are we spending the right amount of money on education?”
• Can we predict income from these responses?
How good is our model?
• The model looks good, but how do we quantify this?
[Figure: 100% of data split into an 80% training set and a 20% holdout set. 3 out of 4 predictions are correct: Accuracy = 75%]
1. Train a model using the 80% training set
2. Pretend the 20% holdout is new data, and feed it to the model
3. Check accuracy of predictions
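The three steps above can be sketched with the Python standard library. The function names, the fixed random seed, and the toy labels are illustrative assumptions; any model can be plugged in between the split and the accuracy check.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows, then set aside a fraction of them as the holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]   # (training set, holdout set)

def accuracy(predictions, labels):
    """Share of predictions that match the known labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

training, holdout = holdout_split(range(100))
print(len(training), len(holdout))

# 3 out of 4 predictions are correct
print(accuracy(["high", "low", "high", "high"], ["high", "low", "high", "low"]))
```

The model never sees the holdout rows during training, so its accuracy on them is an honest estimate of how it will behave on new data.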
Predicting political views
• What happens if we predict political views instead of income?
• A different subset of variables becomes important!
Finding the important variables
The Value of Predictive Modeling
• Provides deep insight into your data
• Finds the small subset of important variables
• Extremely useful for business!
Demo: The StumbleUpon Dataset
• StumbleUpon is an app that recommends web pages
• Dataset of 7,400 web pages is provided, with each page labeled as
either “evergreen” or “ephemeral”
• We want to predict the page’s class using this historical data
While some pages we recommend, such as news
articles or seasonal recipes, are only relevant for a
short period of time, others maintain a timeless
quality and can be recommended to users long after
they are discovered. In other words, pages can
either be classified as "ephemeral" or "evergreen".
Training a model on StumbleUpon data
• Live demo: training a model on StumbleUpon data
• Key concepts:
• “Bag of words” text analysis
• Evaluating the model using a holdout set
• Combining multiple models to improve accuracy
• The “ensemble” of multiple models has better accuracy!
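As a rough illustration of the “bag of words” idea listed above: a page's text is reduced to word counts, ignoring word order, and those counts become the features a model trains on. This is a minimal sketch, not the tokenizer used in the live demo; the regular expression and the function name are assumptions.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into words, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = bag_of_words("Evergreen recipes: easy recipes!")
print(counts["recipes"], counts["easy"], counts["news"])
```

A word that never appears simply has a count of zero, so every page can be described with the same set of columns.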
“Ensembles” of Models
• Training multiple models on random subsets of the data gave us a
better result!
• Why?
Bias and Variance
• We train a model with the goal of fitting it correctly to the data
• When a model isn’t flexible enough, it may underfit the data, and we
say it has high bias
• When a model is too flexible, it may overfit the data, and we say it
has high variance
For a formal definition of bias and variance, see
Thomas Dietterich’s paper on the subject
[Figure: High Bias]
[Figure: High Variance]
Decision trees have high variance
• Decision trees can represent complex functions
• But they are prone to overfitting; they have high variance
• If you draw enough lines, you can create a “model” that just
memorizes the dataset!
Decision trees have high variance
• We can reduce this problem by:
• Taking several random samples from the original data set
• Training a decision tree on each sample
• Having these trees vote on the class
• Goal: Get the expressiveness of a decision tree, with less overfitting
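The sample, train, and vote recipe above can be sketched as follows. To keep the example short it assumes a deliberately tiny stand-in for a decision tree: a one-question "stump" on a single numeric feature. The function names, the toy data, and the choice of five trees are illustrative assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows):
    """Find the single threshold question that makes the fewest training errors."""
    labels = sorted({label for _, label in rows})
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for left in labels:
            for right in labels:
                errors = sum((left if x <= threshold else right) != label
                             for x, label in rows)
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def train_ensemble(rows, n_trees=5, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def vote(models, x):
    """Each model predicts a class; the most common prediction wins."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

rows = [(x, "red") for x in range(10)] + [(x, "blue") for x in range(10, 20)]
models = train_ensemble(rows)
print(vote(models, 0), vote(models, 19))
```

Each stump sees a slightly different sample and so draws its line in a slightly different place; the vote averages those differences out.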
[Figure: Single Tree. 100% of data feeds one tree, which makes the prediction]
[Figure: Ensemble of Trees. 100% of data is drawn into five bootstrap samples, one tree is trained on each, and the trees vote on the prediction]
[Figure: a plot with a blue side and a red side, where three trees vote in three regions. Vote: 2-1, Blue. Vote: 2-1, Red. Vote: 2-1, Blue]
Benefits of a Decision Tree Ensemble
• Voted boundary is more accurate than for a single tree
• “Best of both worlds”: Get most of the expressiveness of decision
trees with lower variance
• We’re actually taking advantage of the variance by feeding a different
random sample to each tree and seeing what happens!
Why draw straight lines in decision trees?
• Imagine you have 400 variables in your dataset
• You only need to examine 400 variables to draw
the “best” straight line between the dots
• If you want a diagonal line in two dimensions,
there are (400 choose 2) or 79,800
combinations of variables to examine
• Some biology datasets have 100,000 variables!
• (100,000 choose 2) = 4,999,950,000
combinations of 2 variables!
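The counts above can be checked directly with `math.comb` from the Python standard library (available from Python 3.8); the snippet is only a verification of the arithmetic.

```python
import math

# Straight (axis-aligned) lines: one candidate variable at a time
print(400)

# Diagonal lines in two dimensions: every pair of variables
print(math.comb(400, 2))
print(math.comb(100_000, 2))
```

The number of pairs grows with the square of the number of variables, which is why the single-variable search stays practical on wide datasets.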
Popular algorithms for supervised learning
• We got pretty deep into Decision Trees and ensembles of trees
• Other popular algorithms for supervised learning:
• Support Vector Machines
• Neural Nets (“Deep Learning”)
• Check out BigML’s automated deep learning!
Recap: Supervised Learning Topics
• Definition of supervised learning
• Training and scoring a model
• Support and confidence
• Model evaluation using a holdout set
• Bias and variance, underfitting and overfitting
• Using ensembles to improve models
• … And a whole lot about decision trees!
[Figure: scatter plot of petal width (cm) against petal length (cm)]
What if we don’t have labels?
• Can we still get insight into our data if we don’t know the
colors of the dots?
• Since we don’t have labels, this is unsupervised learning
• Clustering: Find “clumps” of unlabeled data that might be interesting
• Anomaly detection: Find outliers in unlabeled data
• Topic Modeling: Identify topics in free text
Clustering
• Concept: Find “lumps” of data that exist in distinct clusters
• K-means clustering:
1. Choose the number of clusters k that you are looking for
2. Choose initial “centroids” for the clusters
3. Assign each data point to its closest centroid
4. Compute the actual center of each set of data points, and move the centroid there
5. Repeat steps 3 and 4 until the k centroids stop moving
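The K-means steps above can be sketched in plain Python. Assumptions made for illustration: initial centroids are picked at random from the data points, distance is squared Euclidean, and the loop gives up after `max_iters` rounds; the function name and the toy points are hypothetical.

```python
import random

def kmeans(points, k, seed=0, max_iters=100):
    """Return (centroids, clusters) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 2: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Step 4: move each centroid to the center of its points
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

On data with two well-separated clumps like this, the two clusters recover the clumps; on messier data the result can depend on the initial centroids.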
Demo: The Whisky Dataset
• Data on the flavors of 86 single-malt Scotch whiskies
• No labels, just a bunch of taste information
• Can we get insight into this dataset?
Demo: Breast Cancer Dataset
• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• We can train a highly accurate predictive model with this
data
Demo: Breast Cancer Dataset
• What if we remove the labels of “benign” and “malignant”?
[Figure: 10 lines are needed to isolate one data point (not anomalous); only 4 lines are needed to isolate another data point (highly anomalous)]
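The "how many lines to isolate a point" idea can be sketched as a toy experiment: keep drawing random straight lines, throw away everything on the far side of each line, and count the lines needed until the point is alone. This is a simplified illustration, not the detector used in the demo; the function names, the number of trials, and the toy data are assumptions.

```python
import random

def isolation_depth(point, data, rng, max_depth=50):
    """Count random axis-aligned cuts needed until `point` is alone (point must be in data)."""
    remaining = list(data)
    depth = 0
    while len(remaining) > 1 and depth < max_depth:
        # Only cut along axes where the remaining points still differ
        axes = [a for a in range(len(point))
                if min(p[a] for p in remaining) != max(p[a] for p in remaining)]
        if not axes:
            break
        axis = rng.choice(axes)
        lo = min(p[axis] for p in remaining)
        hi = max(p[axis] for p in remaining)
        cut = rng.uniform(lo, hi)
        # Keep only the points on the same side of the cut as `point`
        remaining = [p for p in remaining if (p[axis] < cut) == (point[axis] < cut)]
        depth += 1
    return depth

def average_depth(point, data, trials=200, seed=0):
    """Average the number of cuts over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials

grid = [(x, y) for x in range(5) for y in range(5)]   # a tight clump of 25 points
data = grid + [(100, 100)]                            # plus one far-away point
print(average_depth((100, 100), data) < average_depth((2, 2), data))
```

The far-away point is usually cut off by the very first line, while a point in the middle of the clump needs several, so fewer lines means more anomalous.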
Demo: Anomaly Detection
• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
Who Needs Labels?
Minority Report
• Anomaly detection works great on large unlabeled datasets,
especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!
Topic Modeling using LDA
• Uncovers groups of related words (“topics”) in documents
• Does not require an external corpus (e.g. training on Wikipedia)
• No semantic parsing of text
• Unsupervised
Topic modeling on IMDB reviews
• 52,000 reviews
• 883 movies
[Figure: Top 3 Topics in Shrek Reviews (n=26), showing the topics and the topic distribution for this document]
Borrowed/stolen from Prof. David Blei, with apologies
The (assumed) generative process
[Diagram labels: a distribution over topic distributions, fixed for this corpus; a distribution over topics, specific to each document; a distribution over word distributions, fixed for this corpus; a topic, which is a distribution over words (Topic 1, Topic 2, Topic 3; Word 1, Word 2, Word 3); a word in a document (“children”)]
What we observe
[Diagram label: a word in a document (“children”)]
[Figures: topics for Shrek (n = 26), The Sum of All Fears (n = 31), and Love, Actually (n = 100)]
How do we get such “good” topics?
• Imagine that each document can only belong to one topic
• Does that make it easier or harder to find “good” clusters of words?
• LDA allows documents to belong to multiple topics
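The assumed generative process can be sketched in a few lines: each document gets its own mix of topics, and each word is produced by first picking a topic from that mix and then picking a word from that topic. This is an illustration only. The two hand-written topics, their word probabilities, and the `alpha` value are made up for the example; in LDA the topics themselves are also drawn from a distribution, and the real work is running this process in reverse to infer the topics from the observed words.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw a random probability vector (normalized gamma draws)."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, n_words, alpha=0.5, seed=0):
    """Generate one document: a per-document topic mix, then one topic and one word per position."""
    rng = random.Random(seed)
    topic_mix = sample_dirichlet([alpha] * len(topics), rng)   # specific to this document
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_mix)[0]      # pick a topic for this word
        vocab = list(topic)
        word = rng.choices(vocab, weights=[topic[w] for w in vocab])[0]
        words.append(word)
    return topic_mix, words

# Hypothetical topics: each is a distribution over words
topics = [
    {"children": 0.5, "funny": 0.3, "donkey": 0.2},
    {"war": 0.4, "nuclear": 0.4, "thriller": 0.2},
]
topic_mix, words = generate_document(topics, n_words=20)
print([round(w, 2) for w in topic_mix])
print(words)
```

Because the topic mix is a full distribution rather than a single choice, one document can draw words from several topics at once.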
Recap: Unsupervised Learning Topics
• Unsupervised learning uses unlabeled data
• Clustering: Finding clumps in unlabeled data
• Anomaly Detection: Finding “weird” instances in unlabeled data
• Topic Modeling: Extracting meaningful topics from free text
Final Thought
• Supervised learning has many different algorithms to solve one
problem (predicting the output)
• Unsupervised learning has many different algorithms to solve many
different problems
David Gerster
gerster@bigml.com
Backup Slides
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
 
April 2024 Nostalgia Products Newsletter
April 2024 Nostalgia Products NewsletterApril 2024 Nostalgia Products Newsletter
April 2024 Nostalgia Products Newsletter
 
PriyoShop Celebration Pohela Falgun Mar 20, 2024
PriyoShop Celebration Pohela Falgun Mar 20, 2024PriyoShop Celebration Pohela Falgun Mar 20, 2024
PriyoShop Celebration Pohela Falgun Mar 20, 2024
 
Putting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptxPutting the SPARK into Virtual Training.pptx
Putting the SPARK into Virtual Training.pptx
 
Filing Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed GuideFiling Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed Guide
 
Premium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern BusinessesPremium MEAN Stack Development Solutions for Modern Businesses
Premium MEAN Stack Development Solutions for Modern Businesses
 
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdfMeas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
Meas_Dylan_DMBS_PB1_2024-05XX_Revised.pdf
 

Data Science 101

  • 1. Data Science 101 David Gerster Strategic Advisory Board
  • 2. About me • 10+ years experience in data science at various consumer web companies • Worked on web search at Yahoo and Microsoft • Led the Mobile data science team at Groupon • Joined BigML as VP Data Science in July 2013 • Joined JLL Spark as VP Data in July 2017 • Advisor to High Fidelity Genetics
  • 3. Finding meaningful patterns in data • The famous “Iris” data set has measurements for 150 flowers • Given a flower’s measurements, can we predict its species? Iris setosa, Iris versicolor, Iris virginica
  • 4. Axes: Petal Width (cm), Petal Length (cm) • Iris setosa, red dots • Iris versicolor, green dots • Iris virginica, blue dots
  • 6. Axes: Petal Width (cm), Petal Length (cm) • Prediction: Iris setosa • Prediction: Iris versicolor • Prediction: Iris virginica • Prediction: Iris virginica
  • 7. Axes: Petal Width (cm), Petal Length (cm) • Prediction: Iris setosa • Prediction: Iris versicolor • Prediction: Iris virginica • Prediction: Iris virginica • Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.
  • 8. Training versus Scoring • This process had two steps: training and scoring • When training on historical data, you’re using data gathered over some length of time • When scoring new data points, you want the answer immediately (in “real time”)
  • 9. Predicts “blue” with high confidence: explains a large chunk of the data (high support) • Predicts “blue” with low confidence: explains a small chunk of the data (low support)
  • 10. Support and Confidence • A rectangle with a large number of data points has high “support” • A rectangle that is purely one color has high “confidence” • If there is a small number of data points, confidence is low even if it’s purely one color
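The slide does not say how confidence is computed. One way to get a score that stays low for a small, pure rectangle is the lower bound of the Wilson score interval on the majority-class proportion; that choice, the function name, and the `z` value below are assumptions for illustration, not something the deck states.

```python
import math

def support_and_confidence(labels, z=1.96):
    """Return (majority label, support, confidence) for the points in one rectangle.

    Assumes `labels` is non-empty. Confidence is the Wilson score lower bound
    (an assumed choice) on the share of points carrying the majority label.
    """
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    p = counts[majority] / n

    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    confidence = (centre - margin) / (1 + z * z / n)
    return majority, n, confidence

# Two rectangles that are both purely "blue" but differ in support.
big = support_and_confidence(["blue"] * 50)
small = support_and_confidence(["blue"] * 2)
print(big, small)
```

Both rectangles are pure, yet the one holding only two points receives a much lower confidence, which matches the third bullet of the slide.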
  • 11. “Decision Tree” on Petal Width (cm) and Petal Length (cm) • Root: 50 red, 50 blue, 50 green • Width <= 0.8? Leaf node: 50 red • Width > 0.8? 50 blue, 50 green • Width > 1.75? Leaf node: 45 blue • Width <= 1.75? 5 blue, 50 green • Length <= 5? Leaf node: 1 blue, 48 green • Length > 5? Leaf node: 4 blue, 2 green
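The tree on this slide can be read as a handful of nested rules. The sketch below assumes that each leaf predicts its majority color and uses the color key from slide 4 (red = Iris setosa, green = Iris versicolor, blue = Iris virginica); the function name and the sample measurements are made up for illustration.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Score one flower with the thresholds shown on the slide."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green (majority green)
    return "Iris virginica"         # leaf: 4 blue, 2 green (majority blue)

# Hypothetical new flowers as (petal length, petal width) in cm.
for length, width in [(1.4, 0.2), (4.5, 1.5), (6.0, 2.2), (5.2, 1.6)]:
    print(length, width, predict_species(length, width))
```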
  • 12. • Data is just a table of values • Each row is an instance, an example of the concept to be learned • Each column is an attribute or feature of the instance • The column we want to predict is the label or output • Because we have a label, this is supervised learning • Figure labels: instance, feature, label
  • 13. Demo: The General Social Survey • Sociology survey given in the United States since 1972 • Data is 39,000 responses, almost 400 questions each • Demographic data like income, race, gender, education, marital status • Many questions about personal beliefs • “Should an atheist be allowed to teach college, or not?” • “Are we spending the right amount of money on education?” • Can we predict income from these responses?
  • 14. How good is our model? • The model looks good, but how do we quantify this?
  • 15. 100% of data is split into an 80% training set and a 20% holdout set • 1. Train a model using 80% training set 2. Pretend 20% holdout is new data, and feed it to the model 3. Check accuracy of predictions • Example: 3 out of 4 predictions are correct, so Accuracy = 75%
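The holdout procedure can be sketched in a few lines. The helper names, the random seed, and the toy predictions below are assumptions for illustration; the deck itself runs this step inside a live demo.

```python
import random

def train_holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction as the holdout set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(predicted, actual):
    """Share of holdout predictions that match the known labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# 100 placeholder rows: 80 go to training, 20 are held out.
train, holdout = train_holdout_split(list(range(100)))

# Toy scoring result: 3 of the 4 predictions match the true labels.
example_accuracy = accuracy(["a", "b", "a", "b"], ["a", "b", "a", "a"])
print(len(train), len(holdout), example_accuracy)
```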
  • 16. Predicting political views • What happens if we predict political views instead of income? • A different subset of variables becomes important!
  • 17.
  • 18. Finding the important variables
  • 19.
  • 20. The Value of Predictive Modeling • Provides deep insight into your data • Finds the small subset of important variables • Extremely useful for business!
  • 21. Demo: The StumbleUpon Dataset • StumbleUpon is an app that recommends web pages • Dataset of 7,400 web pages is provided, with each page labeled as either “evergreen” or “ephemeral” • We want to predict the page’s class using this historical data • “While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen".”
  • 22. Training a model on StumbleUpon data • Live demo: training a model on StumbleUpon data • Key concepts: • “Bag of words” text analysis • Evaluating the model using a holdout set • Combining multiple models to improve accuracy • The “ensemble” of multiple models has better accuracy!
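As a rough illustration of “bag of words” text analysis: a page is reduced to word counts, ignoring word order, and those counts become the features a model trains on. The tokenization rule and the sample page text below are assumptions, not taken from the StumbleUpon data.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# Hypothetical page title.
page = "Easy pasta recipe: the best pasta sauce"
bag = bag_of_words(page)
print(bag)
```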
  • 23. “Ensembles” of Models • Training multiple models on random subsets of the data gave us a better result! • Why?
  • 24. Bias and Variance • We train a model with the goal of fitting it correctly to the data • When a model isn’t flexible enough, it may underfit the data, and we say it has high bias • When a model is too flexible, it may overfit the data, and we say it has high variance • For a formal definition of bias and variance, see Thomas Dietterich’s paper on the subject
  • 27. Decision trees have high variance • Decision trees can represent complex functions • But they are prone to overfitting; they have high variance • If you draw enough lines, you can create a “model” that just memorizes the dataset!
  • 28.
  • 29. Decision trees have high variance • We can reduce this problem by: • Taking several random samples from the original data set • Training a decision tree on each sample • Having these trees vote on the class • Goal: Get the expressiveness of a decision tree, with less overfitting
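The three steps on this slide (sample, train one tree per sample, vote) can be sketched as follows. To keep the example short it makes two simplifying assumptions that the deck does not specify: each "tree" is a one-split stump on a single feature, and each random sample is drawn with replacement (a bootstrap sample). The toy data and all names are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-split tree. `rows` is a list of (feature_value, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def train_ensemble(rows, n_trees=25, seed=0):
    """Train one stump per random sample of the original data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        trees.append(train_stump(sample))
    return trees

def vote(trees, x):
    """Each tree predicts a class; the most common prediction wins."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

rows = [(0.2, "red"), (0.3, "red"), (0.4, "red"),
        (1.3, "blue"), (1.5, "blue"), (1.8, "blue")]
ensemble = train_ensemble(rows)
print(vote(ensemble, 0.1), vote(ensemble, 2.0))
```

Individual stumps place their threshold differently depending on the sample they saw, and the vote averages those differences out.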
  • 32.
  • 33.
  • 34.
  • 35.
  • 36. Blue side • Red side • Vote: 2-1, Blue • Vote: 2-1, Red • Vote: 2-1, Blue
  • 37. Benefits of a Decision Tree Ensemble • Voted boundary is more accurate than for a single tree • “Best of both worlds”: Get most of the expressiveness of decision trees with lower variance • We’re actually taking advantage of the variance by feeding a different random sample to each tree and seeing what happens!
  • 38. Why draw straight lines in decision trees? • Imagine you have 400 variables in your dataset • You only need to examine 400 variables to draw the “best” straight (axis-parallel) line between the dots • If you want a diagonal line in two dimensions, there are (400 choose 2) or 79,800 pairs of variables to examine • Some biology datasets have 100,000 variables! • (100,000 choose 2) = 4,999,950,000 combinations of 2 variables!
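The counts on this slide can be checked directly. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the function names are made up for illustration.

```python
import math

def axis_parallel_candidates(n_variables):
    """A straight split looks at one variable at a time."""
    return n_variables

def diagonal_candidates(n_variables):
    """A diagonal line in two dimensions needs a pair of variables."""
    return math.comb(n_variables, 2)

for n in (400, 100_000):
    print(n, axis_parallel_candidates(n), diagonal_candidates(n))
```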
  • 39. Popular algorithms for supervised learning • We got pretty deep into Decision Trees and ensembles of trees • Other popular algorithms for supervised learning: • Support Vector Machines • Neural Nets (“Deep Learning”) • Check out BigML’s automated deep learning!
  • 40. Recap: Supervised Learning Topics • Definition of supervised learning • Training and scoring a model • Support and confidence • Model evaluation using a holdout set • Bias and variance, underfitting and overfitting • Using ensembles to improve models • … And a whole lot about decision trees!
  • 42. What if we don’t have labels? • Can we still get insight into our data if we don’t know the colors of the dots? • Since we don’t have labels, this is unsupervised learning • Clustering: Find “clumps” of unlabeled data that might be interesting • Anomaly detection: Find outliers in unlabeled data • Topic Modeling: Identify topics in free text
  • 43. Clustering • Concept: Find “lumps” of data that exist in distinct clusters • K-means clustering: 1. Choose a number of clusters k that you are looking for 2. Choose initial “centroids” for the clusters 3. Assign each data point to its closest centroid 4. Compute the actual center of each resulting set of data points and move the centroid there 5. Repeat steps 3 and 4 until the k centroids stop moving
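The k-means steps above can be sketched as a short loop. This is a minimal illustration rather than a production implementation: it assumes the initial centroids are k randomly chosen data points, uses Euclidean distance, and leaves an empty cluster's centroid where it was. The toy points are made up.

```python
import math
import random

def k_means(points, k, seed=0, max_iterations=100):
    """Cluster a list of coordinate tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 2: initial centroids
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for point in points:                       # step 3: nearest centroid
            nearest = min(range(k),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        new_centroids = []
        for i, cluster in enumerate(clusters):     # step 4: actual centers
            if cluster:
                center = tuple(sum(c) / len(cluster) for c in zip(*cluster))
                new_centroids.append(center)
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep as is
        if new_centroids == centroids:             # step 5: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

The result depends on the initial centroids, so different seeds can give different clusterings on less clearly separated data.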
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57. Demo: The Whisky Dataset • Data on the flavors of 86 single-malt Scotch whiskies • No labels, just a bunch of taste information • Can we get insight into this dataset?
  • 58. Demo: Breast Cancer Dataset • Train a predictive model using the 699 biopsies • The “label” of benign or malignant is known for each one • We can train a highly accurate predictive model with this data
  • 59. Demo: Breast Cancer Dataset • What if we remove the labels of “benign” and “malignant”?
  • 60. 10 lines are needed to isolate this data point (not anomalous)
  • 61. Only 4 lines are needed to isolate this data point (highly anomalous)
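The idea of counting the lines needed to isolate a point can be sketched as below: keep drawing random axis-parallel lines, keep only the side containing the point of interest, and count lines until it is alone. This is an illustration in the spirit of isolation-based anomaly detection, not necessarily the detector used in the demo; the toy data, the number of trials, and all names are assumptions.

```python
import random

def lines_to_isolate(points, target, rng):
    """Count random axis-parallel lines drawn until `target` is alone.

    Assumes `target` is one of `points`.
    """
    remaining = list(points)
    lines = 0
    while len(remaining) > 1:
        # Only axes along which the remaining points still differ can be cut.
        axes = [a for a in range(len(target))
                if min(p[a] for p in remaining) < max(p[a] for p in remaining)]
        if not axes:
            break  # only exact duplicates of the target remain
        axis = rng.choice(axes)
        low = min(p[axis] for p in remaining)
        high = max(p[axis] for p in remaining)
        cut = rng.uniform(low, high)
        target_is_below = target[axis] < cut
        remaining = [p for p in remaining
                     if (p[axis] < cut) == target_is_below]
        lines += 1
    return lines

def average_lines(points, target, trials=200, seed=0):
    """Average the line count over many random attempts."""
    rng = random.Random(seed)
    total = sum(lines_to_isolate(points, target, rng) for _ in range(trials))
    return total / trials

# A dense 5 x 5 clump of points plus one point far away from it.
grid = [(x, y) for x in range(5) for y in range(5)]
outlier = (20, 20)
points = grid + [outlier]

inlier_lines = average_lines(points, (2, 2))
outlier_lines = average_lines(points, outlier)
print(inlier_lines, outlier_lines)
```

Points that need few lines on average are the anomalous ones, so the average line count can be turned into an anomaly score.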
  • 62. Demo: Anomaly Detection • Remove the labels of benign or malignant • Train an anomaly detector on this unlabeled data • Create a new dataset with the anomaly scores as “labels” • Use these “labels” to train a predictive model!
  • 64. Minority Report • Anomaly detection works great on large unlabeled datasets, especially if you expect to find an (adversarial) minority class • Millions of credit card transactions, billions of network events … • Doesn’t require you to know what you’re looking for!
  • 65. Topic Modeling using LDA • Uncovers groups of related words (“topics”) in documents • Does not require an external corpus (e.g. training on Wikipedia) • No semantic parsing of text • Unsupervised
  • 66. Topic modeling on IMDB reviews • 52,000 reviews • 883 movies
  • 67. Top 3 Topics in Shrek Reviews (n=26)
  • 68. Figure labels: Topics • Topic distribution for this document • Borrowed/stolen from Prof. David Blei, with apologies …
  • 69. The (assumed) generative process • “children”: a word in a document • A topic, which is a distribution over words (Topic 1, Topic 2, Topic 3) • A distribution over topics, specific to each document • A distribution over topic distributions, fixed for this corpus • A distribution over word distributions, fixed for this corpus (Word 1, Word 2, Word 3)
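The generative process described on this slide can be simulated in a few lines: draw each topic's word distribution from a corpus-level distribution, draw a topic mix for the document, then for every word pick a topic and then a word from that topic. The sketch assumes Dirichlet distributions for both corpus-level distributions (as in standard LDA) and samples them by normalizing gamma draws; the vocabulary, the parameter values, and the function names are made up, with only “children” taken from the slide.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocabulary, n_words, alpha, rng):
    """Generate one document under the assumed process."""
    # A distribution over topics, specific to this document.
    topic_mix = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(topics)), weights=topic_mix)[0]
        word = rng.choices(vocabulary, weights=topics[topic])[0]
        words.append(word)
    return topic_mix, words

rng = random.Random(0)
vocabulary = ["children", "family", "ogre", "spy", "bomb", "love"]

# Each topic is a distribution over words, drawn from a distribution over
# word distributions that is fixed for the corpus.
topics = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(3)]

topic_mix, document = generate_document(topics, vocabulary,
                                        n_words=20, alpha=0.3, rng=rng)
print(topic_mix)
print(document)
```

Topic modeling runs this story in reverse: only the words are observed, and the topics and per-document topic mixes are inferred from them.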
  • 70. What we observe • “children”: a word in a document
  • 73. n = 31 The Sum of All Fears
  • 74. n = 31 The Sum of All Fears
  • 75. n = 100 Love, Actually
  • 76. How do we get such “good” topics? • Imagine that each document can only belong to one topic • Does that make it easier or harder to find “good” clusters of words? • LDA allows documents to belong to multiple topics
  • 77. Recap: Unsupervised Learning Topics • Unsupervised learning uses unlabeled data • Clustering: Finding clumps in unlabeled data • Anomaly Detection: Finding “weird” instances in unlabeled data • Topic Modeling: Extracting meaningful topics from free text
  • 78. Final Thought • Supervised learning has many different algorithms to solve one problem (predicting the output) • Unsupervised learning has many different algorithms to solve many different problems • David Gerster, gerster@bigml.com
  • 80.