Predicting the Oscars with Data Science
http://bit.ly/tf-predict-oscars
About me
• Jasjit Singh
• Self-taught developer
• Worked in finance & tech
• Co-Founder Hotspot
• Thinkful General Manager
About us
Thinkful prepares students for web development &
data science jobs with 1-on-1 mentorship programs
What’s your background?
• I have a software background
• I have a math or stats background
• None of the above
Learning goals
• “Data Science Process”
• Python’s data science toolkit and methods
• Basic machine learning concepts
Data Science Process
• Frame the question.
• Collect the raw data.
• Process the data.
• Explore the data.
• Communicate results.
Frame the question
• Who will win the Oscar for Best Picture?
Collect the Data
• What kind of data do we need?
Collect the Data
• Financial data (Budget, box office…)
• Reviews, ratings and scores.
• Awards and nominations.
Process the data
• How’s the data “dirty” and how can we fix it?
Process the data
• User input, redundancies, missing data…
• Formatting: adapt the data to meet certain
specifications.
• Cleaning: detecting and correcting
corrupt or inaccurate records.
Explore the data
• What are the meaningful patterns in the
data?
• How meaningful is each data point for our
predictions?
Communicate the Data
Jupyter Notebooks
• One of a data scientist’s everyday tools.
• Lets us show our work: readers can read the story,
follow the process, and run the code.
• Find the links in our classroom tool.
• Find the links in our classroom tool.
NumPy
• The fundamental package for scientific
computing with Python.
• Lets us store our data in special multidimensional
array objects.
• Many methods for fast operations on arrays.
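A quick sketch of those array operations, using made-up budget figures (illustrative numbers only, not the real dataset):

```python
import numpy as np

# Hypothetical budgets (in millions) for four nominees -- illustrative only
budgets = np.array([1.0, 18.0, 27.0, 11.0])

# Operations apply to the whole array at once, no Python loop needed
mean_budget = budgets.mean()
doubled = budgets * 2
```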
Pandas
• The fundamental high-level building block for
doing practical, real-world data analysis in
Python.
• Built on top of NumPy.
Scikit-learn
• Python module for machine learning.
Working through the code
• The goal is to have you follow along. Towards
the end of the class, you can play around
with the code yourself
• Go to the Jupyter notebook (http://bit.ly/tf-jupyter)
• Open “Stage One.ipynb”
Initial imports and loading data with Pandas
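The loading step looks roughly like the sketch below. In the notebook you’d call pd.read_csv on the provided file; here we read from an in-memory CSV with made-up rows so the example is self-contained:

```python
import io
import pandas as pd

# Stand-in for the workshop's CSV file (hypothetical rows)
csv_data = io.StringIO(
    "title,year,rating\n"
    "Rocky,1976,PG\n"
    "Amadeus,1984,PG\n"
)
movies = pd.read_csv(csv_data)
```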
Understanding your data
• .head(n) method: returns the first n rows.
• .value_counts() method: returns the counts
of unique values in a column.
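For example, on a small illustrative frame standing in for the real dataset:

```python
import pandas as pd

# Made-up rows for illustration
movies = pd.DataFrame({
    "title": ["Rocky", "Amadeus", "The Hurt Locker", "Avatar"],
    "rating": ["PG", "PG", "R", "PG-13"],
})

first_two = movies.head(2)                        # first 2 rows
rating_counts = movies["rating"].value_counts()   # how often each rating appears
```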
Understanding your data
Processing our data: Formatting
• Ratings are in a non-numeric format. We
need to assign each rating a unique integer
so that Python can handle the information.
• We’ll do that with the .ix method (since
removed from pandas; .loc or .map is the
modern equivalent).
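In current pandas, the same formatting step can be done with .map and a rating-to-integer dictionary (the mapping below is hypothetical; any consistent assignment works):

```python
import pandas as pd

movies = pd.DataFrame({"rating": ["PG", "R", "PG", "PG-13"]})

# Hypothetical mapping from rating string to integer code
rating_codes = {"PG": 0, "PG-13": 1, "R": 2}
movies["rating_num"] = movies["rating"].map(rating_codes)
```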
Formatting your Data
Processing our Data: Cleaning
Classification vs Regression
• Regression — Predict values
• Classification — Predict categories.
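A minimal sketch of the difference, using made-up budget numbers and two scikit-learn estimators (the column names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical budgets (millions) with a continuous and a categorical target
budgets = np.array([[10.0], [20.0], [30.0], [40.0]])
revenue = np.array([25.0, 45.0, 70.0, 90.0])  # continuous value -> regression
won = np.array([0, 0, 1, 1])                  # category (win / no win) -> classification

reg = LinearRegression().fit(budgets, revenue)  # predicts values
clf = LogisticRegression().fit(budgets, won)    # predicts categories
```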
Classification
Classification
?
For our model we’ll use a decision tree
• It breaks down a dataset into smaller and
smaller subsets.
• The final result is a model with a tree
structure that has:
• Decision nodes: ask a question and have
two or more branches.
• Leaf nodes: represent a classification or
decision.
Why a decision tree?
• A decision tree is just one of many algorithms
you can use (e.g. Naive Bayes, SVM, least
squares, logistic regression).
• We’re using decision trees because they mirror
human decision making, which makes them
simpler to understand and interpret.
Creating your first Decision Tree
You will use the scikit-learn and NumPy
libraries to build your first decision tree. We
will need the following to build a decision tree:
• target: A one-dimensional NumPy array
containing the target values from the training data.
• features: A multidimensional NumPy array
containing the features/predictors from the
training data.
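Put together, the fit looks like the sketch below. The feature columns and values are made up for illustration, not the real Oscars data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical columns: [rating_code, review_score];
# target is 1 if the film won Best Picture
features = np.array([[0, 8.1], [2, 7.9], [1, 6.5], [0, 6.0]])
target = np.array([1, 1, 0, 0])

tree = DecisionTreeClassifier(random_state=0)
tree = tree.fit(features, target)

importances = tree.feature_importances_  # how much each feature mattered
accuracy = tree.score(features, target)  # accuracy on the training data
```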
Creating your first Decision Tree
Importances and Score
Problem #1: Poor Predictions
• We only got 1 out of 4 movies correct
(Amadeus, 1984)
• Open Stage 2 and let’s try to improve our
predictions!
Solution: Modify the feature list
Run the prediction again
Success!?
• We now got 3 out of 4 movies correct.
• Our model predicted Inglourious Basterds would win;
The Hurt Locker won instead.
Problem #2: Overfitting
• The resulting model is too closely tied to the training set.
• It doesn’t generalize to new data, which is
the point of prediction.
Overfitting, a visualization
In overfitting, a statistical model describes random error or
noise instead of the underlying relationship.
Solution: Random Forest Classifier
• Random Forest Classifiers use many
Decision Trees to build a classifier.
• We introduce a bit of randomness.
• Each tree can give a different answer (a
vote); the final classification is the most
common answer among the trees.
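Swapping the single tree for a forest is a one-line change. Using the same made-up features and target as the decision-tree sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical [rating_code, review_score] features; 1 = won Best Picture
features = np.array([[0, 8.1], [2, 7.9], [1, 6.5], [0, 6.0]])
target = np.array([1, 1, 0, 0])

# n_estimators is the number of trees that vote; the majority wins
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest = forest.fit(features, target)
prediction = forest.predict([[0, 8.0]])
```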
Random Forest Classifier
Creating your first Decision Tree
Importances and Score
Results
• 1976: Rocky
• 1984: Amadeus
• 1996: The English Patient
• 2009: The Hurt Locker
And our prediction for the 2016
Oscar is…
We can predict the Oscars
Except for 2017 ¯\_(ツ)_/¯
Next steps for learning
• Google (self-taught)
• Coursera
• Bootcamps
1-on-1 mentorship enables flexibility
Next steps for learning
Graduate outcomes
• Job titles after graduation
• Months until employed
Special Introductory Offer
• Prep course for 50% off — $250 instead of $500
• Covers math, stats, Python, and data science toolkit
• Option to continue into full data science program
• If you’re interested talk to me after or email me at
jasjit@thinkful.com
