Predicting the Oscars with Data Science
http://bit.ly/tf-predict-oscars
About me
• Jasjit Singh
• Self-taught developer
• Worked in finance & tech
• Co-Founder Hotspot
• Thinkful General Manager
About us
Thinkful prepares students for web development &
data science jobs with 1-on-1 mentorship programs
What’s your background?
• I have a software background
• I have a math or stats background
• None of the above
Learning goals
• “Data Science Process”
• Python’s data science toolkit and methods
• Basic machine learning concepts
Data Science Process
• Frame the question.
• Collect the raw data.
• Process the data.
• Explore the data.
• Communicate results.
Frame the question
• Who will win the Oscar for Best Picture?
Collect the Data
• What kind of data do we need?
Collect the Data
• Financial data (Budget, box office…)
• Reviews, ratings and scores.
• Awards and nominations.
Process the data
• How’s the data “dirty” and how can we fix it?
Process the data
• User input, redundancies, missing data…
• Formatting: adapt the data to meet certain
specifications.
• Cleaning: detecting and correcting
corrupt or inaccurate records.
Explore the data
• What are the meaningful patterns in the
data?
• How meaningful is each data point for our
predictions?
Communicate the Data
Jupyter Notebooks
• One of a data scientist’s everyday tools.
• Lets us show our work: readers can read the story,
follow the process, and run the code.
• Find the links in our classroom tool.
• Find the links in our classroom tool.
NumPy
• The fundamental package for scientific
computing with Python.
• Lets us store our data in special multidimensional
array objects.
• Many methods for fast operations on arrays.
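A quick sketch of those array operations, using made-up budget figures (illustrative numbers only, not the real dataset):

```python
import numpy as np

# Hypothetical budgets (in millions) for four nominees -- illustrative only
budgets = np.array([1.0, 18.0, 27.0, 11.0])

# Operations apply to the whole array at once, no Python loop needed
mean_budget = budgets.mean()
doubled = budgets * 2
```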
Pandas
• The fundamental high-level building block for
doing practical, real-world data analysis in
Python.
• Built on top of NumPy.
Scikit-learn
• Python module for machine learning.
Working through the code
• The goal is to have you follow along. Towards
the end of the class, you can play around
with the code yourself
• Go to the Jupyter notebook (http://bit.ly/tf-jupyter)
• Open “Stage One.ipynb”
Initial imports and loading data with Pandas
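The loading step looks roughly like the sketch below. In the notebook you’d call pd.read_csv on the provided file; here we read from an in-memory CSV with made-up rows so the example is self-contained:

```python
import io
import pandas as pd

# Stand-in for the workshop's CSV file (hypothetical rows)
csv_data = io.StringIO(
    "title,year,rating\n"
    "Rocky,1976,PG\n"
    "Amadeus,1984,PG\n"
)
movies = pd.read_csv(csv_data)
```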
Understanding your data
• .head(n) method: returns the first n rows.
• .value_counts() method: returns the counts
of unique values in a column.
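For example, on a small illustrative frame standing in for the real dataset:

```python
import pandas as pd

# Made-up rows for illustration
movies = pd.DataFrame({
    "title": ["Rocky", "Amadeus", "The Hurt Locker", "Avatar"],
    "rating": ["PG", "PG", "R", "PG-13"],
})

first_two = movies.head(2)                        # first 2 rows
rating_counts = movies["rating"].value_counts()   # how often each rating appears
```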
Understanding your data
Processing our data: Formatting
• Ratings are in a non-numeric format. We
need to assign each rating a unique integer
so that Python can handle the information.
• We’ll do that with the .ix method (since
removed from pandas; .loc or .map is the
modern equivalent).
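In current pandas, the same formatting step can be done with .map and a rating-to-integer dictionary (the mapping below is hypothetical; any consistent assignment works):

```python
import pandas as pd

movies = pd.DataFrame({"rating": ["PG", "R", "PG", "PG-13"]})

# Hypothetical mapping from rating string to integer code
rating_codes = {"PG": 0, "PG-13": 1, "R": 2}
movies["rating_num"] = movies["rating"].map(rating_codes)
```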
Formatting your Data
Processing our Data: Cleaning
Classification vs Regression
• Regression — Predict values
• Classification — Predict categories.
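A minimal sketch of the difference, using made-up budget numbers and two scikit-learn estimators (the column names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical budgets (millions) with a continuous and a categorical target
budgets = np.array([[10.0], [20.0], [30.0], [40.0]])
revenue = np.array([25.0, 45.0, 70.0, 90.0])  # continuous value -> regression
won = np.array([0, 0, 1, 1])                  # category (win / no win) -> classification

reg = LinearRegression().fit(budgets, revenue)  # predicts values
clf = LogisticRegression().fit(budgets, won)    # predicts categories
```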
Classification
Classification
?
For our model we’ll use a decision tree
• It breaks down a dataset into smaller and
smaller subsets.
• The final result is a model with a tree
structure that has:
• Decision nodes: ask a question and have
two or more branches.
• Leaf nodes: represent a classification or
decision.
Why a decision tree?
• A decision tree is just one of many algorithms
you can use (e.g. Naive Bayes, SVM, least
squares, logistic regression).
• We’re using decision trees because they mirror
human decision making, which makes them
simpler to understand and interpret.
Creating your first Decision Tree
You will use the scikit-learn and NumPy
libraries to build your first decision tree. We
will need the following to build a decision tree:
• target: A one-dimensional NumPy array
containing the target values from the training data.
• features: A multidimensional NumPy array
containing the features/predictors from the
training data.
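Put together, the fit looks like the sketch below. The feature columns and values are made up for illustration, not the real Oscars data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical columns: [rating_code, review_score];
# target is 1 if the film won Best Picture
features = np.array([[0, 8.1], [2, 7.9], [1, 6.5], [0, 6.0]])
target = np.array([1, 1, 0, 0])

tree = DecisionTreeClassifier(random_state=0)
tree = tree.fit(features, target)

importances = tree.feature_importances_  # how much each feature mattered
accuracy = tree.score(features, target)  # accuracy on the training data
```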
Creating your first Decision Tree
Importances and Score
Problem #1: Poor Predictions
• We only got 1 out of 4 movies correct
(Amadeus, 1984)
• Open Stage 2 and let’s try to improve our
predictions!
Solution: Modify the feature list
Run the prediction again
Success!?
• We now got 3 out of 4 movies correct.
• Our model predicted Inglourious Basterds would win;
The Hurt Locker won instead.
Problem #2: Overfitting
• The resulting model is too closely tied to the training set.
• It doesn’t generalize to new data, which is
the point of prediction.
Overfitting, a visualization
In overfitting, a statistical model describes random error or
noise instead of the underlying relationship.
Solution: Random Forest Classifier
• Random Forest Classifiers use many
Decision Trees to build a classifier.
• We introduce a bit of randomness.
• Each tree can give a different answer (a
vote); the final classification is the most
common answer among the trees.
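Swapping the single tree for a forest is a one-line change. Using the same made-up features and target as the decision-tree sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical [rating_code, review_score] features; 1 = won Best Picture
features = np.array([[0, 8.1], [2, 7.9], [1, 6.5], [0, 6.0]])
target = np.array([1, 1, 0, 0])

# n_estimators is the number of trees that vote; the majority wins
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest = forest.fit(features, target)
prediction = forest.predict([[0, 8.0]])
```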
Random Forest Classifier
Creating your first Decision Tree
Importances and Score
Results
• 1976: Rocky
• 1984: Amadeus
• 1996: The English Patient
• 2009: The Hurt Locker
And our prediction for the 2016
Oscar is…
We can predict the Oscars
Except for 2017 ¯\_(ツ)_/¯
Next steps for learning
• Google (self-taught)
• Coursera
• Bootcamps
1-on-1 mentorship enables flexibility
Next steps for learning
Graduate outcomes
• Job titles after graduation
• Months until employed
Special Introductory Offer
• Prep course for 50% off — $250 instead of $500
• Covers math, stats, Python, and data science toolkit
• Option to continue into full data science program
• If you’re interested talk to me after or email me at
jasjit@thinkful.com
