6. “[LinkedIn] was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink — and you probably leave early.”
— LinkedIn Manager, June 2006
Example: LinkedIn, 2006
7. ➔ Joined LinkedIn in 2006, when it had only 8M users (450M by 2016)
➔ Started experiments to predict people’s networks
➔ Engineers were dismissive: “you can already import your address book”
Enter: Data Scientist
8. ➔ Frame the question
➔ Collect the raw data
➔ Process the data
➔ Explore the data
➔ Communicate results
The Process: LinkedIn Example
9. ➔ What questions do we want to answer?
◆ Who?
◆ What?
◆ When?
◆ Where?
◆ Why?
◆ How?
Case: Frame the Question
10. ➔ What connections (type and number) lead to higher user engagement?
➔ Which connections do people want to make but are currently limited from making?
➔ How might we predict these types of connections with limited data from the user?
Case: Frame the Question
11. ➔ What data do we need to answer these questions?
Case: Collect the Data
12. ➔ Connection data (who is connected to whom?)
➔ Demographic data (what is the profile of the connection?)
➔ Engagement data (how do they use the site?)
Case: Collect the Data
13. ➔ How is the data “dirty”, and how can we clean it?
Case: Process the Data
14. ➔ User input
➔ Redundancies
➔ Feature changes
➔ Data model changes
Case: Process the Data
15. ➔ What are the meaningful patterns in the data?
Case: Explore the Data
17. ➔ How do we communicate this?
➔ To whom?
Case: Communicate Findings
18. ➔ Marketing - sell X more ad space, resulting in X more impressions per day
➔ Product - build X more features
➔ Development - grow our team by X
➔ Sales - attract X more premium accounts
➔ C-Level - more revenue; user growth from 8M to 450M in 10 years
Case: Communicate Findings
22. ➔ Our model is going to be a Decision Tree
➔ Decision Trees predict the most likely outcome based on input features
➔ Like a computer building its own version of 20 Questions
The Model
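The “20 Questions” analogy can be made concrete with a minimal sketch (toy data invented for illustration): from four labeled points, a DecisionTreeClassifier learns a single yes/no threshold question and uses it to classify new inputs.

```python
from sklearn import tree

# Toy data (invented for illustration): one numeric feature, two classes.
X = [[0], [1], [2], [3]]
y = ['no', 'no', 'yes', 'yes']

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

# The tree learns one "question" (roughly: is the feature greater than 1.5?)
# and answers it for unseen inputs.
print(clf.predict([[0.5], [2.5]]))  # → ['no' 'yes']
```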
24. ➔ We’ll build this model in Colaboratory, a Google-hosted Python notebook environment
➔ Go to: Colab.research.google.com
➔ Add the Colaboratory extension to your G Suite
The Notebook
25. ➔ Go to: bit.ly/glf-dt
➔ Click File
➔ Select Save a Copy in Drive
➔ This is your personal copy of the notebook; let’s get started!
The Notebook
26. from sklearn import tree
import pandas as pd
➔ Import the tree module from the scikit-learn (SKLearn) Python package
➔ Import Pandas to create our DataFrame
➔ bit.ly/sklearn-python
Code Block 1
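The deck jumps from Code Block 1 to Code Block 2, so the cell that builds golf_df is not shown. A sketch of what it presumably contains, assuming the classic 14-row “play golf” weather dataset (the ‘Play’ values given in Code Block 2 match that dataset row for row):

```python
import pandas as pd

# Reconstruction of the unshown golf_df cell, assuming the classic
# "play golf" weather dataset; column names match Code Block 3.
golf_df = pd.DataFrame()
golf_df['Outlook'] = ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy',
                      'overcast', 'sunny', 'sunny', 'rainy', 'sunny',
                      'overcast', 'overcast', 'rainy']
golf_df['Temperature'] = ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
                          'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild']
golf_df['Humidity'] = ['high', 'high', 'high', 'high', 'normal', 'normal',
                       'normal', 'high', 'normal', 'normal', 'normal', 'high',
                       'normal', 'high']
golf_df['Windy'] = ['false', 'false', 'false', 'false', 'false', 'true',
                    'true', 'false', 'false', 'false', 'true', 'true',
                    'false', 'true']
print(golf_df.shape)  # → (14, 4)
```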
28. golf_df['Play'] = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no',
'yes', 'yes', 'yes', 'yes', 'yes', 'no']
golf_df
➔ This is our target data
➔ Print the data frame to make sure our table looks accurate
Code Block 2
29. one_hot_data = pd.get_dummies(golf_df[['Outlook', 'Temperature', 'Humidity', 'Windy']])
one_hot_data
➔ SKLearn’s tree needs numerical data
➔ To make this conversion we use a method called One Hot Encoding
➔ Print one_hot_data to see the newly encoded data frame
Code Block 3
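One Hot Encoding turns each category into its own 0/1 indicator column. A small self-contained sketch (two toy rows invented for illustration) of what pd.get_dummies produces:

```python
import pandas as pd

# Two rows, two categorical columns (invented for illustration).
df = pd.DataFrame({'Outlook': ['sunny', 'overcast'],
                   'Windy':   ['true', 'false']})

# Each category value becomes its own indicator column.
encoded = pd.get_dummies(df)
print(list(encoded.columns))
# → ['Outlook_overcast', 'Outlook_sunny', 'Windy_false', 'Windy_true']
print(encoded.astype(int).values.tolist())  # → [[0, 1, 0, 1], [1, 0, 1, 0]]
```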
30. clf = tree.DecisionTreeClassifier()
clf_train = clf.fit(one_hot_data, golf_df['Play'])
➔ We create an empty DecisionTreeClassifier and assign it to the variable clf
➔ We fit the decision tree with our one_hot_data and our target feature ‘Play’, and assign the fitted classifier to clf_train
Code Block 4
31. print(tree.export_graphviz(clf_train, None))
➔ SKLearn automatically creates our Decision Tree questions for us (Example: Do I play golf when it is overcast?)
➔ Paste the returned string from the print statement into: webgraphviz.com
Code Block 4
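When its out_file argument is None, export_graphviz returns the tree as DOT-format text rather than writing a file; that string is what gets pasted into a Graphviz renderer such as webgraphviz.com. A minimal sketch (toy data invented for illustration):

```python
from sklearn import tree

# Tiny toy tree (invented for illustration).
X = [[0], [1]]
y = ['no', 'yes']
clf = tree.DecisionTreeClassifier().fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot = tree.export_graphviz(clf, out_file=None)
print(dot[:12])  # → digraph Tree
```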
32. # sunny, hot, normal, true
prediction = clf_train.predict([[0, 0, 1, 0, 1, 0, 0, 1, 0, 1]])
prediction
➔ Now we give our inputs, in the same binary format
➔ We use the one_hot_data columns as our guide to encode a test input
➔ Print our prediction
Code Block 5
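Putting the blocks together, an end-to-end sketch (assuming the classic 14-row golf dataset for the unshown golf_df cell); on this tiny, fully separable dataset the fully grown tree reproduces every training label:

```python
from sklearn import tree
import pandas as pd

# Classic "play golf" dataset (assumed; the golf_df cell isn't shown in the deck).
golf_df = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy',
                'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast',
                'overcast', 'rainy'],
    'Temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
                    'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild'],
    'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
                 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high'],
    'Windy': ['false', 'false', 'false', 'false', 'false', 'true', 'true',
              'false', 'false', 'false', 'true', 'true', 'false', 'true'],
})
golf_df['Play'] = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no',
                   'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# One-hot encode the four categorical features, then fit the tree.
one_hot_data = pd.get_dummies(golf_df[['Outlook', 'Temperature', 'Humidity', 'Windy']])
clf_train = tree.DecisionTreeClassifier().fit(one_hot_data, golf_df['Play'])

# A fully grown tree separates this small dataset perfectly.
train_acc = (clf_train.predict(one_hot_data) == golf_df['Play']).mean()
print(train_acc)  # → 1.0
```

Note that the 10 one-hot columns here are exactly the 10 binary positions fed to predict() in Code Block 5.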
33. Our model has a few weaknesses:
➔ Limited inputs
➔ Assumptions
Shortcomings
35. ➔ Start with Python and Statistics
➔ Personal Program Manager
➔ Unlimited Q&A Sessions
➔ Student Slack Community
➔ bit.ly/freetrial-ds
Thinkful Two-Week Free Trial