6. “[LinkedIn] was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink — and you probably leave early.”
— LinkedIn Manager, June 2006
Example: LinkedIn, 2006
7. ➔ Joined LinkedIn in 2006, when it had only 8M users (450M by 2016)
➔ Started experiments to predict people’s networks
➔ Engineers were dismissive: “you can already import your address book”
Enter: Data Scientist
8. ➔ Frame the question
➔ Collect the raw data
➔ Process the data
➔ Explore the data
➔ Communicate results
The Process: LinkedIn Example
9. ➔ What questions do we want to answer?
◆ Who?
◆ What?
◆ When?
◆ Where?
◆ Why?
◆ How?
Case: Frame the Question
10. ➔ What connections (type and number) lead to higher user engagement?
➔ Which connections do people want to make but are currently limited from making?
➔ How might we predict these types of connections with limited data from the user?
Case: Frame the Question
11. ➔ What data do we need to answer these questions?
Case: Collect the Data
12. ➔ Connection data (who is connected to whom?)
➔ Demographic data (what is the profile of the connection?)
➔ Engagement data (how do they use the site?)
Case: Collect the Data
13. ➔ How is the data “dirty”, and how can we clean it?
Case: Process the Data
14. ➔ User input
➔ Redundancies
➔ Feature changes
➔ Data model changes
Case: Process the Data
15. ➔ What are the meaningful patterns in the data?
Case: Explore the Data
17. ➔ How do we communicate this?
➔ To whom?
Case: Communicate Findings
18. ➔ Marketing - sell X more ad space, resulting in X more impressions per day
➔ Product - build X more features
➔ Development - grow our team by X
➔ Sales - attract X more premium accounts
➔ C-Level - more revenue; user growth from 8M to 450M in 10 years
Case: Communicate Findings
22. ➔ Our model is going to be a Decision Tree
➔ Decision Trees predict the most likely outcome based on input features
➔ Like a computer building its own version of 20 Questions
The Model
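The “20 Questions” analogy can be made concrete with a minimal sketch (toy data invented for illustration): from four labeled points, a DecisionTreeClassifier learns a single yes/no threshold question and uses it to classify new inputs.

```python
from sklearn import tree

# Toy data (invented for illustration): one numeric feature, two classes.
X = [[0], [1], [2], [3]]
y = ['no', 'no', 'yes', 'yes']

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

# The tree learns one "question" (roughly: is the feature greater than 1.5?)
# and answers it for unseen inputs.
print(clf.predict([[0.5], [2.5]]))  # → ['no' 'yes']
```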
24. ➔ We’ll build this model in Colaboratory, a Google-hosted Python notebook environment
➔ Go to: Colab.research.google.com
➔ Add the Colaboratory extension to your G Suite
The Notebook
25. ➔ Go to: bit.ly/glf-dt
➔ Click File
➔ Select Save a Copy in Drive
➔ This is your personal copy of the notebook; let’s get started!
The Notebook
26. from sklearn import tree
import pandas as pd
➔ Import the tree module from the scikit-learn (SKLearn) Python package
➔ Import Pandas to create our DataFrame
➔ bit.ly/sklearn-python
Code Block 1
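The deck jumps from Code Block 1 to Code Block 2, so the cell that builds golf_df is not shown. A sketch of what it presumably contains, assuming the classic 14-row “play golf” weather dataset (the ‘Play’ values given in Code Block 2 match that dataset row for row):

```python
import pandas as pd

# Reconstruction of the unshown golf_df cell, assuming the classic
# "play golf" weather dataset; column names match Code Block 3.
golf_df = pd.DataFrame()
golf_df['Outlook'] = ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy',
                      'overcast', 'sunny', 'sunny', 'rainy', 'sunny',
                      'overcast', 'overcast', 'rainy']
golf_df['Temperature'] = ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
                          'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild']
golf_df['Humidity'] = ['high', 'high', 'high', 'high', 'normal', 'normal',
                       'normal', 'high', 'normal', 'normal', 'normal', 'high',
                       'normal', 'high']
golf_df['Windy'] = ['false', 'false', 'false', 'false', 'false', 'true',
                    'true', 'false', 'false', 'false', 'true', 'true',
                    'false', 'true']
print(golf_df.shape)  # → (14, 4)
```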
28. golf_df['Play'] = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no',
'yes', 'yes', 'yes', 'yes', 'yes', 'no']
golf_df
➔ This is our target data
➔ Print the data frame to make sure our table looks accurate
Code Block 2
29. one_hot_data = pd.get_dummies(golf_df[['Outlook', 'Temperature', 'Humidity', 'Windy']])
one_hot_data
➔ SKLearn’s tree needs numerical data
➔ To make this conversion we use a method called One Hot Encoding
➔ Print one_hot_data to see the newly encoded data frame
Code Block 3
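One Hot Encoding turns each category into its own 0/1 indicator column. A small self-contained sketch (two toy rows invented for illustration) of what pd.get_dummies produces:

```python
import pandas as pd

# Two rows, two categorical columns (invented for illustration).
df = pd.DataFrame({'Outlook': ['sunny', 'overcast'],
                   'Windy':   ['true', 'false']})

# Each category value becomes its own indicator column.
encoded = pd.get_dummies(df)
print(list(encoded.columns))
# → ['Outlook_overcast', 'Outlook_sunny', 'Windy_false', 'Windy_true']
print(encoded.astype(int).values.tolist())  # → [[0, 1, 0, 1], [1, 0, 1, 0]]
```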
30. clf = tree.DecisionTreeClassifier()
clf_train = clf.fit(one_hot_data, golf_df['Play'])
➔ We create an empty DecisionTreeClassifier and assign it to the variable clf
➔ We fit the decision tree with our one_hot_data and our target feature ‘Play’, and assign the fitted classifier to clf_train
Code Block 4
31. print(tree.export_graphviz(clf_train, None))
➔ SKLearn automatically creates our Decision Tree questions for us (Example: Do I play golf when it is overcast?)
➔ Paste the returned string from the print statement into: webgraphviz.com
Code Block 4
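When its out_file argument is None, export_graphviz returns the tree as DOT-format text rather than writing a file; that string is what gets pasted into a Graphviz renderer such as webgraphviz.com. A minimal sketch (toy data invented for illustration):

```python
from sklearn import tree

# Tiny toy tree (invented for illustration).
X = [[0], [1]]
y = ['no', 'yes']
clf = tree.DecisionTreeClassifier().fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot = tree.export_graphviz(clf, out_file=None)
print(dot[:12])  # → digraph Tree
```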
32. # sunny, hot, normal, true
prediction = clf_train.predict([[0, 0, 1, 0, 1, 0, 0, 1, 0, 1]])
prediction
➔ Now we give our inputs, in the same binary format
➔ We use the one_hot_data columns as our guide to encode a test input
➔ Print our prediction
Code Block 5
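Putting the blocks together, an end-to-end sketch (assuming the classic 14-row golf dataset for the unshown golf_df cell); on this tiny, fully separable dataset the fully grown tree reproduces every training label:

```python
from sklearn import tree
import pandas as pd

# Classic "play golf" dataset (assumed; the golf_df cell isn't shown in the deck).
golf_df = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy',
                'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast',
                'overcast', 'rainy'],
    'Temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
                    'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild'],
    'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
                 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high'],
    'Windy': ['false', 'false', 'false', 'false', 'false', 'true', 'true',
              'false', 'false', 'false', 'true', 'true', 'false', 'true'],
})
golf_df['Play'] = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no',
                   'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# One-hot encode the four categorical features, then fit the tree.
one_hot_data = pd.get_dummies(golf_df[['Outlook', 'Temperature', 'Humidity', 'Windy']])
clf_train = tree.DecisionTreeClassifier().fit(one_hot_data, golf_df['Play'])

# A fully grown tree separates this small dataset perfectly.
train_acc = (clf_train.predict(one_hot_data) == golf_df['Play']).mean()
print(train_acc)  # → 1.0
```

Note that the 10 one-hot columns here are exactly the 10 binary positions fed to predict() in Code Block 5.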
33. Our model has a few weaknesses:
➔ Limited inputs
➔ Assumptions
Shortcomings
35. ➔ Start with Python and Statistics
➔ Personal Program Manager
➔ Unlimited Q&A Sessions
➔ Student Slack Community
➔ bit.ly/freetrial-ds
Thinkful Two-Week Free Trial