Talks about data science often focus on tools and methods. Tools matter, and it is important to stay up to date with the latest tools and technologies (they are the “how”). But data science is also about finding good problems and solving them (the “why”). Good problems are both valuable (someone really wants an answer) and tractable (there is data you can find to help you answer them). With modern technologies, more problems are tractable than ever before: lots of data is freely available on the internet, and sensors make it easy to collect ad hoc data sets. This talk emphasizes the importance of asking “why” by exploring a few examples from Datascope’s client work and our various side projects where, if we hadn’t asked “why”, the outcomes would have been far less useful.
1. berlin pydata | @gabegaster | 2015 february
what is data science?
5.
what is data science?
who is a data scientist?
review of literature
12.
what is data science?
who is a data scientist?
“a scientist who can code”
• lower barrier to attack new problems
• repeatable analysis
• freedom to think about problems in new ways
15.
what is data science?
using emerging technologies to approach
problems scientifically
which were difficult to answer before
24.
computing has progressed
cost of new analysis: years in 1950, hours today
same person thinking about the problem can conduct experiments to answer it
31.
computing has progressed
open-source code
standing on shoulders of giants
reinventing the wheel
43.
what is data science?
using emerging technologies to approach problems scientifically which were difficult to answer before
knowing what is possible (HOW)
doing something useful (WHY)
using new / good / the right tools
asking why
46.
why why why
what is data science?
science is about asking why
start there
66.
goal: save money
task: find needle in the haystack (without poking yourself)
about patent / not about patent
turn over to plaintiff / don’t turn over to plaintiff
adverse inference
give away trade secrets
85.
task: classify schizophrenia with MRI
why? improve understanding of disease
how? … outside contest purview
90.
why? outside contest purview
kaggle
getting data & making it usable
WHY
92.
timeline of contest
AUC
Accuracy of Classification
106.
AUC
what is AUC? Area Under Curve
what curve? Receiver Operating Characteristic
balances: True Positive Rate, False Positive Rate
why? …
upshot: in practice, choice of metric matters a LOT
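To make the AUC slides concrete, here is a minimal sketch in plain Python. It builds the Receiver Operating Characteristic curve by sweeping a decision threshold, then takes the area under it with the trapezoid rule. The labels and scores are invented toy data, not the contest data, and the function names are illustrative only. The two toy "models" have the same accuracy at a 0.5 threshold but very different AUC, which is one way the choice of metric can matter.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    Assumes labels are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep the threshold from the highest score down to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Items with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(labels, scores):
    """Area under the ROC curve, using the trapezoid rule."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions when scores >= threshold mean class 1."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy, imbalanced data (hypothetical): 2 positives, 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Ranks every positive above every negative, but never crosses 0.5.
ranker_scores = [0.45, 0.40, 0.30, 0.25, 0.20, 0.20, 0.15, 0.10, 0.10, 0.05]
# Gives everyone the same score: no ranking information at all.
constant_scores = [0.0] * 10

for name, scores in [("ranker", ranker_scores), ("constant", constant_scores)]:
    print(name, "accuracy:", accuracy(labels, scores), "AUC:", auc(labels, scores))
```

Both toy models predict "negative" for everyone at the 0.5 threshold, so accuracy cannot tell them apart; AUC rewards the one that ranks positives first.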
111.
timeline of contest
goal: depends on why
Accuracy of Classification
AUC
random guess
basic SVM
115.
timeline of contest
Accuracy of Classification
AUC
me
turned out to place 9th because of overfitting
very common problem
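One common way overfitting shows up in contests is tuning against a small public test set. The sketch below is a simulation under stated assumptions, not a reconstruction of what happened in this contest: every "submission" is a coin flip with no real signal, the public and private set sizes are made up, and `simulate_leaderboard` is a hypothetical helper. Picking the best of many submissions on the small public set still produces a score that looks good there and does not carry over to the larger private set.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def simulate_leaderboard(n_submissions=200, n_public=50, n_private=1000, seed=0):
    """Pick the submission with the best public score; report both of its scores.

    All sizes are illustrative assumptions. Every submission is pure noise.
    """
    rng = random.Random(seed)
    public_labels = [rng.randint(0, 1) for _ in range(n_public)]
    private_labels = [rng.randint(0, 1) for _ in range(n_private)]

    best_public = -1.0
    private_of_best = None
    for _ in range(n_submissions):
        # A "model" with no signal: random guesses on both sets.
        public_preds = [rng.randint(0, 1) for _ in range(n_public)]
        private_preds = [rng.randint(0, 1) for _ in range(n_private)]
        public_score = accuracy(public_preds, public_labels)
        if public_score > best_public:
            best_public = public_score
            private_of_best = accuracy(private_preds, private_labels)
    return best_public, private_of_best


public_score, private_score = simulate_leaderboard()
print("best public score:", public_score)
print("same submission, private score:", private_score)
```

The gap between the two printed scores comes entirely from selection on a small sample, which is why a held-out set that was never used for choosing models is the more trustworthy number.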
116.
timeline of contest
Accuracy of Classification
worth it?
AUC
126.
an example
just for fun
129.
Chicago Bike Share System
kind of like call-a-bike
Show what I like about bike share
Think about how bike share has changed geography
130.
a typical trip for me
131.
Bus transit times = a LIE
132.
Chicago is a grid city
134.
Public transit on the grid + diagonals = difficult
2+ buses = FAIL
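The diagonal problem can be sketched with a little arithmetic. This is an idealized model, not a description of Chicago's actual network: it assumes a perfect grid where each bus line runs straight along a single street, and the helper names are hypothetical. Under those assumptions, any trip that changes both coordinates needs at least two bus lines, and the grid path is longer than the straight line a bike could roughly follow.

```python
import math


def grid_distance(dx, dy):
    """Distance travelled along the street grid (Manhattan distance)."""
    return abs(dx) + abs(dy)


def straight_line_distance(dx, dy):
    """Distance as the crow flies (Euclidean distance)."""
    return math.hypot(dx, dy)


def min_bus_lines(dx, dy):
    """Fewest bus lines needed, assuming each line follows one straight street."""
    if dx == 0 and dy == 0:
        return 0
    # A trip along a single street needs one line; a diagonal trip needs a transfer.
    return 1 if dx == 0 or dy == 0 else 2


# A diagonal trip: 3 blocks east and 4 blocks north (toy numbers).
print("grid distance:", grid_distance(3, 4))
print("straight-line distance:", straight_line_distance(3, 4))
print("bus lines needed:", min_bus_lines(3, 4))
```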
135.
Adding bikes to public transit = win
137.
viz goal:
show how Divvy has changed where people can go
show where people actually go