Data Science?!
what even...
David Coallier
@davidcoallier
Data Scientist

Engine Yard
And I cook..
A lot.
(n-1) items
Adapting.
Feedback.
Indifference.
Young mathematically
inclined minds
Young mathematically inclined minds

We knew everything.
First Bad Assumption.
So we asked “experts”.
Bad Ingredients
Bad Data
Tasted like sh*t
From Our Results
We had questions.
Found Expertise
Not Online.
Data Scientific
Method
Find a Question
Your Hypothesis
Current Data

What do you have?
Features & Tests
Try it.
Analyse Results
Won’t be pretty.
Conversation

Framed. By. Data.
But....
Good Discussions
Imply good data scientists
[Venn diagram, built up one slide at a time]
Hacking Skills
Hacking Skills, Maths & Stats
Hacking Skills, Expertise, Maths & Stats
Hacking Skills, Machine Learning, Danger Zone!!!, Expertise, Research, Maths & Stats
Hacking Skills, Data Science, Expertise, Maths & Stats
Hacking Skills, Danger Zone!!!, Machine Learning, Data Science, Maths & Stats, Expertise, Research
Business

Don’t need an MBA
In other words.
1. Hacking
2. Maths & Stats
3. Expertise
Apply Method
Data Scientific
1. Question
2. Current Data
3. Features/Tests
4. Analyse
5. Converse
Find a Question

Let’s imagine GitHub
Upgrade Repos
Affect users as little as possible
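import csv
# csv has no read(); open the file and wrap it in csv.reader
with open('repo1.csv', newline='') as f:
    content = list(csv.reader(f))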
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad \text{for } k \ge 0$$
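
The slides do not say how the repository data was modelled, so the following is only a sketch of one way the Poisson formula could serve the "affect users as little as possible" question: treat commits per hour as Poisson, estimate λ as the sample mean, and prefer the hour in which zero commits is most likely. The counts, `commits_by_hour`, and `quiet_probability` are hypothetical names and made-up numbers, not data from the talk.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Hypothetical commits-per-hour samples for one repository, keyed by hour of day.
# These numbers are invented for illustration only.
commits_by_hour = {
    3: [0, 1, 0, 0, 1],
    14: [4, 6, 5, 3, 7],
}


def quiet_probability(counts):
    """Estimate lam as the sample mean, then return P(no commits in one hour)."""
    lam = sum(counts) / len(counts)
    return poisson_pmf(0, lam)


# Pick the hour with the highest estimated chance of seeing no commits.
best_hour = max(commits_by_hour, key=lambda h: quiet_probability(commits_by_hour[h]))
print(best_hour, round(quiet_probability(commits_by_hour[best_hour]), 3))
```

Whether a Poisson model fits real commit traffic is itself a hypothesis to test, which is the point of the method.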
Converse

Present Findings
Iterate

Commits aren’t key.
KPIs are key

Indicators from experience
Questions

Super Important.
Just test it..
We are Human.

Emotional Connection
What next?

Second Hypothesis.
Focus on Data

Relevant to your KPIs.
Data gives you the what
Humans give you the why
Turn Information
Into

Actionable Insight
Create Discussions
Introspection Engines
Seeing, Feeling it
The brain sees.
Not regressions
Not p-values
Not slopes
Not F-statistics
Not coefficients
Another Example
Fraud Engine
Features

Fraud Engine
Clusters

User Types
Machine Learning
Historical Analysis
Decision

Report as Fraudulent
Fact-Based

Decision Failing
Fact-Based

Decision Making
Action
Measure

Knowledge
Analysis
Failed.

Noetic Intelligence
Action
Measure

Knowledge
Analysis
Action
Measure

Knowledge
Analysis
Offering

Missing Feature
Toolbox

What do we use?
R
Modeling, Testing, Prototyping
RStudio

The IDE
lubridate
and zoo
Dealing with Dates...
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
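
The talk handles these formats with lubridate and zoo in R. As a rough Python analogue (not the author's code), the sketch below tries each layout in turn and normalises everything to UTC. The order of `FORMATS`, the numeric-offset form of the time zone, and the choice to treat values without a zone as UTC are all assumptions.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order. The order is an assumption: an ambiguous
# string such as "01/02/03" is read by the first layout that fits.
FORMATS = [
    "%Y-%m-%d %H:%M:%S %z",  # assumes the zone is a numeric offset such as +0100
    "%y/%m/%d",
    "%m/%d/%y",
    "%y-%m-%d",
    "%y/%m",
]


def parse_timestamp(raw):
    """Return a timezone-aware UTC datetime for a date string or epoch seconds."""
    # Epoch seconds such as 1363784094.513425 parse as a plain float.
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: values without a zone are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognised date: {raw!r}")


for sample in ["13/03/20", "03/20/13", "2013-03-20 13:54:54 +0100", "1363784094.513425"]:
    print(sample, "->", parse_timestamp(sample).isoformat())
```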
reshape2

Reshape your Data
ggplot2

Visualise your Data
RCurl, RJSONIO
Find more Data
Hmisc

Miscellaneous useful functions
forecast

Can you guess?
garch

Generalized Autoregressive
Conditional Heteroskedasticity
quantmod

Statistical Financial Trading
getSymbols('AAPL')
barChart(AAPL)
addMACD()
xts

Extensible Time Series
igraph

Study Networks
maptools

Read & View Maps
map('state', region = c(row.names(USArrests)), col=cm.colors(16, 1)[floor(USArrests$Rape/max(USArrests$Rape)*28)], fill=T)
Python

Scientific Computing
SciPy
http://www.scipy.org
scipy.stats
scipy.stats
Descriptive Statistics
from scipy.stats import describe
s = [1, 2, 1, 3, 4, 5]
print(describe(s))
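
For readers without SciPy to hand, here is a standard-library sketch of the main figures `describe` reports for the same list. The `summary` dictionary is a hypothetical stand-in; SciPy's result also includes skewness and kurtosis, which are left out here.

```python
import statistics

s = [1, 2, 1, 3, 4, 5]

# Mirror the first four fields of scipy.stats.describe using the stdlib only.
summary = {
    "nobs": len(s),
    "minmax": (min(s), max(s)),
    "mean": statistics.mean(s),
    "variance": statistics.variance(s),  # sample variance, n - 1 denominator
}
print(summary)
```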
scipy.stats
Probability Distributions
Example
Poisson Distribution
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad \text{for } k \ge 0$$
from scipy.stats import poisson
# pmf evaluated at k = 1, 2, 3, 4, 1, 2, 3 with λ = 2
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
print(p.mean())
print(p.sum())
...
NumPy
http://www.numpy.org/
NumPy
Linear Algebra
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
import numpy as np
x = np.array([[1, 0], [0, 1]])
# eig returns the eigenvalues first, then the eigenvectors
val, vec = np.linalg.eig(x)
np.linalg.eigvals(x)
>>> np.linalg.eig(x)
(
array([ 1., 1.]),
array([
[ 1., 0.],
[ 0., 1.]
])
)
Matplotlib

Python Plotting
statsmodels
Advanced Statistics Modeling
NLTK

Natural Language Tool Kit
scikit-learn

Machine Learning
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[2., 2.]])
>>> array([1])
PyBrain

... Machine Learning
PyMC
Bayesian Inference
Pattern

Web Mining for Python
NetworkX

Study Networks
MILK: Machine Learning
Pandas

easy-to-use data structures
from pandas import DataFrame
x = DataFrame([
    {"age": 26},
    {"age": 19},
    {"age": 21},
    {"age": 18}
])
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
Python vs R?

Different Purposes
Storage
Oppose
“big” Data
Hadoop
Had - oops
Riak

Key-Value Buckets
CouchDB

Document Database
Redis

In-Memory Database
Cube

Time-series Database
PgSQL

Quite Extensively
Visualisation
Right Now
The rule of 3
Engineer

Report One
Mid-Level Mgr
Report Two
Board Level
Report Three
The Future

Discoverable Insight
d3.js

Data-Driven Documents
The Future

Discoverable Insight
Dashing

Elegant Dashboards
Edward Tufte

Go read his books.
Dogfooding

Data Scientific Method
Original Question
What is Data Science?
Back to you

For questioning

A set of slides from my closing keynote at DamnData. I go over the concept of the Data Scientific Method and the skills required of a data scientist, from hacking to maths and stats to expertise to business knowledge. I also talk about some ideas we worked on, some tools and technologies we use, and the most important part: questioning the data.

Published in: Technology, Education
