Data Science?!
what even...
David Coallier
@davidcoallier
Data Scientist

Engine Yard
And I cook..
A lot.
(n-1) items
Adapting.
Feedback.
Indifference.
Young mathematically
inclined minds
Young mathematically inclined minds

We knew everything.
First Bad Assumption.
So we asked “experts”.
Wrong Ingredients
Bad Data
Tasted like sh*t
From Our Results
We had questions.
Found Expertise
Not Online.
Data Scientific
Method
Find a Question
Your Hypothesis
Current Data

What do you have?
Features & Tests
Try it.
Analyse Results
Won’t be pretty.
Conversation

Framed. By. Data.
But....
Good Discussions
Imply good data scientists
[Venn diagram after Drew Conway, built up over six slides]
Circles: Hacking Skills, Maths & Stats, Expertise
Hacking Skills + Maths & Stats: Machine Learning
Hacking Skills + Expertise: Danger Zone!!!
Maths & Stats + Expertise: Research
All three: Data Science
Business

Don’t need an MBA
In other words.
1. Hacking
2. Maths & Stats
3. Expertise
Apply Method
Data Scientific
1. Question
2. Current Data
3. Features/Tests
4. Analyse
5. Converse
Find a Question

Let’s imagine GitHub
Upgrade Repos
Affect users as little as possible
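import csv
# csv has no read(); open the file and wrap it in csv.reader
with open('repo1.csv', newline='') as f:
    content = list(csv.reader(f))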
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \quad \text{for } k \ge 0$$
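A minimal sketch of how the Poisson formula could serve the upgrade question, assuming commits per hour are roughly Poisson-distributed. The hourly counts and the names `poisson_pmf` and `quietest_hour` are hypothetical and not from the talk; the idea is to pick the hour in which a repository is most likely to see no commits at all.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# Hypothetical data: commits observed in each hour slot over four days.
commits_by_hour = {
    3: [0, 1, 0, 0],
    10: [4, 6, 5, 7],
    15: [9, 8, 12, 10],
}

def quietest_hour(counts):
    """Return (hour, P(no commits)) for the hour most likely to be idle."""
    best = None
    for hour, samples in counts.items():
        lam = sum(samples) / len(samples)  # the sample mean estimates lambda
        p_idle = poisson_pmf(0, lam)
        if best is None or p_idle > best[1]:
            best = (hour, p_idle)
    return best

hour, p_idle = quietest_hour(commits_by_hour)
print(hour, round(p_idle, 3))
```

The estimate is only as good as the Poisson assumption; bursty commit patterns would call for a different model.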
Converse

Present Findings
Iterate

Commits aren’t key.
KPIs are key

Indicators from experience
Questions

Super Important.
Just test it..
We are Human.

Emotional Connection
What next?

Second Hypothesis.
Focus on Data

Relevant to your KPIs.
Data gives you the what
Humans give you the why
Turn Information
Into

Actionable Insight
Create Discussions
Introspection Engines
Seeing, Feeling it
The brain sees.
Not regressions
Not p-values
Not slopes
Not F-statistics
Not coefficients
Question Data

Not Visualisations.
Toolbox

What do we use?
R
Modeling, Testing, Prototyping
RStudio

The IDE
lubridate
and zoo
Dealing with Dates...
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
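The mix of formats above can be normalised along these lines. This is a sketch using Python's standard library rather than lubridate or zoo; the function name `to_utc`, the format order, and the choice to treat zone-less values as UTC are all assumptions.

```python
from datetime import datetime, timezone

# Formats tried in order. yy/mm/dd and mm/dd/yy are ambiguous for strings
# such as "03/04/13"; the first match wins, which is an assumption here.
FORMATS = [
    "%Y-%m-%d %H:%M:%S %z",  # e.g. 2013-03-20 13:54:54 +0100
    "%y/%m/%d",
    "%m/%d/%y",
    "%y-%m-%d",
    "%y/%m",
]

def to_utc(text):
    """Parse one timestamp string and return a timezone-aware UTC datetime."""
    try:
        # Unix epoch seconds, e.g. "1363784094.513425"
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: values without a zone are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError("unrecognised date format: %r" % text)

print(to_utc("1363784094.513425"))
print(to_utc("2013-03-20 13:54:54 +0100"))
```

Trying formats in a fixed order keeps the sketch short, but real data usually needs the source of each column to be known before the ambiguous cases can be trusted.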
reshape2

Reshape your Data
ggplot2

Visualise your Data
RCurl, RJSONIO
Find more Data
Hmisc

Miscellaneous useful functions
forecast

Can you guess?
garch

Generalized Autoregressive
Conditional Heteroskedasticity
quantmod

Statistical Financial Trading
getSymbols('AAPL')
barChart(AAPL)
addMACD()
xts

Extensible Time Series
igraph

Study Networks
maptools

Read & View Maps
map('state', region = c(row.names(USArrests)), col=cm.colors(16, 1)[floor(USArrests$Rape/max(USArrests$Rape)*28)], fill=T)
Python

Scientific Computing
SciPy
http://www.scipy.org
scipy.stats
scipy.stats
Descriptive Statistics
from scipy.stats import describe
s = [1,2,1,3,4,5]
print(describe(s))
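For a sense of what `describe` reports, here is a rough standard-library equivalent of the first few fields (count, min/max, mean, sample variance). It is a sketch only: the `summary` dictionary is a hypothetical name, and skewness and kurtosis, which `describe` also returns, are left out.

```python
import statistics

s = [1, 2, 1, 3, 4, 5]

summary = {
    "nobs": len(s),
    "minmax": (min(s), max(s)),
    "mean": statistics.mean(s),
    "variance": statistics.variance(s),  # sample variance, n - 1 denominator
}
print(summary)
```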
scipy.stats
Probability Distributions
Example
Poisson Distribution
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \quad \text{for } k \ge 0$$
from scipy.stats import poisson
p = poisson.pmf([1,2,3,4,1,2,3], 2)
print(p.mean())
print(p.sum())
...
NumPy
http://www.numpy.org/
NumPy
Linear Algebra
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
import numpy as np
x = np.array([ [1, 0], [0, 1] ])
# eig returns the eigenvalues first, then the eigenvectors
val, vec = np.linalg.eig(x)
np.linalg.eigvals(x)
>>> np.linalg.eig(x)
(
array([ 1., 1.]),
array([
[ 1., 0.],
[ 0., 1.]
])
)
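To see where the eigenvalues come from, the 2x2 case can be worked by hand from the characteristic polynomial. This is a sketch with a hypothetical helper, `eigenvalues_2x2`, that only handles real eigenvalues; it is not how NumPy computes them.

```python
import math

def eigenvalues_2x2(m):
    """Eigenvalues of a real 2x2 matrix with real eigenvalues.

    Solves the characteristic polynomial
    lambda^2 - trace * lambda + det = 0.
    """
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; not handled in this sketch")
    root = math.sqrt(disc)
    return ((trace + root) / 2, (trace - root) / 2)

print(eigenvalues_2x2([[1, 0], [0, 1]]))  # identity matrix from the slide
print(eigenvalues_2x2([[2, 0], [0, 3]]))
```

For the identity matrix both roots coincide at 1, which matches the `array([ 1., 1.])` shown above.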
Matplotlib

Python Plotting
statsmodels
Advanced Statistics Modeling
NLTK

Natural Language Toolkit
scikit-learn

Machine Learning
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[2., 2.]])  # array([1])
PyBrain

... Machine Learning
PyMC
Bayesian Inference
Pattern

Web Mining for Python
NetworkX

Study Networks
MILK: Machine Learning
Pandas

easy-to-use data structures
from pandas import *
x = DataFrame([
{"age": 26},
{"age": 19},
{"age": 21},
{"age": 18}
])
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
Python vs R?

Different Purposes
Dogfooding

Data Scientific Method
Original Question
What is Data Science?
Back to you

For questioning
Data Science, what even?!
Presented an abridged version of my "What is data science" talk at #websummit 2013.

This talk covers the required skill set as defined by Drew Conway in his famous Venn diagram, and outlines the Data Scientific Method introduced by Dr. Patil. The talk has two main parts; the second goes over some of the packages and technologies we use, minus the storage part.

Published in: Technology, Education

Data Science, what even?!

  1. Data Science?! what even...
  2. David Coallier @davidcoallier
  3. Data Scientist Engine Yard
  4. And I cook.. A lot.
  5. (n-1) items
  6. Adapting.
  7. Feedback.
  8. Indifference.
  9. Young mathematically inclined minds
  10. Young mathematically inclined minds We knew everything.
  11. First Bad Assumption.
  12. So we asked “experts”.
  13. Wrong Ingredients
  14. Bad Data
  15. Tasted like sh*t
  16. From Our Results We had questions.
  17. Found Expertise Not Online.
  18. Data Scientific Method
  19. Find a Question Your Hypothesis
  20. Current Data What do you have?
  21. Features & Tests Try it.
  22. Analyse Results Won’t be pretty.
  23. Conversation Framed. By. Data.
  24. But....
  25. Good Discussions Imply good data scientists
  26. Hacking Skills
  27. Hacking Skills Maths & Stats
  28. Hacking Skills Expertise Maths & Stats
  29. Hacking Skills Machine Learning Danger Zone!!! Expertise Research Maths & Stats
  30. Hacking Skills Data Science Expertise Maths & Stats
  31. Hacking Skills Danger Zone!!! Machine Learning Data Science Maths & Stats Expertise Research
  32. Business Don’t need an MBA
  33. In other words.
  34. 1. Hacking 2. Maths & Stats 3. Expertise
  35. Apply Method Data Scientific
  36. 1. Question 2. Current Data 3. Features/Tests 4. Analyse 5. Converse
  37. Find a Question Let’s imagine GitHub
  38. Upgrade Repos Affect users as little as possible
  39. import csv content = list(csv.reader(open('repo1.csv')))
  40. $f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$ for $k \ge 0$
  41. Converse Present Findings
  42. Iterate Commits aren’t key.
  43. KPIs are key Indicators from experience
  44. Questions Super Important.
  45. Just test it..
  46. We are Human. Emotional Connection
  47. What next? Second Hypothesis.
  48. Focus on Data Relevant to your KPIs.
  49. Data gives you the what Humans give you the why
  50. Turn Information
  51. Into Actionable Insight
  52. Create Discussions Introspection Engines
  53. Seeing, Feeling it The brain sees.
  54. Not regressions
  55. Not p-values
  56. Not slopes
  57. Not F-statistics
  58. Not coefficients
  59. Question Data Not Visualisations.
  60. Toolbox What do we use?
  61. R Modeling, Testing, Prototyping
  62. RStudio The IDE
  63. lubridate and zoo Dealing with Dates...
  64. yy/mm/dd mm/dd/yy YYYY-mm-dd HH:MM:ss TZ yy-mm-dd 1363784094.513425 yy/mm different timezone
  65. reshape2 Reshape your Data
  66. ggplot2 Visualise your Data
  67. RCurl, RJSONIO Find more Data
  68. Hmisc Miscellaneous useful functions
  69. forecast Can you guess?
  70. garch Generalized Autoregressive Conditional Heteroskedasticity
  71. quantmod Statistical Financial Trading
  72. getSymbols('AAPL') barChart(AAPL) addMACD()
  73. xts Extensible Time Series
  74. igraph Study Networks
  75. maptools Read & View Maps
  76. map('state', region = c(row.names(USArrests)), col=cm.colors(16, 1)[floor(USArrests$Rape/max(USArrests$Rape)*28)], fill=T)
  77. Python Scientific Computing
  78. SciPy http://www.scipy.org
  79. scipy.stats
  80. scipy.stats Descriptive Statistics
  81. from scipy.stats import describe s = [1,2,1,3,4,5] print(describe(s))
  82. scipy.stats Probability Distributions
  83. Example Poisson Distribution
  84. $f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$ for $k \ge 0$
  85. from scipy.stats import poisson p = poisson.pmf([1,2,3,4,1,2,3], 2)
  86. print(p.mean()) print(p.sum()) ...
  87. NumPy http://www.numpy.org/
  88. NumPy Linear Algebra
  89. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
  90. import numpy as np x = np.array([ [1, 0], [0, 1] ]) val, vec = np.linalg.eig(x) np.linalg.eigvals(x)
  91. >>> np.linalg.eig(x) ( array([ 1., 1.]), array([ [ 1., 0.], [ 0., 1.] ]) )
  92. Matplotlib Python Plotting
  93. statsmodels Advanced Statistics Modeling
  94. NLTK Natural Language Toolkit
  95. scikit-learn Machine Learning
  96. from sklearn import tree X = [[0, 0], [1, 1]] Y = [0, 1] clf = tree.DecisionTreeClassifier() clf = clf.fit(X, Y) clf.predict([[2., 2.]]) >>> array([1])
  97. PyBrain ... Machine Learning
  98. PyMC Bayesian Inference
  99. Pattern Web Mining for Python
  100. NetworkX Study Networks
  101. MILK: Machine Learning
  102. Pandas easy-to-use data structures
  103. from pandas import * x = DataFrame([ {"age": 26}, {"age": 19}, {"age": 21}, {"age": 18} ]) print(x[x['age'] > 20].count()) print(x[x['age'] > 20].mean())
  104. Python vs R? Different Purposes
  105. Dogfooding Data Scientific Method
  106. Original Question What is Data Science?
  107. Back to you For questioning