Data Science?!
what even...
David Coallier
@davidcoallier
Data Scientist

Engine Yard
And I cook..
A lot.
(n-1) items
Adapting.
Feedback.
Indifference.
Young mathematically inclined minds

We knew everything.
First Bad Assumption.
So we asked “experts”.
Bad Ingredients
Bad Data
Tasted like sh*t
From Our Results
We had questions.
Found Expertise
Not Online.
Data Scientific
Method
Find a Question
Your Hypothesis
Current Data

What do you have?
Features & Tests
Try it.
Analyse Results
Won’t be pretty.
Conversation

Framed. By. Data.
But....
Good Discussions
Imply good data scientists
[Venn diagram, built up over several slides. Circles: Hacking Skills, Maths & Stats, Expertise. Overlaps: Machine Learning (Hacking Skills + Maths & Stats), Research (Maths & Stats + Expertise), Danger Zone!!! (Hacking Skills + Expertise), Data Science (all three).]
Business

Don’t need an MBA
In other words.
1. Hacking
2. Maths & Stats
3. Expertise
Apply the Data Scientific Method
1. Question
2. Current Data
3. Features/Tests
4. Analyse
5. Converse
Find a Question

Let’s imagine Github
Upgrade Repos
Affect users as little as possible
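import csv
# csv has no read(); open the file and wrap it in csv.reader
with open('repo1.csv', newline='') as f:
    content = list(csv.reader(f))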
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \quad \text{for } k \ge 0$$
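One way the Poisson formula could serve the "upgrade repos, affect users as little as possible" question is sketched below. It is only an illustration: the hourly commit counts are made up, the layout of repo1.csv is not known, and the names `commits_by_hour`, `rate` and `quietest_slot` are hypothetical. The idea is to estimate λ per hour-of-day slot and pick the slot where zero commits is most likely.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical data: commits seen in each hour-of-day slot on four days.
commits_by_hour = {
    3: [0, 1, 0, 0],
    10: [4, 6, 5, 5],
    15: [7, 9, 8, 8],
}

def rate(counts):
    """Estimate lambda as the mean number of commits per slot."""
    return sum(counts) / len(counts)

def quietest_slot(by_hour):
    """Return the hour whose estimated probability of zero commits is highest."""
    return max(by_hour, key=lambda h: poisson_pmf(0, rate(by_hour[h])))

best = quietest_slot(commits_by_hour)
p_idle = poisson_pmf(0, rate(commits_by_hour[best]))
print(best, round(p_idle, 3))
```

With real data the slots would come from commit timestamps, and it would be worth checking that the counts actually look Poisson before trusting the estimate.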
Converse

Present Findings
Iterate

Commits aren’t key.
KPIs are key

Indicators from experience
Questions

Super Important.
Just test it..
We are Human.

Emotional Connection
What next?

Second Hypothesis.
Focus on Data

Relevant to your KPIs.
Data gives you the what
Humans give you the why
Turn Information
Into

Actionable Insight
Create Discussions
Introspection Engines
Seeing, Feeling it
The brain sees.
Not regressions
Not p-values
Not slopes
Not F-statistics
Not coefficients
Another Example
Fraud Engine
Features

Fraud Engine
Clusters

User Types
Machine Learning
Historical Analysis
Decision

Report as Fraudulent
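A toy sketch of how the steps above (features, clusters per user type, historical analysis, decision) could fit together. This is not the engine from the talk: the user types, the two features, the distance rule and the `tolerance` value are all assumptions made for illustration.

```python
import math

# Hypothetical history per user type: (orders per day, average order value).
HISTORY = {
    "hobbyist": [(1, 20.0), (2, 25.0), (1, 22.0), (2, 21.0)],
    "business": [(20, 300.0), (25, 320.0), (22, 310.0), (21, 290.0)],
}

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def spread(points, centre):
    """Mean Euclidean distance of the points from their centre."""
    return sum(math.dist(p, centre) for p in points) / len(points)

def is_suspicious(user_type, observation, history=HISTORY, tolerance=3.0):
    """Flag an observation far outside its user type's historical cluster.

    Features are used unscaled here; real features would need scaling first.
    """
    points = history[user_type]
    centre = centroid(points)
    return math.dist(observation, centre) > tolerance * spread(points, centre)

print(is_suspicious("hobbyist", (2, 23.0)))
print(is_suspicious("hobbyist", (30, 900.0)))
```

The flag is only the "what". Whether to report the account is still a human decision.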
Fact-Based

Decision Failing
Fact-Based

Decision Making
Action
Measure

Knowledge
Analysis
Failed.

Noetic Intelligence
Action
Measure

Knowledge
Analysis
Action
Measure

Knowledge
Analysis
Offering

Missing Feature
Toolbox

What do we use?
R
Modeling, Testing, Prototyping
RStudio

The IDE
lubridate
and zoo
Dealing with Dates...
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
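The formats above can be funnelled into a single UTC timestamp. A minimal Python sketch follows, with several assumptions: `FORMATS` and `parse_timestamp` are hypothetical names, "TZ" is taken to be a numeric offset such as +0100, values without a timezone are treated as UTC, and mm/dd/yy is left out of the default list because it cannot be told apart from yy/mm/dd by inspection alone.

```python
from datetime import datetime, timezone

# Tried in order; put the format your source really uses first.
FORMATS = ["%Y-%m-%d %H:%M:%S %z", "%y/%m/%d", "%y-%m-%d", "%y/%m"]

def parse_timestamp(raw, formats=FORMATS):
    """Return a timezone-aware UTC datetime, or raise ValueError."""
    # Epoch seconds such as 1363784094.513425
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in formats:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: values without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognised date format: {raw!r}")

print(parse_timestamp("2013-03-20 13:54:54 +0100"))
print(parse_timestamp("13/03"))
```

Passing `formats=["%m/%d/%y"]` handles the US-style variant once you know that is what the source emits.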
reshape2

Reshape your Data
ggplot2

Visualise your Data
RCurl, RJSONIO
Find more Data
Hmisc

Miscellaneous useful functions
forecast

Can you guess?
garch

Generalized Autoregressive
Conditional Heteroskedasticity
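For reference, the most commonly quoted member of the family is GARCH(1,1). The symbols below ($\omega$, $\alpha$, $\beta$, $\epsilon_t$, $z_t$, $\sigma_t^2$) are standard textbook notation, not something taken from the slides:

```latex
\epsilon_t = \sigma_t z_t, \qquad
\sigma_t^2 = \omega + \alpha\,\epsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2,
\qquad \omega > 0,\ \alpha \ge 0,\ \beta \ge 0
```

Today's variance depends on yesterday's squared shock and yesterday's variance, which is how the model captures volatility clustering.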
quantmod

Statistical Financial Trading
library(quantmod)
getSymbols('AAPL')
barChart(AAPL)
addMACD()
xts

Extensible Time Series
igraph

Study Networks
maptools

Read & View Maps
library(maps)  # map() is provided by the maps package
map('state', region = row.names(USArrests), col = cm.colors(16, 1)[ceiling(USArrests$Rape / max(USArrests$Rape) * 16)], fill = TRUE)
Python

Scientific Computing
SciPy
http://www.scipy.org
scipy.stats
Descriptive Statistics
from scipy.stats import describe
s = [1, 2, 1, 3, 4, 5]
print(describe(s))
scipy.stats
Probability Distributions
Example
Poisson Distribution
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \quad \text{for } k \ge 0$$
from scipy.stats import poisson
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
print(p.mean())
print(p.sum())
...
NumPy
http://www.numpy.org/
NumPy
Linear Algebra
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
import numpy as np
x = np.array([[1, 0], [0, 1]])
val, vec = np.linalg.eig(x)  # eig returns eigenvalues first, then eigenvectors
np.linalg.eigvals(x)         # eigenvalues only
>>> np.linalg.eig(x)
(array([ 1.,  1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
Matplotlib

Python Plotting
statsmodels
Advanced Statistics Modeling
NLTK

Natural Language Tool Kit
scikit-learn

Machine Learning
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[2., 2.]])
>>> array([1])
PyBrain

... Machine Learning
PyMC
Bayesian Inference
Pattern

Web Mining for Python
NetworkX

Study Networks
MILK: Machine Learning
Pandas

easy-to-use data structures
from pandas import DataFrame
x = DataFrame([
    {"age": 26},
    {"age": 19},
    {"age": 21},
    {"age": 18},
])
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
Python vs R?

Different Purposes
Storage
Oppose
“big” Data
Hadoop
Had - oops
Riak

Key-Value Buckets
CouchDB

Document Database
Redis

In-Memory Database
Cube

Time-series Database
PgSQL

Quite Extensively
Visualisation
Right Now
The rule of 3
Engineer

Report One
Mid-Level Mgr
Report Two
Board Level
Report Three
The Future

Discoverable Insight
d3.js

Data-Driven Documents
Dashing

Elegant Dashboards
Edward Tufte

Go read his books.
Dogfooding

Data Scientific Method
Original Question
What is Data Science?
Back to you

For questioning

Data Science, what even...