Data Science
Joaquin Vanschoren

Data science

Tutorial on data science: what it's like to be a data scientist, big data, the data scientific method, probabilistic algorithms, MapReduce, sensor data analysis, visualization of Twitter and Foursquare feeds, and open source tools (R, Python, NoSQL).


  1. Data Science. Joaquin Vanschoren
  2. WHAT IS DATA SCIENCE?
  3. Hacking skills, Maths & Stats, Expertise
  4. [Venn diagram, Drew Conway] Circles: Hacking skills, Maths & Stats, Expertise. Regions: Danger zone!, Machine Learning, Research, Data Science.
  5. (data) science officer
  6. [Venn diagram, Joel Grus] Circles: Maths & Stats, Hacking skills, Expertise, Evil. Regions: danger zone!, machine learning, James Bond villain, data science, outside committee member, identity thief, thesis advisor, office mate, NSA.
  7. THE HYPE: “Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, then turn it into a rainbow with a unicorn dancing on it.” (David Coallier) “Data Scientist: The Sexiest Job of the 21st Century” (Harvard Business Review)
  8. THE REALITY • You’ll clean a lot of data. A LOT • A lot of mathematics. Get over it • Some days will be long. Get more coffee • Not everything is about Big Data • Most people don’t care about data • Spend time finding the right questions [David Coallier]
  9. BIG DATA: THE END OF THEORY? “Out with every theory of human behavior, from linguistics to sociology. Who knows why people do what they do? The point is they do it… With enough data, the numbers speak for themselves.” (Chris Anderson, WIRED)
  10. All models are wrong. But some are useful.
  11. All models are wrong, and increasingly you can succeed without them.
  12. Big data is a step forward. But our problems are not lack of access to data, but understanding them.
  13. Data Scientific Method [DJ Patil, J Elman]
  14. START WITH A QUESTION: Based on an observation
  15. START WITH A QUESTION: What (just) happened? Why did it happen? What will/could happen next?
  16. ANALYSE CURRENT DATA: Create a hypothesis
  17. CREATE FEATURES, EXPERIMENT: Test the hypothesis
  18. ANALYSE RESULTS: Won’t be pretty, repeat
  19. LET DATA FRAME THE CONVERSATION: Data gives you the what. Humans give you the why.
  20. LET DATA FRAME THE CONVERSATION: Let the dataset change your mindset
  21. CONVERSE • What data is missing? Where can we get it? • Automate data collection • Clean data, then clean it more • Visualize data: the brain sees • Merge various sources of information • Reformulate hypotheses • Reformulate questions
  22. DATA SCIENCE TOOLS
  23. We can't solve problems at the same level of thinking with which we've created them…
  24. Probabilistic algorithms: When polynomial time is just too slow
  25. [1,54,853,23,4,73,…] Have we seen 73? min H(1,54,853,…) < H(73)? min H2(1,54,853,…) < H2(73)? min H3(1,54,853,…) < H3(73)?
  26. MINHASHING, Bloom filters: Find similar documents, photos, …
  27. MapReduce
  28. [Diagram: MapReduce data flow] read data (HDFS) -> split (split 1, split 2, …, split n) -> map (worker nodes, local) -> shuffle (remote read) -> reduce (worker nodes, local) -> write data (HDFS)
  29. [Diagram] mapper node, reducer node, remote read
  30. [Diagram] Mapper, Reducer: <a,apple>, <a’,slices>. Input file, Intermediate file (local), Output file
  31. [Diagram] <p,pineapple>, <p’,slices>, <a,apple>, <o,orange>, <a’,slices>, <o’,slices>. Input file, Intermediate file, Output file
  32. [Diagram] split 0, split 1, split 2. Split: 1 mapper/split. Shuffle: 1 reducer/key(set)
  33. [Diagram: MapReduce data flow] read data (HDFS) -> split (split 1, split 2, …, split n) -> map (worker nodes, local) -> shuffle (remote read) -> reduce (worker nodes, local) -> write data (HDFS)
  34. [Diagram: MapReduce data flow] read data (HDFS) -> split (split 1, split 2, …, split n) -> map -> shuffle + parallel sort (remote read) -> reduce -> write data (HDFS). Master node assigns map/reduce jobs, reassigns if nodes fail
  35. Nearest bar: [icon]? nearest within distance d? Input graph (node,label)
  36. Nearest bar. Map: ∀ [icon], search graph with radius d -> <[icon],{[icon],distance}>. Input graph (node,label)
  37. Nearest bar. Map: ∀ [icon], search graph -> <[icon],{[icon],distance}>. Shuffle/Sort by id. Input graph (node,label)
  38. Nearest bar. Map: ∀ [icon], search graph -> <[icon],{[icon],distance}>. Input graph (node,label). Shuffle/Sort by id. Reduce: <[icon],[{[icon],distance}, {[icon],distance}]> -> min(). Output: <[icon],[icon]>, <[icon],[icon]>, marked graph
  39. Sensor data
  40. Vibration, Strain (longitudinal), Strain (transverse), Temperature
  41. NOISE
  42. MATHS: Convolution (signal, kernel, convolution). [Textbook excerpt 1] “… multiplied by 1.5 and delayed by 14 sample intervals. Evidently, we have just described in words the following definition of discrete convolution with a response function of finite duration M: $(r * s)_j \equiv \sum_{k=-M/2+1}^{M/2} s_{j-k}\, r_k$ (13.1.1). If a discrete response function is nonzero only in some range $-M/2 < k \le M/2$, where M is a sufficiently large even integer, then the response function is called a finite impulse response (FIR), and its duration is M. (Notice that we are defining M as the number of nonzero values of $r_k$; these values span a time interval of $M-1$ sampling times.) In most practical circumstances the case of finite M is the case of interest, either because the response really has a finite duration, or because we choose to truncate it at some point and approximate it by a finite-duration response function. The discrete convolution theorem is this: If a signal $s_j$ is periodic with period N, so that it is completely determined by the N values $s_0, \ldots, s_{N-1}$, then its discrete convolution with a response function of finite duration N is a member of the discrete Fourier transform pair, …” [Textbook excerpt 2] “$g * h \equiv \int_{-\infty}^{\infty} g(\tau)\, h(t-\tau)\, d\tau$ (12…). Note that $g * h$ is a function in the time domain and that $g * h = h * g$. It turns out that the function $g * h$ is one member of a simple transform pair, $g * h \Longleftrightarrow G(f)\,H(f)$, the convolution theorem (12.0…). In other words, the Fourier transform of the convolution is just the product of the individual Fourier transforms. The correlation of two functions, denoted $\mathrm{Corr}(g,h)$, is defined by $\mathrm{Corr}(g,h) \equiv \int_{-\infty}^{\infty} g(\tau+t)\, h(\tau)\, d\tau$ (12.0…). The correlation is a function of t, which is called the lag. It therefore lies in the time domain, and it turns out to be one member of the transform pair $\mathrm{Corr}(g,h) \Longleftrightarrow G(f)\,H^*(f)$, the correlation theorem (12.0…).”
  43. COMPLEXITY
  44. MATHS, AGAIN: scale space decomposition (signal, kernel 1, convolution 1, kernel 2, convolution 2)
  45. SCALE-SPACE
  46. SCALE-SPACE DECOMPOSITION: Baseline (σ64), Traffic Jams (σ16 - σ64), Slowdown (σ4 - σ16), Vehicles (σ0 - σ4)
  47. VOLUME: MapReduce. 145 sensors, 100 Hz, 5 GB/day, 2 TB/year, 50 MB/s disk I/O
  48. CONVOLUTION [Diagram: five parallel pipelines] Map: Build windows -> Shuffle -> Reduce: Convolute
  49. CONVOLUTE-ADD: Map (convolute with 0-padding), Reduce (add). Add values in overlapping regions. [Diagram: zero-padded windows A, B, C; output regions A, A+B, B, B+C, C]
  50. SEGMENTATION • You don’t need 100 Hz data for everything • Approximate signal with linear segments • Key points: 0-crossings of 1st, 2nd, 3rd derivative • Maths: derivative of smoothed signal = convolution with derivative of kernel
  51. 1st, 2nd, 3rd degree derivatives: signal, convolution, segmentation
  52. SEGMENTATION RESULT
  53. VISUALIZATION
  54. Twitter data
  55. TRACKING NEWS STORIES
  56. Geospatial data
  57. OPEN SOURCE TOOLS
  58. R (modelling, testing, prototyping) • lubridate, zoo: dates, time series • reshape2: reshape data • ggplot2: visualize data • RCurl, RJSONIO: find more data • Hmisc: miscellaneous • DMwR, mlr: machine learning • forecast: time series forecasting • garch: time series modelling • quantmod: statistical financial trading • xts: extensible time series • igraph: study networks • maptools: read and view maps
  59. PYTHON (scientific computing) • numpy: linear algebra • scipy: optimization, signal/image processing, … • scikits: toolkits for scipy • scikit-learn: machine learning toolkit • statsmodels: advanced statistical modelling • matplotlib: plotting • NLTK: natural language processing • PyBrain: more machine learning • PyMC: Bayesian inference • Pattern: Web mining • NetworkX: study networks • Pandas: easy-to-use data structures
  60. OTHER: CouchDB, D3.js
  61. @joavanschoren, joaquin.vanschoren@gmail.com
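
The "Have we seen 73?" question on the probabilistic-algorithms slides can be illustrated with a Bloom filter, which the deck names alongside min-hashing. What follows is a minimal sketch, not the author's implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the slides.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for value in [1, 54, 853, 23, 4, 73]:
    seen.add(value)

print(seen.might_contain(73))  # True: 73 was added
```

The filter answers in constant time and fixed memory regardless of how many items were added; the price is that a `True` answer can occasionally be wrong, while a `False` answer never is.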
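
The "Nearest bar" slides outline a map, shuffle/sort, reduce pipeline. The sketch below simulates that pipeline in a single process under stated assumptions: the icons lost from the slides are read as "for every bar, search the graph within radius d and emit <node, {bar, distance}>", edges have unit length, and the toy graph, `BARS` labels and function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical toy graph: adjacency lists with unit-length edges.
GRAPH = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
BARS = {"a", "e"}  # nodes labelled as bars (assumed labels)


def map_bar(bar, graph, radius):
    """Map: breadth-first search from one bar, emitting <node, (bar, distance)>."""
    distances = {bar: 0}
    queue = deque([bar])
    while queue:
        node = queue.popleft()
        yield node, (bar, distances[node])
        if distances[node] == radius:
            continue  # do not search beyond radius d
        for neighbour in graph[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)


def shuffle(pairs):
    """Shuffle/sort: group the emitted values by node id."""
    groups = defaultdict(list)
    for node, value in pairs:
        groups[node].append(value)
    return dict(sorted(groups.items()))


def reduce_node(candidates):
    """Reduce: keep the closest bar, i.e. min() over distance."""
    return min(candidates, key=lambda pair: pair[1])


def nearest_bars(graph, bars, radius):
    emitted = [pair for bar in sorted(bars) for pair in map_bar(bar, graph, radius)]
    return {node: reduce_node(values) for node, values in shuffle(emitted).items()}


print(nearest_bars(GRAPH, BARS, radius=2))
```

In a real cluster each `map_bar` call would run on a separate worker and the shuffle would be a remote read; here the three stages are ordinary function calls so the data flow stays visible.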
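
The CONVOLUTE-ADD slide (map: convolute each window with 0-padding; reduce: add values in overlapping regions) matches the overlap-add scheme. The following pure-Python sketch assumes non-overlapping input windows and a short hypothetical kernel; the function names are illustrative, and the map and reduce steps run in one process.

```python
def convolve(signal, kernel):
    """Full discrete convolution: output length len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for k, r in enumerate(kernel):
            out[i + k] += s * r
    return out


def map_window(start, window, kernel):
    """Map: convolve one window; the tail spills past the window's end."""
    return [(start + offset, value)
            for offset, value in enumerate(convolve(window, kernel))]


def reduce_add(pairs, length):
    """Reduce: add the values that land on the same output index."""
    out = [0.0] * length
    for index, value in pairs:
        out[index] += value
    return out


def convolute_add(signal, kernel, window_size):
    pairs = []
    for start in range(0, len(signal), window_size):
        pairs.extend(map_window(start, signal[start:start + window_size], kernel))
    return reduce_add(pairs, len(signal) + len(kernel) - 1)


# Hypothetical data: a short ramp and a three-tap smoothing kernel.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.25, 0.5, 0.25]
print(convolute_add(signal, kernel, window_size=3))
print(convolve(signal, kernel))
```

Because convolution is linear, the windowed result agrees with convolving the whole signal at once, which is what lets the windows be processed independently.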
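
The scale-space decomposition slide lists four bands: Vehicles (σ0 - σ4), Slowdown (σ4 - σ16), Traffic Jams (σ16 - σ64) and Baseline (σ64). One way to read this is as differences between successively smoothed copies of the signal; the sketch below follows that reading. It assumes that σ0 means the raw signal, that the kernels are Gaussians truncated at 3σ, and that edges are handled by clamping; none of these details are stated in the slides, and the test signal is hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalised Gaussian kernel truncated at 3 sigma (an assumption)."""
    radius = int(3 * sigma)
    weights = [math.exp(-0.5 * (x / sigma) ** 2) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve with a Gaussian, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [
        sum(w * signal[min(max(i + k - radius, 0), last)] for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def scale_space_bands(signal, sigmas=(4, 16, 64)):
    """Split a signal into bands between successive smoothing scales."""
    levels = [list(signal)] + [smooth(signal, s) for s in sigmas]
    bands = [[fine - coarse for fine, coarse in zip(levels[i], levels[i + 1])]
             for i in range(len(sigmas))]
    bands.append(levels[-1])  # the coarsest level is the baseline
    return bands


# Hypothetical test signal: slow drift plus a fast oscillation.
signal = [0.01 * t + math.sin(t / 2.0) for t in range(400)]
vehicles, slowdown, jams, baseline = scale_space_bands(signal)
rebuilt = [sum(parts) for parts in zip(vehicles, slowdown, jams, baseline)]
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))  # tiny: the bands telescope
```

Since each band is the difference of two adjacent levels, the bands sum back to the original signal, so nothing is lost by splitting it this way.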
