License: CC Attribution License


- 1. Data Science. Joaquin Vanschoren
- 2. WHAT IS DATA SCIENCE?
- 3. Hacking skills · Maths & Stats · Expertise
- 4. The three circles: Expertise, Maths & Stats, Hacking skills. Hacking + Maths & Stats = Machine Learning; Maths & Stats + Expertise = Research; Hacking + Expertise = Danger zone!; all three together = Data Science [Drew Conway]
- 5. (data) science officer
- 6. Maths & Stats, Hacking skills, Evil, Expertise: danger zone!, machine learning, James Bond villain, data science, outside committee member, identity thief, thesis advisor, office mate, NSA [Joel Grus]
- 7. THE HYPE. “Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, then turn it into a rainbow with a unicorn dancing on it.” – David Coallier. “Data Scientist: The Sexiest Job of the 21st Century” – Harvard Business Review
- 8. THE REALITY • You’ll clean a lot of data. A LOT • A lot of mathematics. Get over it • Some days will be long. Get more coffee • Not everything is about Big Data • Most people don’t care about data • Spend time finding the right questions [David Coallier]
- 9. BIG DATA: THE END OF THEORY? – Chris Anderson, WIRED. Out with every theory of human behavior, from linguistics to sociology. Who knows why people do what they do? The point is they do it… With enough data, the numbers speak for themselves.
- 10. All models are wrong. But some are useful.
- 11. All models are wrong, and increasingly you can succeed without them.
- 12. Big data is a step forward. But our problem is not lack of access to data; it is understanding the data we have.
- 13. Data Scientific Method [DJ Patil, J Elman]
- 14. START WITH A QUESTION Based on an observation
- 15. START WITH A QUESTION: What (just) happened? Why did it happen? What will/could happen next?
- 16. ANALYSE CURRENT DATA Create a Hypothesis
- 17. CREATE FEATURES, EXPERIMENT Test Hypothesis
- 18. ANALYSE RESULTS Won’t be pretty, repeat
- 19. LET DATA FRAME THE CONVERSATION Data gives you the what Humans give you the why
- 20. LET DATA FRAME THE CONVERSATION Let the dataset change your mindset
- 21. CONVERSE • What data is missing? Where can we get it? • Automate data collection • Clean data, then clean it more • Visualize data: the brain sees • Merge various sources of information • Reformulate hypotheses • Reformulate questions
- 22. DATA SCIENCE TOOLS
- 23. We can't solve problems at the same level of thinking with which we've created them…
- 24. Probabilistic algorithms When polynomial time is just too slow
- 25. [1,54,853,23,4,73,…] Have we seen 73? min H(1,54,853,…) < H(73) ? min H2(1,54,853,…) < H2(73) ? min H3(1,54,853,…) < H3(73) ?
- 26. MINHASHING, Bloom filters: find similar documents, photos, …
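The minhash check sketched on the previous slides (compare minimum hash values under several hash functions) can be illustrated in a few lines. This is a hypothetical toy version, not code from the talk: each of `num_hashes` salted hash functions keeps its minimum over a document's word set, and the fraction of matching signature slots estimates the Jaccard similarity of the two sets.

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the
    minimum hash value seen over the item set."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy documents as word sets (made up for illustration).
doc1 = {"the", "quick", "brown", "fox"}
doc2 = {"the", "quick", "brown", "dog"}
sig1 = minhash_signature(doc1)
sig2 = minhash_signature(doc2)
print(estimated_jaccard(sig1, sig2))  # near the true Jaccard, 3/5
```

Because identical sets always produce identical signatures, near-duplicate detection only needs to compare the short signatures, not the documents themselves.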
- 27. MapReduce
- 28. [Diagram: read data (HDFS) → split1, split2, … splitn → map on worker nodes (local) → shuffle (remote read) → reduce on worker nodes (local) → write data (HDFS)]
- 29. [Diagram: mapper node → remote read → reducer node]
- 30. Mapper: <a, apple> → <a’, slices>. Input file → intermediate file (local) → Reducer → output file
- 31. <p, pineapple> → <p’, slices>; <a, apple> → <a’, slices>; <o, orange> → <o’, slices>. Input file → intermediate file → output file
- 32. split 0, split 1, split 2: one mapper per split, one reducer per key(set), with a shuffle in between
- 33. [Diagram, repeated: read data (HDFS) → split1, split2, … splitn → map on worker nodes (local) → shuffle (remote read) → reduce on worker nodes (local) → write data (HDFS)]
- 34. [Diagram: read data (HDFS) → split1, split2, … splitn → map → shuffle + parallel sort (remote read) → reduce → write data (HDFS). The master node assigns map/reduce jobs and reassigns them if nodes fail]
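The read → split → map → shuffle → reduce flow on these slides can be mimicked on a single machine. A toy word count, with the shuffle done by grouping on key, as the framework would do across nodes:

```python
from collections import defaultdict

def map_phase(split):
    # Mapper: emit <word, 1> for every word in its input split.
    for line in split:
        for word in line.split():
            yield (word, 1)

def shuffle(mapped):
    # Shuffle: group intermediate values by key
    # (in a real cluster this is the remote read between nodes).
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each key.
    return {word: sum(counts) for word, counts in groups.items()}

splits = [["big data is big"], ["data is data"]]
mapped = [kv for split in splits for kv in map_phase(split)]
print(reduce_phase(shuffle(mapped)))  # {'big': 2, 'data': 3, 'is': 2}
```

The mappers never see each other's splits, which is what lets the same code scale out: only keys, not raw data, cross node boundaries in the shuffle.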
- 35. Nearest bar? Nearest within distance d? Input: graph of (node, label) pairs
- 36. Nearest bar. Map: ∀ bar, search the graph with radius d, emitting <node, {bar, distance}>. Input: graph (node, label)
- 37. Nearest bar. Map: ∀ bar, search the graph, emitting <node, {bar, distance}>; Shuffle/sort by node id. Input: graph (node, label)
- 38. Nearest bar. Map: ∀ bar, search graph → <node, {bar, distance}>; Shuffle/sort by id; Reduce: <node, [{bar, distance}, {bar, distance}]> -> min(). Output: <node, nearest bar> pairs, a marked graph
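A single-machine sketch of the nearest-bar job above, on made-up toy data (the graph and the bar names `b1`, `b2` are invented): each mapper runs a bounded breadth-first search from one bar and emits (node, (bar, distance)) pairs, the shuffle groups them by node, and the reducer keeps the minimum distance per node.

```python
from collections import deque, defaultdict

# Toy street graph as adjacency lists; "b1" and "b2" are the bars.
graph = {"a": ["b", "c"], "b": ["a", "b1"], "c": ["a", "d"],
         "d": ["c", "b2"], "b1": ["b"], "b2": ["d"]}
bars = {"b1", "b2"}

def map_bar(bar, radius):
    # Mapper: BFS outward from one bar, emitting <node, (bar, distance)>
    # for every node reached within `radius` hops.
    dist = {bar: 0}
    queue = deque([bar])
    while queue:
        node = queue.popleft()
        if dist[node] < radius:
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
    return [(node, (bar, d)) for node, d in dist.items()]

# Shuffle: group candidate (bar, distance) pairs by node id.
groups = defaultdict(list)
for bar in bars:
    for node, pair in map_bar(bar, radius=3):
        groups[node].append(pair)

# Reducer: keep the closest bar per node -> the "marked graph".
nearest = {n: min(pairs, key=lambda p: p[1]) for n, pairs in groups.items()}
print(nearest["a"])  # ('b1', 2)
```

Nodes outside every bar's radius simply never receive a key, matching the "nearest within distance d" variant on slide 35.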
- 39. Sensor data
- 40. Vibration Strain (longitudinal) Strain (transverse) Temperature
- 41. NOISE
- 42. MATHS: convolution of a signal with a kernel. Discrete convolution with a response function of finite duration $M$:
  $$(r * s)_j \equiv \sum_{k=-M/2+1}^{M/2} s_{j-k}\, r_k \qquad (13.1.1)$$
  If a discrete response function is nonzero only in some range $-M/2 < k \le M/2$, where $M$ is a sufficiently large even integer, then the response function is called a finite impulse response (FIR), and its duration is $M$. (Notice that we are defining $M$ as the number of nonzero values of $r_k$; these values span a time interval of $M-1$ sampling times.) In most practical circumstances the case of finite $M$ is the case of interest, either because the response really has a finite duration, or because we choose to truncate it at some point and approximate it by a finite-duration response function. The discrete convolution theorem is this: if a signal $s_j$ is periodic with period $N$, so that it is completely determined by the $N$ values $s_0, \ldots, s_{N-1}$, then its discrete convolution with a response function of finite duration $N$ is a member of the discrete Fourier transform pair. In the continuous case,
  $$g * h \equiv \int_{-\infty}^{\infty} g(\tau)\, h(t - \tau)\, d\tau,$$
  so that $g * h$ is a function in the time domain and $g * h = h * g$. It turns out that $g * h$ is one member of a simple transform pair, the convolution theorem:
  $$g * h \iff G(f)\, H(f).$$
  In other words, the Fourier transform of the convolution is just the product of the individual Fourier transforms. The correlation of two functions, denoted $\mathrm{Corr}(g, h)$, is defined by
  $$\mathrm{Corr}(g, h) \equiv \int_{-\infty}^{\infty} g(\tau + t)\, h(\tau)\, d\tau.$$
  The correlation is a function of $t$, which is called the lag. It therefore lies in the time domain, and it is one member of the transform pair known as the correlation theorem:
  $$\mathrm{Corr}(g, h) \iff G(f)\, H^*(f).$$
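Both the discrete convolution and the convolution theorem on this slide can be checked numerically. A small numpy sketch (the signal and kernel are made-up examples, not data from the talk):

```python
import numpy as np

# Discrete convolution (r*s)_j = sum_k s_{j-k} r_k with a short FIR kernel.
signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])   # simple smoothing response

smoothed = np.convolve(signal, kernel, mode="same")
print(smoothed)  # each impulse is spread over its neighbours

# Convolution theorem: FFT both, multiply, inverse FFT.
# Zero-padding to the full output length makes the circular
# convolution of the DFT equal the linear convolution.
n = len(signal) + len(kernel) - 1
via_fft = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)
full = np.convolve(signal, kernel, mode="full")
print(np.allclose(via_fft, full))  # True
```

For long signals the FFT route is what makes convolution cheap: O(n log n) instead of the O(nM) of the direct sum.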
- 43. COMPLEXITY
- 44. MATHS, AGAIN: scale-space decomposition [diagram: signal; kernel 1 → convolution 1; kernel 2 → convolution 2]
- 45. SCALE-SPACE
- 46. SCALE-SPACE DECOMPOSITION: Baseline (σ64), Traffic Jams (σ16–σ64), Slowdown (σ4–σ16), Vehicles (σ0–σ4)
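The σ-bands on this slide amount to smoothing the signal with Gaussians of increasing width and keeping the differences between successive scales. A sketch with scipy; the sensor trace here is synthetic (a slow baseline, one bump, plus noise), and the σ values echo the slide:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 1000)
# Invented stand-in for a sensor trace.
signal = (np.sin(0.5 * t) + np.exp(-((t - 5) ** 2))
          + 0.1 * rng.standard_normal(t.size))

# Smooth at increasing scales; each band is the difference
# between two successive smoothing levels.
sigmas = [0, 4, 16, 64]
smoothed = [gaussian_filter1d(signal, s) if s else signal for s in sigmas]
bands = [a - b for a, b in zip(smoothed, smoothed[1:])]  # σ0–σ4, σ4–σ16, σ16–σ64
baseline = smoothed[-1]                                  # σ64

# The sum telescopes: bands + baseline reconstruct the signal exactly.
print(np.allclose(sum(bands) + baseline, signal))  # True
```

Because the decomposition is a telescoping difference, nothing is lost: each band isolates structure at one scale (vehicles, slowdowns, jams, baseline) and their sum gives back the raw signal.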
- 47. VOLUME: 145 sensors at 100 Hz ≈ 5 GB/day ≈ 2 TB/year, against 50 MB/s disk I/O → MapReduce
- 48. CONVOLUTION as MapReduce: each Map builds windows, the Shuffle distributes them, and each Reduce convolutes its windows
- 49. CONVOLUTE-ADD: Map convolutes each chunk with 0-padding; Reduce adds the values in the overlapping regions (… A, A+B, B, B+C, C …)
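The convolute-add step can be verified on one machine: convolve each zero-padded chunk independently (the map), add the overlapping tails back together (the reduce), and compare with a direct convolution. A sketch on made-up data:

```python
import numpy as np

def convolute_add(signal, kernel, chunk):
    """Overlap-add: convolve fixed-size chunks independently (map),
    then add the overlapping regions back together (reduce)."""
    out = np.zeros(len(signal) + len(kernel) - 1)
    for start in range(0, len(signal), chunk):
        piece = np.convolve(signal[start:start + chunk], kernel)
        out[start:start + len(piece)] += piece  # A, A+B, B, B+C, C ...
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
k = rng.standard_normal(32)
print(np.allclose(convolute_add(x, k, chunk=128), np.convolve(x, k)))  # True
```

Each chunk's convolution only needs its own data plus the kernel, so the map tasks are fully independent; only the short overlapping tails (kernel length minus one samples) need to be combined in the reduce.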
- 50. SEGMENTATION • You don’t need 100Hz data for everything • Approximate signal with linear segments • Key points: 0-crossings of 1st, 2nd, 3rd derivative • Maths: derivative of smoothed signal = convolution with derivative of kernel
- 51. [Diagram: signal → convolution → 1st, 2nd, 3rd degree derivatives → segmentation]
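The key-point idea from slide 50 (zero crossings of the derivatives of the smoothed signal, computed by convolving with the derivative of the kernel) can be sketched with scipy, whose `gaussian_filter1d` takes an `order` argument selecting the Gaussian derivative. The noisy sine below is an invented stand-in for the sensor data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(2)
t = np.linspace(0, 4 * np.pi, 2000)
signal = np.sin(t) + 0.05 * rng.standard_normal(t.size)

# Derivative of the smoothed signal = convolution with the derivative
# of the kernel; order=1,2 selects the 1st/2nd Gaussian derivative.
d1 = gaussian_filter1d(signal, sigma=20, order=1)
d2 = gaussian_filter1d(signal, sigma=20, order=2)

def zero_crossings(x):
    # Indices where the sign changes: candidate segment key points.
    return np.nonzero(np.diff(np.sign(x)) != 0)[0]

# Extrema of the smoothed signal appear as zero crossings of d1,
# inflection points as zero crossings of d2.
print(len(zero_crossings(d1)), len(zero_crossings(d2)))
```

Joining successive key points with straight lines yields the linear-segment approximation of slide 50, so the downstream analysis no longer needs the full 100 Hz stream.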
- 52. SEGMENTATION RESULT
- 53. VISUALIZATION
- 54. Twitter data
- 55. TRACKING NEWS STORIES
- 56. Geospatial data
- 57. OPEN SOURCE TOOLS
- 58. R: modelling, testing, prototyping. lubridate, zoo: dates, time series; reshape2: reshape data; ggplot2: visualize data; RCurl, RJSONIO: find more data; Hmisc: miscellaneous; DMwR, mlr: machine learning; forecast: time series forecasting; garch: time series modelling; quantmod: statistical financial trading; xts: extensible time series; igraph: study networks; maptools: read and view maps
- 59. PYTHON: scientific computing. numpy: linear algebra; scipy: optimization, signal/image processing, …; scikits: toolkits for scipy; scikit-learn: machine learning toolkit; statsmodels: advanced statistical modelling; matplotlib: plotting; NLTK: natural language processing; PyBrain: more machine learning; PyMC: Bayesian inference; Pattern: web mining; NetworkX: study networks; pandas: easy-to-use data structures
- 60. OTHER: CouchDB, D3.js
- 61. @joavanschoren joaquin.vanschoren@gmail.com
