- 1. Big Data [sorry] Data Science: What Does a Data Scientist Do? Carlos Somohano Founder Data Science London @ds_ldn datasciencelondon.org The Cloud and Big Data: HDInsight on Azure London 25/01/13
- 2. Man on the Moon – 1969
- 3. Man on the Moon – Small Data! Computer Program Apollo XI Man on the Moon Date: 1969 Speed: 3,500 km/hour Distance: 356,000 km 64 Kb, 2 Kb RAM, Fortran Weight: 13,500 kg Never been there before Must work 1st time Lots of complex data Must return to Earth
- 4. Apollo XI, 1969: 64 Kb. SkyDive Stratos, 2012: Tens of Gigabytes. Think About It – We live in Crazy Times!
- 5. Big Data is not about Data Volume
- 6. What is Big Data? IT mumbo-jumbo A fashionable term typically used by some IT vendors to remarket old-fashioned software and hardware
- 7. What is Big Data? The n-Vs Volume … Variety … Velocity … (add your own V here…) So What?
- 8. Change! Water Cooler Chat We need to parallelize data operations but it’s too costly and complex… The business can’t get access to all the relevant data, we need external data… We can’t match customer master data to live customer interactions… We can’t just force everything into a star-schema… These BI reports and charts don’t tell us anything we didn’t know… We are missing the ETL window, the data we needed didn’t arrive on time… We can’t predict with confidence if we can’t explore data and develop our own models
- 9. What is Big Data? Force of Change Big Data forces you to change the way you collect, store, manage, analyze and visualize data
- 10. Crude Oil
- 11. Big Data = Crude Oil [not New Oil] Think of data as ‘crude oil.’ Big Data is about extracting the ‘crude oil,’ transporting it in ‘mega-tankers,’ siphoning it through ‘pipelines,’ and storing it in massive ‘silos’… All ‘this’ is about IT Big Data… fine and well… … BUT
- 12. You need to refine the ‘crude oil’ Enter Data Science…
- 13. The Science [and Art] of… Discovering what we don’t know from data Obtaining predictive, actionable insight from data Creating Data Products that have business impact now Communicating relevant business stories from data Building confidence in decisions that drive business value
- 14. Brief History of Data Science 6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism… 1974 – Peter Naur @UoC, Datalogy & Data Science; 2001 – William S. Cleveland @CSU, Data Science: An Action Plan …; 2002 – Committee on Data for Science and Technology (CODATA); 2003 – Journal of Data Science; 2009 – Jeff Hammerbacher @Facebook, What does a Data Scientist Do?; 2010 – Drew Conway @NYU, The Data Science Venn Diagram; 2010 – Hilary Mason & Chris Wiggins @Dataists; 2010 – Mike Loukides @O’Reilly, “What is Data Science?”; 2011 – DJ Patil @LinkedIn, data scientist vs. data analyst
- 15. Jeff Hammerbacher, 2009 “... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
- 16. Mike Loukides, 2010 Data science enables the creation of data products. Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.
- 17. Hilary Mason & Chris Wiggins, 2010 Data science is clearly a blend of the hackers’ arts, statistics and machine learning...; and the expertise in mathematics and the domain of the data for the analysis to be interpretable... It requires creative decisions and open-mindedness in a scientific context.
- 19. DJ Patil, 2011 “We realized that as our organizations grew, we both had to figure out what to call the people on our teams. ‘Business analyst’ and ‘Data analyst’ seemed too limiting. The focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”
- 20. What is a Data Scientist?
- 21. The Duck – Billed Platypus The Data Scientist – Billed Platypus
- 22. The Platypus – Billed Data Scientist Machine Learning Hacking Statistics Math Visualization Science Programming Data Mining The Data Scientist – Billed Platypus
- 23. Josh Wills, 2012
- 24. Class DataScientist { Is skeptical, curious. Has inquisitive mind Knows Machine Learning, Statistics, Probability Applies Scientific Method. Runs Experiments Is good at Coding & Hacking Able to deal with IT & Data Engineering Knows how to build data products Able to find answers to known unknowns Tells relevant business stories from data Has Domain Knowledge }
- 25. What Does a Data Scientist Do?
- 26. 10 Things [most] Data Scientists Do 1 Ask Good Questions. What is What… …we don’t know? …we’d like to know? 2 Define and Test a Hypothesis. Run experiments 3 Scoop, Scrape, Sink, Sample Business Relevant Data 4 Munge and Wrestle Data. Tame Data 5 Explore Data, Discover Data Playfully. Discover unknowns. 6 Model Data. Model Algorithms. 7 Understand Data Relationships 8 Tell the Machine How to Learn from Data 9 Create Data Products that Deliver Actionable Insight 10 Tell Relevant Business Stories from Data
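
Item 2 above (define and test a hypothesis, run experiments) can be sketched with a permutation test. This is only one possible way to test a hypothesis, not a method named in the slides; the function name `permutation_test` and the conversion data are hypothetical and purely illustrative.

```python
import random

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the share of random label shufflings whose absolute
    difference of means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical experiment: 1 = visitor converted, 0 = did not.
control = [1] * 20 + [0] * 180   # 10% conversion (made-up data)
variant = [1] * 40 + [0] * 160   # 20% conversion (made-up data)

p_value = permutation_test(control, variant)
print(f"p-value: {p_value:.4f}")
```

A small p-value says the observed difference would be rare if the two groups really behaved the same; it does not by itself say the difference matters to the business.
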
- 27. [Sort of a] Data Scientist Toolkit § Java, R, Python… (bonus: Clojure, Haskell, Scala) § Hadoop, HDFS & MapReduce… (bonus: Spark, Storm) § HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog) § ETL, Web scrapers, Flume, Sqoop… (bonus: Hume) § SQL, RDBMS, DW, OLAP… § KNIME, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas) § D3.js, Gephi, ggplot2, Tableau, Flare, Shiny… § SPSS, Matlab, SAS… (the enterprise man) § NoSQL, MongoDB, Couchbase, Cassandra… § And Yes! … MS-Excel: the most used, most underrated DS tool
- 28. Foundations of Data Science
- 29. [Some] Data Science Principles 1 Socio-Technical Systems (STS) are complex! 2 Data is never at rest 3 Data is dirty, deal with it 4 SVoT = LOL! 5 Data munging & data wrestling: 70% of the time 6 Simplification. Reduction. Distillation 7 Curiosity. Empiricism. Skepticism
- 30. Knowns & Unknowns There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know. Donald Rumsfeld
- 31. DIKUW FTW! D I K U W: Data | Information | Knowledge | Understanding | Wisdom (PAST → FUTURE). Roles: Data Engineer, Data Analyst, Data Miner, Data Scientist. Raw | What | How to | Why | When. Numbers | Description | Experience | Cause & Effect | Prediction. Letters | Context | Tested | Proven | What’s best. Symbols | Relationship | Instruction | Known Unknowns | Unknown Unknowns. Known Knowns. Signals | Reports | Programs | Models.
- 32. Data Discovery Data Analyst Data Scientist The new reality for Business Intelligence and Big Data, Applied Data Labs
- 33. Data Models vs. Algorithmic Models. Data Modeling: Y ← F(X, random noise, parameters); We understand the world; How well ‘my data model’ works; Statisticians, Data Analysts, Data Miners; Linear Regression, Logistic Regression; Known Distributions; Confidence Intervals; Goodness of Fit. VS. Algorithmic Modeling: Y ← Black Box ← X (Random Forests); We don’t understand the world; The world produces data in a black-box; Data Scientists; Machine Learning, AI, Neural Nets, Random Forests, SVM, GBT; Unknown Multivariate Distributions; Iterative; Predictor Variables; Predictive Accuracy. “Statistical Modeling: The Two Cultures” Leo Breiman, 2001
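
A minimal sketch of the two cultures contrasted above, under assumptions that are not in the slides: the synthetic "world", the one-variable least-squares line (standing in for data modeling) and the k-nearest-neighbours predictor (standing in for the black box) are illustrative choices only. The data model is judged by goodness of fit on the data it was fitted to, the algorithmic model by predictive accuracy on held-out data.

```python
import math
import random

random.seed(0)

def world(x):
    """Hypothetical world: y depends on x non-linearly, plus noise."""
    return math.sin(x) + random.gauss(0, 0.1)

xs = [random.uniform(0, 6) for _ in range(200)]
data = [(x, world(x)) for x in xs]
train, test = data[:150], data[150:]

def fit_line(points):
    """Data modeling: assume y = a + b*x + noise and estimate a, b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(points, k=5):
    """Algorithmic modeling: no assumed form, average the k nearest examples."""
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, points):
    """Mean squared prediction error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def r_squared(model, points):
    """Share of variance in y explained by the model on these points."""
    my = sum(y for _, y in points) / len(points)
    ss_tot = sum((y - my) ** 2 for _, y in points)
    ss_res = sum((model(x) - y) ** 2 for x, y in points)
    return 1 - ss_res / ss_tot

line = fit_line(train)
knn = fit_knn(train)

print("line: goodness of fit (R^2 on training data):", round(r_squared(line, train), 3))
print("line: held-out MSE:", round(mse(line, test), 4))
print("kNN:  held-out MSE:", round(mse(knn, test), 4))
```

On this made-up world the straight line is simply the wrong assumption, so the black box predicts better; on data that really is linear the comparison could go the other way.
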
- 34. Learning from Data is Tricky Statistical vs. Machine Learning Supervised vs. Unsupervised Learning Induction vs. Deduction Sampling Confidence Intervals Probability Distribution Deviation Variance Correlation vs. Causation Causation Prediction
- 35. More Data or Better Models? More Data Beats Better Algorithms, Omar Tawakol @BlueKai Better Algorithms Beat More Data, Mark Torrance @RocketFuel More Data or Better Models, Xavier Amatriain @Netflix On Chomsky and the Two Cultures of Statistical Learning, Peter Norvig @Google Specialist Knowledge is Useless and Unhelpful, Jeremy Howard @Kaggle
- 36. Data Science Process – An approach
- 37. Data Science Process - I 1 Known Unknowns? 2 We’d like to know…? 3 Outcomes? 4 What Data? 5 Hypothesis? The World: Product Manufactured, Goods Shipped, Product Purchased, Phone Calls Made, Energy Consumed, Fraud Committed, Repair Requested, System. Ingest Raw Data: Transactions, Web-Scraping, Web-clicks & logs, Sensor Data, Mobile Data, Docs, Emails, XLS, Social Feeds, RSS, Flume Sink, HDFS. Munch Data: MapReduce, ETL, ELT, Data Wrangle, Data Cleansing, Data Jujitsu, Dim Reduction, Sample, Select, Join, Bind. The Dataset: Independency? Correlation? Covariance? Causality? Dimensionality? Missing Values? Relevant?
- 38. Data Science Process - II The Dataset Explore Data Represent Data Discover Data Deliver Insight Learn From Data Data Product Visualize Insight Description Inference Objectives Data Algorithm Models Levers Actionable Machine Learning Modeling Predictive Networks Graphs Simulation Immediate Impact Regression Prediction Optimization Business Value Classification Clustering Visualization Easy to explain Experiments Iteration
- 39. What is a Data Product?
- 40. A Data Product Is… … Curated and crafted from raw data … A result of exploration and iterations … A machine that learns from data … An answer to known unknowns or unknown unknowns … A mechanism that triggers immediate business value … A probabilistic window of future events or behavior
- 41. Data Jiu-Jitsu Data Jiu Jitsu Fight $$$$ Data Product Data Scientist Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value (DJ Patil @LinkedIn)
- 42. Developing Data Products Objectives: What Outcome Am I Trying to Achieve? Levers: What Inputs Can We Control? Data: What Data Can We Collect? Models: How the Levers Influence the Objectives. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
- 43. Objective-Based Data Products What Outcome Am I Trying to Achieve? Data, Modeler, Simulator, Optimizer, Actionable Outcome. The Model Assembly Line. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
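
The modeler, simulator and optimizer stages can be sketched in a few lines. Everything concrete here is an assumption for illustration and not from the slides or the Drivetrain article: the objective (revenue), the lever (price), the made-up observations, the linear demand model, and the function names `fit_demand`, `simulate_revenue` and `optimize_price`.

```python
# Hypothetical data: (price charged, units sold). Made up for illustration.
observations = [(8.0, 120.0), (9.0, 110.0), (10.0, 100.0), (11.0, 90.0), (12.0, 80.0)]

def fit_demand(points):
    """Modeler: least-squares line units = intercept + slope * price."""
    n = len(points)
    mean_p = sum(p for p, _ in points) / n
    mean_u = sum(u for _, u in points) / n
    slope = (sum((p - mean_p) * (u - mean_u) for p, u in points)
             / sum((p - mean_p) ** 2 for p, _ in points))
    intercept = mean_u - slope * mean_p
    return intercept, slope

def simulate_revenue(price, intercept, slope):
    """Simulator: expected revenue if the lever (price) is set to this value."""
    units = max(0.0, intercept + slope * price)
    return price * units

def optimize_price(intercept, slope, candidates):
    """Optimizer: pick the lever setting with the best simulated outcome."""
    return max(candidates, key=lambda p: simulate_revenue(p, intercept, slope))

intercept, slope = fit_demand(observations)
candidates = [5.0 + 0.5 * i for i in range(21)]   # prices 5.0 .. 15.0
best_price = optimize_price(intercept, slope, candidates)
print("best price:", best_price)
print("simulated revenue:", simulate_revenue(best_price, intercept, slope))
```

The point of the split is that each stage answers a different question: the modeler describes how the lever moves the outcome, the simulator tries settings, and the optimizer chooses one.
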
- 44. 5 Great Data Products
- 45. Customer Lifecycle Value Optimize CLV Product Recommendations Visualizer Data Modeler Simulator Optimizer 1 Products the customer may like 2 Price Elasticity 3 Probability of Purchase w/o Recommendation 4 Purchase Sequence 5 Causality Model 6 Patience Model Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
- 46. Automated Fruits Procurement Confirm Purchase Orders In less than 2 hours. 12,000 stores, 300 Fruits, Avg. Shelf life 3 days. Safety Stock levels? Demand vs. Stock? Price vs. Demand? Anomalies? Fruit Shortages? Fruit Write-offs? Adapted from Blueyonder
- 47. Strawberries & the Weather No sales vs X,XXX sales predicted Why these huge stock write-offs? A Predictive Model that calculates strawberry purchases based on: Weather forecast; Sudden increase in temperature; Store temperature; Freezer sensor data; Remaining stock per shelf life; Sales TPoS feeds; Web searches, social mentions. Adapted from Blueyonder
- 48. Personalized Social Recommendations Collaborative Filtering: Matching Skills to People Prediction: Personalized Skills Recommendation Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
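
A toy sketch of collaborative filtering for skill recommendations. This is not LinkedIn's implementation, which the slide does not describe; the member profiles, the cosine similarity on binary skill sets, and the names `cosine` and `recommend_skills` are all assumptions made for illustration.

```python
import math

# Hypothetical member -> skills data (illustrative only).
profiles = {
    "ana":   {"python", "statistics", "hadoop"},
    "ben":   {"python", "statistics", "machine learning"},
    "chloe": {"python", "hadoop", "machine learning"},
    "dev":   {"sql", "excel"},
}

def cosine(a, b):
    """Cosine similarity between two skill sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def recommend_skills(member, profiles, top_n=3):
    """User-based collaborative filtering: score each skill the member lacks
    by the summed similarity of the other members who list it."""
    mine = profiles[member]
    scores = {}
    for other, skills in profiles.items():
        if other == member:
            continue
        sim = cosine(mine, skills)
        for skill in skills - mine:
            scores[skill] = scores.get(skill, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, score in ranked[:top_n] if score > 0]

print(recommend_skills("ana", profiles))
```

Members with no overlap contribute nothing, so "dev" gets no recommendations here; a real system would need a fallback for such cold-start cases.
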
- 49. Colas: In Which US State Do I Invest Mktg. $? What the Business Analyst Sent What the Data Scientist did…
- 50. The Great Pop vs. Soda Page http://www.popvssoda.com/
- 51. Pop vs. Soda vs. Coke
- 52. Raw Data Will Drive Your Car
- 53. Interested in Data Science? Join our community http://www.meetup.com/Data-Science-London/ Follow us on Twitter @ds_ldn Check out our blog http://datasciencelondon.org
- 54. Thanks for your time