Big Data [sorry] & Data Science: What Does a Data Scientist Do?

What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? - talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure London 25/01/13

  1. 1. Big Data [sorry] & Data Science: What Does a Data Scientist Do? Carlos Somohano, Founder, Data Science London @ds_ldn. The Cloud and Big Data: HDInsight on Azure, London, 25/01/13
  2. 2. Man on the Moon – 1969
  3. 3. Man on the Moon – Small Data! Apollo XI, Man on the Moon. Date: 1969. Speed: 3,500 km/hour. Distance: 356,000 km. Weight: 13,500 kg. Computer program: 64 Kb, 2 Kb RAM, Fortran. Never been there before. Must work 1st time. Lots of complex data. Must return to Earth.
  4. 4. Apollo XI, 1969: 64 Kb. SkyDive Stratos, 2012: Tens of Gigabytes. Think About It – We live in Crazy Times!
  5. 5. Big Data is not about Data Volume
  6. 6. What is Big Data? IT mumbo-jumbo: a fashionable term typically used by some IT vendors to remarket old-fashioned software & hardware
  7. 7. What is Big Data? The n-Vs Volume … Variety … Velocity … (add your own V here…) So What?
  8. 8. Change! Water Cooler Chat: “We need to parallelize data operations but it’s too costly & complex…” “The business can’t get access to all the relevant data, we need external data…” “We can’t match customer master data to live customer interactions…” “We can’t just force everything into a star-schema…” “These BI reports and charts don’t tell us anything we didn’t know…” “We are missing the ETL window, the data we needed didn’t arrive on time…” “We can’t predict with confidence if we can’t explore data & develop our own models”
  9. 9. What is Big Data? Force of Change Big Data forces you to change the way you collect, store, manage, analyze and visualize data
  10. 10. Crude Oil
  11. 11. Big Data = Crude Oil [not New Oil]. Think of data as ‘crude oil.’ Big Data is about extracting the ‘crude oil,’ transporting it in ‘mega-tankers,’ siphoning it through ‘pipelines,’ and storing it in massive ‘silos’… All ‘this’ is about IT Big Data… fine and well… … BUT
  12. 12. You need to refine the ‘crude oil’ Enter Data Science…
  13. 13. The Science [and Art] of… Discovering what we don’t know from data; Obtaining predictive, actionable insight from data; Creating Data Products that have business impact now; Communicating relevant business stories from data; Building confidence in decisions that drive business value
  14. 14. Brief History of Data Science. 6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism… 1974 – Peter Naur @UoC, Datalogy & Data Science. 2001 – William S. Cleveland @CSU, “Data Science: An Action Plan…” 2002 – Committee on Data for Science & Technology (CODATA). 2003 – Journal of Data Science. 2009 – Jeff Hammerbacher @Facebook, “What does a Data Scientist Do?” 2010 – Drew Conway @NYU, “The Data Science Venn Diagram”. 2010 – Hilary Mason & Chris Wiggins @Dataists. 2010 – Mike Loukides @O’Reilly, “What is Data Science?” 2011 – DJ Patil @LinkedIn, data scientist vs. data analyst
  15. 15. Jeff Hammerbacher, 2009: “... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
  16. 16. Mike Loukides, 2010: “Data science enables the creation of data products. Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science.”
  17. 17. Hilary Mason & Chris Wiggins, 2010: “Data science is clearly a blend of the hackers’ arts, statistics and machine learning...; and the expertise in mathematics and the domain of the data for the analysis to be interpretable... It requires creative decisions and open-mindedness in a scientific context.”
  18. 18. Drew Conway, 2010
  19. 19. DJ Patil, 2011: “We realized that as our organizations grew, we both had to figure out what to call the people on our teams. ‘Business analyst’ and ‘Data analyst’ seemed too limiting. The focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new”
  20. 20. What is a Data Scientist?
  21. 21. The Duck-Billed Platypus. The Data Scientist-Billed Platypus.
  22. 22. The Platypus-Billed Data Scientist: Machine Learning, Hacking, Statistics, Math, Visualization, Science, Programming, Data Mining. The Data Scientist-Billed Platypus.
  23. 23. Josh Wills, 2012
  24. 24. Class DataScientist { Is skeptical, curious. Has inquisitive mind. Knows Machine Learning, Statistics, Probability. Applies Scientific Method. Runs Experiments. Is good at Coding & Hacking. Able to deal with IT & Data Engineering. Knows how to build data products. Able to find answers to known unknowns. Tells relevant business stories from data. Has Domain Knowledge. }
  25. 25. What Does a Data Scientist Do?
  26. 26. 10 Things [most] Data Scientists Do: 1. Ask Good Questions. What is What… …we don’t know? …we’d like to know? 2. Define and Test a Hypothesis. Run experiments. 3. Scoop, Scrape, Sink, Sample Business Relevant Data. 4. Munge and Wrestle Data. Tame Data. 5. Explore Data, Discover Data Playfully. Discover unknowns. 6. Model Data. Model Algorithms. 7. Understand Data Relationships. 8. Tell the Machine How to Learn from Data. 9. Create Data Products that Deliver Actionable Insight. 10. Tell Relevant Business Stories from Data.
  27. 27. [Sort of a] Data Scientist Toolkit: Java, R, Python… (bonus: Clojure, Haskell, Scala); Hadoop, HDFS & MapReduce… (bonus: Spark, Storm); HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog); ETL, Webscrapers, Flume, Sqoop… (bonus: Hume); SQL, RDBMS, DW, OLAP…; Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas); D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…; SPSS, Matlab, SAS… (the enterprise man); NoSQL, MongoDB, Couchbase, Cassandra…; And Yes! … MS-Excel: the most used, most underrated DS tool
  28. 28. Foundations of Data Science
  29. 29. [Some] Data Science Principles: 1. Socio-Technical Systems (STS) are complex! 2. Data is never at rest. 3. Data is dirty, deal with it. 4. SVoT = LOL! 5. Data munging & data wrestling: 70% of the time. 6. Simplification. Reduction. Distillation. 7. Curiosity. Empiricism. Skepticism.
  30. 30. Knowns & Unknowns: “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” Donald Rumsfeld
  31. 31. DIKUW FTW! D, I, K, U, W: Data, Information, Knowledge, Understanding, Wisdom. PAST to FUTURE. Roles: Data Engineer, Data Analyst, Data Miner, Data Scientist. Raw / What / How to / Why / When. Numbers / Description / Experience / Cause & Effect / Prediction. Letters / Context / Tested / Proven / What’s best. Symbols / Relationship / Instruction / Known Unknowns / Unknown Unknowns. Signals / Reports / Programs / Models. Known Knowns.
  32. 32. Data Discovery. Data Analyst, Data Scientist. “The new reality for Business Intelligence and Big Data,” Applied Data Labs
  33. 33. Data Models vs. Algorithmic Models. Data Modeling: Y ← F(X, random noise, parameters); we understand the world; how well ‘my data model’ works; Statisticians, Data Analysts, Data Miners; Linear Regression, Logistic Regression; Known Distributions; Confidence Intervals; Goodness of Fit. Algorithmic Modeling: Y ← Black Box ← X; we don’t understand the world; the world produces data in a black box; Data Scientists; Machine Learning, AI, Neural Nets, Random Forests, SVM, GBT; Unknown Multivariate Distributions; Iterative Predictor Variables; Predictive Accuracy. “Statistical Modeling: The Two Cultures,” Leo Breiman, 2001
  34. 34. Learning from Data is Tricky: Statistical vs. Machine Learning; Supervised vs. Unsupervised Learning; Induction vs. Deduction; Sampling; Confidence Intervals; Probability Distribution; Deviation; Variance; Correlation vs. Causation; Causation; Prediction
  35. 35. More Data or Better Models? “More Data Beats Better Algorithms,” Omar Tawakol @BlueKai. “Better Algorithms Beat More Data,” Mark Torrance @RocketFuel. “More Data or Better Models,” Xavier Amatriain @Netflix. “On Chomsky and the Two Cultures of Statistical Learning,” Peter Norvig @Google. “Specialist Knowledge is Useless and Unhelpful,” Jeremy Howard @Kaggle
  36. 36. Data Science Process – An approach
  37. 37. Data Science Process - I. Questions: 1 Known unknowns? 2 We’d like to know…? 3 Outcomes? 4 What data? 5 Hypothesis? The World: Product Manufactured, Goods Shipped, Product Purchased, Phone Calls Made, Energy Consumed, Fraud Committed, Repair Requested. Ingest Raw Data: Transactions; Web-Scraping; Web-click logs; Sensor Data; Mobile Data; Docs, Emails, XLS; Social Feeds, RSS; System, Flume, Sink, HDFS. Munch Data: MapReduce; ETL, ELT; Data Wrangle; Data Cleansing; Data Jujitsu; Dim Reduction; Sample; Select, Join, Bind. The Dataset: Independency? Correlation? Covariance? Causality? Dimensionality? Missing Values? Relevant?
  38. 38. Data Science Process - II. Stages: The Dataset; Explore Data, Represent Data, Discover Data; Learn From Data; Data Product; Deliver Insight, Visualize Insight. Labels: Description, Inference, Objectives, Data, Algorithm, Models, Levers, Actionable, Machine Learning, Modeling, Predictive, Networks, Graphs, Simulation, Immediate Impact, Regression, Prediction, Optimization, Business Value, Classification, Clustering, Visualization, Easy to explain, Experiments, Iteration
  39. 39. What is a Data Product?
  40. 40. A Data Product Is… … Curated and crafted from raw data … A result of exploration and iterations … A machine that learns from data … An answer to known unknowns or unknown unknowns … A mechanism that triggers immediate business value … A probabilistic window of future events or behavior
  41. 41. Data Jiu-Jitsu. Diagram labels: Data Scientist, Data Jiu-Jitsu Fight, Data Product, $$$$. Data Jiu-Jitsu: the ability to turn big data into data products that generate immediate business value (DJ Patil @LinkedIn)
  42. 42. Developing Data Products. Objectives: What Outcome Am I Trying to Achieve? Levers: What Inputs Can We Control? Data: What Data Can We Collect? Models: How the Levers Influence the Objectives. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  43. 43. Objective-Based Data Products. What Outcome Am I Trying to Achieve? Data. The Model Assembly Line: Modeler, Simulator, Optimizer. Actionable Outcome. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  44. 44. 5 Great Data Products
  45. 45. Customer Lifecycle Value. Objective: Optimize CLV. Product Recommendations. Visualizer. Data, Modeler, Simulator, Optimizer. Models: 1. Products the customer may like. 2. Price Elasticity. 3. Probability of Purchase w/o Recommendation. 4. Purchase Sequence. 5. Causality Model. 6. Patience Model. Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  46. 46. Automated Fruits Procurement. Confirm Purchase Orders in less than 2 hours. 12,000 stores. 300 Fruits. Avg. Shelf life 3 days. Questions: Safety Stock levels? Demand vs. Stock? Price vs. Demand? Anomalies? Fruit Shortages? Fruit Write-offs? Adapted from Blueyonder
  47. 47. Strawberries & the Weather. No sales vs. X,XXX sales predicted. Why these huge stock write-offs? A Predictive Model that calculates strawberry purchases based on: Weather forecast; Sudden increase in temperature; Store temperature; Freezer sensor data; Remaining stock per shelf life; Sales TPoS feeds; Web searches, social mentions. Adapted from Blueyonder
  48. 48. Personalized Social Recommendations. Collaborative Filtering: Matching Skills to People. Prediction: Personalized Skills Recommendation. Adapted from “Developing Data Products” by Peter Skomoroch, 5 Dec 2012. Copyright LinkedIn
  49. 49. Colas: In Which US State Do I Invest Mktg. $? What the Business Analyst Sent. What the Data Scientist Did…
  50. 50. The Great Pop vs. Soda Page
  51. 51. Pop vs. Soda vs. Coke
  52. 52. Raw Data Will Drive Your Car
  53. 53. Interested in Data Science? Join our community. Follow us on Twitter @ds_ldn. Check out our blog.
  54. 54. Thanks for your time
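
Slide 33 contrasts data modeling (assume a form for F and estimate its parameters) with algorithmic modeling (treat the world as a black box and judge by predictive accuracy). The sketch below is one illustrative way to see the difference, not something taken from the talk: the data is synthetic and made up, a straight line plays the data model, and k-nearest neighbours stands in for the heavier algorithmic methods the slide lists (Random Forests, SVM, GBT). The function names are hypothetical.

```python
import random

def fit_linear(xs, ys):
    """Data model: assume y = a + b*x + noise and estimate (a, b) by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def knn_predict(train_x, train_y, x, k=5):
    """Algorithmic model: no assumed form, average the k nearest training targets."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(preds, targets):
    """Mean squared error, used here as the measure of predictive accuracy."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Synthetic, made-up data: the "world" is y = x^2 plus noise, which the
# straight-line data model cannot represent.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(300)]
ys = [x * x + rng.gauss(0, 0.3) for x in xs]
train_x, test_x = xs[:200], xs[200:]
train_y, test_y = ys[:200], ys[200:]

a, b = fit_linear(train_x, train_y)
linear_mse = mse([a + b * x for x in test_x], test_y)
knn_mse = mse([knn_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"linear data model held-out MSE: {linear_mse:.3f}")
print(f"k-NN algorithmic held-out MSE:  {knn_mse:.3f}")
```

The linear fit yields interpretable parameters but a poor fit when its assumed form is wrong. The k-NN predictor has no parameters to interpret and is evaluated only on held-out error, which is the trade-off the slide attributes to Breiman's two cultures.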
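
Slides 42 and 43 describe the Drivetrain approach: pick an objective, identify the levers, collect data, then chain a modeler, a simulator and an optimizer. A minimal sketch of that assembly line follows, under loud assumptions: the objective (revenue), the lever (price), the linear demand model, the grid search and every number in `observations` are invented for illustration and do not come from the talk or from the Drivetrain article.

```python
# Data: hypothetical (price, units sold) observations, invented for illustration.
observations = [(1.0, 90), (1.5, 81), (2.0, 69), (2.5, 61), (3.0, 50), (3.5, 39)]

def build_demand_model(data):
    """Modeler: fit units = a + b * price by least squares."""
    n = len(data)
    mean_p = sum(p for p, _ in data) / n
    mean_u = sum(u for _, u in data) / n
    spp = sum((p - mean_p) ** 2 for p, _ in data)
    spu = sum((p - mean_p) * (u - mean_u) for p, u in data)
    b = spu / spp
    a = mean_u - b * mean_p
    # Predicted demand is clamped at zero so high prices cannot yield negative units.
    return lambda price: max(0.0, a + b * price)

def simulate_revenue(demand, price):
    """Simulator: predicted objective (revenue) for one setting of the lever."""
    return price * demand(price)

def optimize_price(demand, candidates):
    """Optimizer: pick the lever setting with the best simulated objective."""
    return max(candidates, key=lambda p: simulate_revenue(demand, p))

demand = build_demand_model(observations)
candidates = [round(0.5 + 0.1 * i, 1) for i in range(46)]  # prices 0.5 .. 5.0
best_price = optimize_price(demand, candidates)

print("suggested price:", best_price)
print("simulated revenue:", round(simulate_revenue(demand, best_price), 2))
```

Keeping the three stages as separate functions mirrors the slide's "model assembly line": the demand model can be swapped for something richer without touching the simulator or the optimizer.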
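
Slide 48 mentions collaborative filtering for matching skills to people. The sketch below shows one simple item-based variant, assuming Jaccard similarity between the sets of people who list each skill. It is an illustration of the general technique only, not LinkedIn's method, and the `profiles` data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical member -> skills data, invented for illustration.
profiles = {
    "ana": {"python", "statistics", "machine learning"},
    "ben": {"python", "hadoop", "machine learning"},
    "cho": {"sql", "etl", "hadoop"},
    "dev": {"python", "statistics"},
    "eli": {"sql", "etl"},
}

def skill_owners(profiles):
    """Invert the data: for each skill, the set of people who list it."""
    owners = defaultdict(set)
    for person, skills in profiles.items():
        for skill in skills:
            owners[skill].add(person)
    return owners

def jaccard(a, b):
    """Overlap of two sets relative to their union (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def recommend(person, profiles, top_n=2):
    """Score each skill the person lacks by its similarity to the skills they have."""
    owners = skill_owners(profiles)
    have = profiles[person]
    scores = {}
    for candidate, cand_owners in owners.items():
        if candidate in have:
            continue
        scores[candidate] = sum(jaccard(cand_owners, owners[s]) for s in have)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, score in ranked[:top_n] if score > 0]

print(recommend("dev", profiles))
print(recommend("eli", profiles))
```

Skills that co-occur with what a person already lists rank highest, and skills with no overlap at all are dropped rather than recommended.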