Big Data [sorry] & Data Science: What Does a Data Scientist Do?

What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? Talk by Carlos Somohano (@ds_ldn) at The Cloud and Big Data: HDInsight on Azure, London, 25/01/13.




    Presentation Transcript

    • Big Data [sorry] & Data Science: What Does a Data Scientist Do? Carlos Somohano, Founder, Data Science London @ds_ldn datasciencelondon.org The Cloud and Big Data: HDInsight on Azure London 25/01/13
    • Man on the Moon – 1969
    • Man on the Moon – Small Data! Computer Program Apollo XI Man on the Moon Date: 1969 Speed: 3,500 km/hour Distance: 356,000 km 64 Kb, 2Kb RAM, Fortran Weight: 13,500 kg Never been there before Must work 1st time Lots of complex data Must return to Earth
    • Apollo XI, 1969 SkyDive Stratos, 2012 64 Kb Tens of Gigabytes Think About It – We live in Crazy Times!
    • Big Data is not about Data Volume
    • What is Big Data? IT mumbo-jumbo A fashionable term typically used by some IT vendors to remarket old fashioned software & hardware
    • What is Big Data? The n-Vs Volume … Variety … Velocity … (add your own V here…) So What?
    • Change! Water Cooler Chat We need to parallelize data operations but it’s too costly & complex … The business can’t get access to all the relevant data, we need external data… We can’t match customer master data to live customer interactions… We can’t just force everything into a star-schema… These BI reports and charts don’t tell us anything we didn’t know… We are missing the ETL window, the data we needed didn’t arrive on time… We can’t predict with confidence if we can’t explore data & develop our own models
    • What is Big Data? Force of Change Big Data forces you to change the way you collect, store, manage, analyze and visualize data
    • Crude Oil
    • Big Data = Crude Oil [not New Oil] Think of data as ‘crude oil.’ Big Data is about extracting the ‘crude oil,’ transporting it in ‘mega-tankers,’ siphoning it through ‘pipelines,’ and storing it in massive ‘silos’… All ‘this’ is about IT Big Data… fine and well… … BUT
    • You need to refine the ‘crude oil’ Enter Data Science…
    • The Science [and Art] of… Discovering what we don’t know from data Obtaining predictive, actionable insight from data Creating Data Products that have business impact now Communicating relevant business stories from data Building confidence in decisions that drive business value
    • Brief History of Data Science 6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism… 1974 – Peter Naur @UoC Datalogy & Data Science 2001 – William S. Cleveland @CSU "Data Science: An Action Plan …" 2002 – Committee on Data for Science & Technology (CODATA) 2003 – Journal of Data Science 2009 – Jeff Hammerbacher @Facebook What does a Data Scientist Do? 2010 – Drew Conway @NYU The Data Science Venn Diagram 2010 – Hilary Mason & Chris Wiggins @Dataists 2010 – Mike Loukides @O’Reilly “What is Data Science?” 2011 – DJ Patil @LinkedIn data scientist vs. data analyst
    • Jeff Hammerbacher, 2009 "... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization."
    • Mike Loukides, 2010 "Data science enables the creation of data products." "Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science."
    • Hilary Mason & Chris Wiggins, 2010 "Data science is clearly a blend of the hackers’ arts, statistics and machine learning...; and the expertise in mathematics and the domain of the data for the analysis to be interpretable... It requires creative decisions and open-mindedness in a scientific context."
    • Drew Conway, 2010
    • DJ Patil, 2011 "We realized that as our organizations grew, we both had to figure out what to call the people on our teams. 'Business analyst' and 'Data analyst' seemed too limiting. The focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new."
    • What is a Data Scientist?
    • The Duck – Billed Platypus The Data Scientist – Billed Platypus
    • The Platypus – Billed Data Scientist Machine Learning Hacking Statistics Math Visualization Science Programming Data Mining The Data Scientist – Billed Platypus
    • Josh Wills, 2012
    • Class DataScientist { Is skeptical, curious. Has inquisitive mind Knows Machine Learning, Statistics, Probability Applies Scientific Method. Runs Experiments Is good at Coding & Hacking Able to deal with IT Data Engineering Knows how to build data products Able to find answers to known unknowns Tells relevant business stories from data Has Domain Knowledge }
    • What Does a Data Scientist Do?
    • 10 Things [most] Data Scientists Do 1  Ask Good Questions. What is What… …we don’t know? …we’d like to know? 2  Define and Test a Hypothesis. Run experiments 3  Scoop, Scrape, Sink, & Sample Business Relevant Data 4  Munge and Wrestle Data. Tame Data 5  Explore Data, Discover Data Playfully. Discover unknowns. 6  Model Data. Model Algorithms. 7  Understand Data Relationships 8  Tell the Machine How to Learn from Data 9  Create Data Products that Deliver Actionable Insight 10  Tell Relevant Business Stories from Data
    • [Sort of a] Data Scientist Toolkit §  Java, R, Python… (bonus: Clojure, Haskell, Scala) §  Hadoop, HDFS & MapReduce… (bonus: Spark, Storm) §  HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog) §  ETL, Web scrapers, Flume, Sqoop… (bonus: Hume) §  SQL, RDBMS, DW, OLAP… §  Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas) §  D3.js, Gephi, ggplot2, Tableau, Flare, Shiny… §  SPSS, Matlab, SAS… (the enterprise man) §  NoSQL, MongoDB, Couchbase, Cassandra… §  And Yes! … MS-Excel: the most used, most underrated DS tool
    • Foundations of Data Science
    • [Some] Data Science Principles 1  Socio-Technical Systems (STS) are complex! 2  Data is never at rest 3  Data is dirty, deal with it 4  SVoT = LOL! 5  Data munging & data wrestling > 70% time 6  Simplification. Reduction. Distillation 7  Curiosity. Empiricism. Skepticism
    • Knowns & Unknowns "There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know." Donald Rumsfeld
    • DIKUW FTW! D I K U W Data Information Knowledge Understanding Wisdom PAST FUTURE Data Engineer Data Analyst Data Miner Data Scientist Raw What How to Why When Numbers Description Experience Cause & Effect Prediction Letters Context Tested Proven What’s best Known Unknown Symbols Relationship Instruction Unknowns Unknowns Known Knowns Signals Reports Programs Models
    • Data Discovery Data Analyst Data Scientist The new reality for Business Intelligence and Big Data, Applied Data Labs
    • Data Models vs. Algorithmic Models Data Modeling VS. Algorithmic Modeling Y ← F(X, random noise, parameters) Y ← Black Box ← X Random Forests We understand the world We don’t understand the world How well ‘my data model’ works The world produces data in a black-box Statisticians, Data Analysts, Data Miners Data Scientists Linear Regression Machine Learning, AI & Neural Nets Logistic Regression Random Forests, SVM, GBT Known Distributions Unknown Multivariate Distributions Confidence Intervals Iterative Predictor Variables & Goodness of Fit Predictive Accuracy “Statistical Modeling: The Two Cultures” Leo Breiman, 2001
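    The contrast on this slide can be sketched in a few lines of Python. This is a minimal illustration, not something from the talk: the synthetic data, the helper names (`fit_line`, `knn_predict`, `mse`) and the choice of k-nearest neighbours as the "black box" are all assumptions. The data model commits to the form Y = aX + b + noise and estimates its parameters; the algorithmic model assumes no form and is judged only on predictive accuracy.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]


def fit_line(x, y):
    """Data modeling: assume y = a*x + b + noise and estimate a, b by least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def knn_predict(x_new, x, y, k=5):
    """Algorithmic modeling: no assumed form; average the k nearest training targets."""
    nearest = sorted(zip(x, y), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(t for _, t in nearest) / k


def mse(pred, actual):
    """Mean squared error on held-out data (the 'predictive accuracy' yardstick)."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


slope, intercept = fit_line(train_x, train_y)
line_mse = mse([slope * x + intercept for x in test_x], test_y)
knn_mse = mse([knn_predict(x, train_x, train_y) for x in test_x], test_y)

print(f"data model: y = {slope:.2f}x + {intercept:.2f}, test MSE {line_mse:.2f}")
print(f"k-NN black box: test MSE {knn_mse:.2f}")
```

    The fitted slope and intercept can be read and interpreted; the k-NN predictor offers nothing to interpret, only a held-out error to compare.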
    • Learning from Data is Tricky Statistical vs. Machine Learning Supervised vs. Unsupervised Learning Induction vs. Deduction Sampling & Confidence Intervals Probability & Distribution Deviation & Variance Correlation vs. Causation Causation & Prediction
    • More Data or Better Models? More Data Beats Better Algorithms, Omar Tawakol @BlueKai Better Algorithms Beat More Data, Mark Torrance @RocketFuel More Data or Better Models, Xavier Amatriain @Netflix On Chomsky & 2 Cultures of Statistical Learning, Peter Norvig @Google Specialist Knowledge is Useless & Unhelpful, Jeremy Howard @Kaggle
    • Data Science Process – An approach
    • Data Science Process - I 1  Known Unknowns? 2  We’d like to know…? 3  Outcomes? 4  What Data? 5  Hypothesis? The World Ingest Raw Data Munch Data The Dataset Product Manufactured Transactions MapReduce Independency? Goods shipped Web-Scraping ETL, ELT Correlation? Product purchased Web-clicks & logs Data Wrangle Covariance? Phone Calls Made Sensor Data Data Cleansing Causality? Energy Consumed Mobile Data Data Jujitsu Dimensionality? Fraud Committed Docs, Emails, XLS Dim Reduction Missing Values? Repair Requested Social Feeds, RSS Sample Relevant? System Flume & Sink HDFS Select, Join, Bind
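    As a rough illustration of the cleansing step between raw data and "The Dataset", here is a small Python sketch. The records, field names and the `to_float` and `clean` helpers are hypothetical, and dropping incomplete rows is only one of several reasonable ways to handle missing values.

```python
# Hypothetical raw records, as they might arrive from logs or spreadsheets.
raw = [
    {"store": " London ", "units": "12", "temp_c": "18.5"},
    {"store": "leeds", "units": "", "temp_c": "16.0"},      # missing value
    {"store": "LONDON", "units": "7", "temp_c": "n/a"},     # unparseable value
    {"store": "Leeds", "units": "9", "temp_c": "15.2"},
]


def to_float(text):
    """Return the parsed number, or None when the field is empty or malformed."""
    try:
        return float(text)
    except ValueError:
        return None


def clean(records):
    """Normalise store names and drop rows with missing or unparseable numbers."""
    cleaned = []
    for row in records:
        units, temp = to_float(row["units"]), to_float(row["temp_c"])
        if units is None or temp is None:
            continue
        cleaned.append(
            {"store": row["store"].strip().title(), "units": units, "temp_c": temp}
        )
    return cleaned


dataset = clean(raw)
print(dataset)
```

    Two of the four rows survive; in practice one might impute the missing values instead of discarding the rows, depending on how much data is lost.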
    • Data Science Process - II The Dataset Explore Data Represent Data Discover Data Deliver Insight Learn From Data Data Product Visualize Insight Description & Inference Objectives Data & Algorithm Models Levers Actionable Machine Learning Modeling Predictive Networks & Graphs Simulation Immediate Impact Regression & Prediction Optimization Business Value Classification & Clustering Visualization Easy to explain Experiments & Iteration
    • What is a Data Product?
    • A Data Product Is… … Curated and crafted from raw data … A result of exploration and iterations … A machine that learns from data … An answer to known unknowns or unknown unknowns … A mechanism that triggers immediate business value … A probabilistic window of future events or behavior
    • Data Jiu-Jitsu Data Jiu Jitsu Fight $$$$ Data Product Data Scientist Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value (DJ Patil @LinkedIn)
    • Developing Data Products Objectives Levers Data Models What Outcome What Inputs Can What Data Can How the Levers Am I Trying to We Control? We Collect? Influence the Achieve? Objectives Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
    • Objective-Based Data Products What Outcome Am I ActionableTrying to Achieve? Outcome Data Modeler Simulator Optimizer The Model Assembly Line Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
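    The Modeler, Simulator and Optimizer chain can be sketched as three small functions. This is a toy under stated assumptions, not the Drivetrain authors' implementation: the objective is taken to be expected revenue per offer, the single lever is price, and the observations and function names (`build_model`, `simulate`, `optimize`) are made up for illustration.

```python
from collections import defaultdict

# Data: hypothetical past offers as (price shown, whether the customer bought).
observations = [
    (5, True), (5, True), (5, True), (5, False),
    (10, True), (10, True), (10, False), (10, False),
    (15, True), (15, False), (15, False), (15, False),
]


def build_model(data):
    """Modeler: estimate P(purchase | price) as the observed purchase rate."""
    counts = defaultdict(lambda: [0, 0])  # price -> [purchases, offers]
    for price, bought in data:
        counts[price][0] += int(bought)
        counts[price][1] += 1
    return {price: bought / offers for price, (bought, offers) in counts.items()}


def simulate(model, price):
    """Simulator: expected revenue per offer if the lever (price) is set to `price`."""
    return price * model[price]


def optimize(model):
    """Optimizer: pick the lever setting with the best simulated outcome."""
    return max(model, key=lambda price: simulate(model, price))


model = build_model(observations)
best_price = optimize(model)
print(best_price, simulate(model, best_price))  # prints: 10 5.0
```

    Keeping the three stages separate means the model can be replaced (for example by a fitted demand curve) without touching the simulator or the optimizer.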
    • 5 Great Data Products
    • Customer Lifecycle Value Optimize CLV Product Recommendations Visualizer Data Modeler Simulator Optimizer 1  Products the customer may like 2  Price Elasticity 3  Probability of Purchase w/o Recommendation 4  Purchase Sequence 5  Causality Model 6  Patience Model Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
    • Automated Fruits Procurement Confirm Purchase Orders In less than 2 hours Safety Stock levels? Demand vs Stock? Price vs. Demand? 12,000 stores Anomalies? 300 Fruits Fruit Shortages? Avg. Shelf life < 3 days Fruit Write-offs? Adapted from Blueyonder
    • Strawberries & the Weather No sales vs X,XXX sales predicted Why these huge stock write-offs? A Predictive Model that calculates strawberry purchases based on Weather forecast Sudden increase in temperature Store temperature Freezer sensor data Remaining stock per shelf life Sales TPoS feeds Web searches, social mentions Adapted from Blueyonder
    • Personalized Social Recommendations Collaborative Filtering: Matching Skills to People Prediction: Personalized Skills Recommendation Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
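    A very small sketch of the collaborative filtering idea, assuming a simple co-occurrence scheme: skills that often appear alongside the ones a member already lists are recommended first. The profiles, names and scoring rule are hypothetical; this is not a description of LinkedIn's actual system.

```python
from collections import Counter
from itertools import permutations

# Hypothetical member -> skills data; the names are made up for illustration.
profiles = {
    "ana": {"python", "statistics", "hadoop"},
    "ben": {"python", "statistics", "r"},
    "cho": {"python", "hadoop", "pig"},
    "dee": {"statistics", "r"},
}

# Count how often each ordered pair of skills appears on the same profile.
co_occurrence = Counter()
for skills in profiles.values():
    co_occurrence.update(permutations(sorted(skills), 2))


def recommend(member, top_n=2):
    """Score skills the member lacks by co-occurrence with skills they have."""
    owned = profiles[member]
    scores = Counter()
    for (have, candidate), count in co_occurrence.items():
        if have in owned and candidate not in owned:
            scores[candidate] += count
    # Sort by score, then name, so ties are broken deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, _ in ranked[:top_n]]


print(recommend("dee"))  # prints: ['python', 'hadoop']
```

    Here "dee" lists statistics and R; Python co-occurs with those skills on three counts and Hadoop on one, so Python ranks first.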
    • Colas: In Which US State Do I Invest Mktg. $? What the Business Analyst Sent What the Data Scientist did…
    • The Great Pop vs. Soda Page http://www.popvssoda.com/
    • Pop vs. Soda vs. Coke
    • Raw Data Will Drive Your Car
    • Interested in Data Science? Join our community http://www.meetup.com/Data-Science-London/ Follow us on Twitter @ds_ldn Check out our blog http://datasciencelondon.org
    • Thanks for your time