Big Data [sorry] & Data Science:
What Does a Data Scientist Do?	

                                                                 Carlos Somohano	

                                                         Founder Data Science London	



The Cloud and Big Data: HDInsight on Azure London 25/01/13
Man on the Moon – 1969
Man on the Moon – Small Data! 	

Computer Program	

          Apollo XI

              Man on the Moon	

Date: 1969

               Speed: 3,500 km/hour	

   Distance: 356,000 km

64 Kb, 2Kb RAM, Fortran	

   Weight: 13,500 kg	

      Never been there before	

Must work 1st time	

        Lots of complex data	

   Must return to Earth
Apollo XI, 1969	

    SkyDive Stratos, 2012	

       64 Kb	

            Tens of Gigabytes	

Think About It – We live in Crazy Times!
Big Data is not about Data Volume
What is Big Data? IT mumbo-jumbo	

  A fashionable term typically used by some IT
  vendors to remarket old fashioned software 
What is Big Data? The n-Vs	

        Volume …	

        Variety …	

        Velocity …	

        (add your own V here…)	

        So What?
Change! Water Cooler Chat	

We need to parallelize data operations but it’s too costly & complex …

The business can’t get access to all the relevant data, we need external data…	

We can’t match customer master data to live customer interactions…	

We can’t just force everything into a star-schema…	

These BI reports and charts don’t tell us anything we didn’t know…	

We are missing the ETL window, the data we needed didn’t arrive on time…	

We can’t predict with confidence if we can’t explore data & develop our own models
What is Big Data? Force of Change	

 Big Data forces you to change the way you collect,
 store, manage, analyze and visualize data
Crude Oil
Big Data = Crude Oil [not New Oil]	

Think of data as ‘crude oil.’

Big Data is about extracting the ‘crude oil,’
transporting it in ‘mega-tankers,’ siphoning it through
‘pipelines,’ and storing it in massive ‘silos’… 	

All ‘this’ is about IT Big Data… fine and well…	

You need to refine the ‘crude oil’	

       Enter Data Science…
The Science [and Art] of… 	


Discovering what we don’t know from data	


Obtaining predictive, actionable insight from data	


Creating Data Products that have business impact now	


Communicating relevant business stories from data	


Building confidence in decisions that drive business value
Brief History of Data Science	

6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism…

1974 – Peter Naur @UoC Datalogy & Data Science

2001 – William S. Cleveland @CSU Data Science: An Action Plan …

2002 – Committee on Data for Science & Technology (CODATA)

2003 – Journal of Data Science

2009 – Jeff Hammerbacher @ Facebook What does a Data Scientist Do?

2010 – Drew Conway @NYU The Data Science Venn Diagram

2010 – Hilary Mason & Chris Wiggins @Dataists “

2010 – Mike Loukides @O’Reilly “What is Data Science?”

2011 – DJ Patil @LinkedIn data scientist vs. data analyst
Jeff Hammerbacher, 2009	

“... on any given day, a team member could author a
multistage processing pipeline in Python, 	


design a hypothesis test, perform a regression analysis
over data samples with R, 	


design and implement an algorithm for some data-
intensive product or service in Hadoop, or
communicate the results of our analyses to other
members of the organization.”
Mike Loukides, 2010	

Data science enables the creation of data products.


Whether... data is search terms, voice samples, or
product reviews,... users are in a feedback loop in
which they contribute to the products they use. 	


That's the beginning of data science.
Hilary Mason & Chris Wiggins, 2010

  Data science is clearly a blend of the hackers’ arts, statistics
  and machine learning...; 	


  and the expertise in mathematics and the domain of the
  data for the analysis to be interpretable... 	


  It requires creative decisions and open-mindedness in a
  scientific context.
Drew Conway, 2010
DJ Patil, 2011	

“We realized that as our organizations grew, we both had to figure out
what to call the people on our teams. ‘Business analyst’ and ‘Data analyst’
seemed too limiting.


The focus of our teams was to work on data applications that would have
an immediate and massive impact on the business. 	


The term that seemed to fit best was data scientist: those who use both
data and science to create something new”
What is a Data Scientist?
The Duck – Billed Platypus	

       The Data Scientist – Billed Platypus
The Platypus – Billed Data Scientist	

                                                   Machine Learning	







                 Data Mining	

                    The Data Scientist – Billed Platypus
Josh Wills, 2012
Class DataScientist {	

 Is skeptical, curious. Has inquisitive mind 	

 Knows Machine Learning, Statistics, Probability	

 Applies Scientific Method. Runs Experiments	

 Is good at Coding & Hacking

 Able to deal with IT Data Engineering	

 Knows how to build data products	

 Able to find answers to known unknowns	

 Tells relevant business stories from data	

 Has Domain Knowledge 	

What Does a Data Scientist Do?
10 Things [most] Data Scientists Do	

      1  Ask Good Questions. What is What… 	

           …we don’t know?	

           …we’d like to know?	

      2  Define and Test a Hypothesis. Run experiments

      3  Scoop, Scrape, Sink & Sample Business Relevant Data

      4  Munge and Wrestle Data. Tame Data	

      5  Explore Data, Discover Data Playfully. Discover unknowns.	

      6  Model Data. Model Algorithms.	

      7  Understand Data Relationships	

      8  Tell the Machine How to Learn from Data	

      9  Create Data Products that Deliver Actionable Insight 	

      10  Tell Relevant Business Stories from Data
[Sort of a] Data Scientist Toolkit	

   §  Java, R, Python… (bonus: Clojure, Haskell, Scala)	

   §  Hadoop, HDFS & MapReduce… (bonus: Spark, Storm)

   §  HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog)

   §  ETL, Webscrapers, Flume, Sqoop… (bonus: Hume)

   §  SQL, RDBMS, DW, OLAP…

   §  Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas)

   §  D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…

   §  SPSS, Matlab, SAS… (the enterprise man)	

   §  NoSQL, Mongo DB, Couchbase, Cassandra…	

   §  And Yes! … MS-Excel: the most used, most underrated DS tool
Foundations of Data Science
[Some] Data Science Principles	

    1    Socio-Technical Systems (STS) are complex!	

    2    Data is never at rest	

    3    Data is dirty, deal with it	

    4    SVoT = LOL!	

    5    Data munging & data wrestling: 70% of the time

    6    Simplification. Reduction. Distillation	

    7    Curiosity. Empiricism. Skepticism
Knowns & Unknowns

There are known knowns. These are things we know
that we know. 	

There are known unknowns. That is to say, there are
things that we know we don't know.	

But there are also unknown unknowns. There are
things we don't know we don't know.

                                    Donald Rumsfeld

     D                  I                  K                  U                  W

    Data           Information         Knowledge        Understanding         Wisdom

                                  PAST                                       FUTURE

Data Engineer   Data Analyst   Data Miner   Data Scientist

    Raw               What              How to               Why               When

   Numbers         Description         Experience       Cause & Effect      Prediction

   Letters           Context             Tested             Proven          What’s best

   Symbols         Relationship        Instruction      Known Unknowns        Unknown

                   Known Knowns

   Signals           Reports            Programs            Models
Data Discovery	

                                      Data Analyst	

                                                        Data Scientist	

The new reality for Business Intelligence and Big Data, Applied Data Labs
Data Models vs. Algorithmic Models

Data Modeling                                 Algorithmic Modeling

Y ← F(X, random noise, parameters)            Y ← [Black Box] ← X

We understand the world                       We don’t understand the world

How well ‘my data model’ works                The world produces data in a black-box

Statisticians, Data Analysts, Data Miners     Data Scientists

Linear Regression                             Machine Learning, AI & Neural Nets

Logistic Regression                           Random Forests, SVM, GBT

Known Distributions                           Unknown Multivariate Distributions

Confidence Intervals

Predictor Variables & Goodness of Fit         Predictive Accuracy

“Statistical Modeling: The Two Cultures” Leo Breiman, 2001
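
The contrast between the two cultures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything taken from the talk: the toy "world" (a quadratic relationship plus Gaussian noise), the helper names `make_data`, `fit_linear`, `fit_knn` and `mse`, and the choice of k-nearest-neighbours as the "black box" are all hypothetical. The data model assumes a form, Y ← F(X, noise, parameters), and estimates its parameters; the algorithmic model assumes nothing about the form and is judged purely on predictive accuracy over held-out data.

```python
import random
import statistics

def make_data(n, seed):
    """Toy 'world' (an assumption for illustration): y = x^2 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """Data modeling: assume y = a + b*x + noise, estimate a and b by least squares."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Algorithmic modeling: no assumed form, average the k nearest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean(y for _, y in nearest)

    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on a dataset."""
    return statistics.fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Fit on one sample, score on a separate held-out sample.
train_x, train_y = make_data(200, seed=1)
test_x, test_y = make_data(200, seed=2)

linear = fit_linear(train_x, train_y)
knn = fit_knn(train_x, train_y)

linear_mse = mse(linear, test_x, test_y)
knn_mse = mse(knn, test_x, test_y)
print(f"linear model test MSE: {linear_mse:.3f}")
print(f"k-NN model   test MSE: {knn_mse:.3f}")
```

In this toy setup the assumed linear form is wrong, so the black-box predictor wins on held-out error. If the world really were linear, the data model would likely do at least as well and would also yield interpretable parameters.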
Learning from Data is Tricky	

      Statistical vs. Machine Learning	

      Supervised vs. Unsupervised Learning	

      Induction vs. Deduction	

      Sampling & Confidence Intervals

      Probability & Distribution

      Deviation & Variance

      Correlation vs. Causation

      Causation & Prediction
More Data or Better Models?	

More Data Beats Better Algorithms, Omar Tawakol @BlueKai


Better Algorithms Beat More Data, Mark Torrance @RocketFuel	


More Data or Better Models, Xavier Amatriain @Netflix


On Chomsky & 2 Cultures of Statistical Learning, Peter Norvig @Google


Specialist Knowledge is Useless & Unhelpful, Jeremy Howard @Kaggle
Data Science Process – An approach
Data Science Process - I

      1  Known Unknowns? 	

      2  We’d like to know…?	

      3  Outcomes?	

      4  What Data?	

      5  Hypothesis?	

The World
   Product Manufactured
   Goods shipped
   Product purchased
   Phone Calls Made
   Energy Consumed
   Fraud Committed
   Repair Requested

Ingest Raw Data
   ETL, ELT
   Web-clicks & logs
   Sensor Data
   Mobile Data
   Docs, Emails, XLS
   Social Feeds, RSS
   Flume Sink HDFS

Munge Data
   Data Wrangle
   Data Cleansing
   Data Jujitsu
   Dim Reduction
   Missing Values?
   Select, Join, Bind

The Dataset
Data Science Process - II	

The Dataset	

   Explore Data
   Represent Data
   Discover Data
   Learn From Data
      Description & Inference
      Data & Algorithm Models
      Machine Learning
      Networks & Graphs
      Regression & Prediction
      Classification & Clustering
      Experiments & Iteration

   Data Product
      Immediate Impact
      Business Value
      Easy to explain

   Deliver Insight
   Visualize Insight
What is a Data Product?
A Data Product Is… 	

… Curated and crafted from raw data	

… A result of exploration and iterations	

… A machine that learns from data 	

… An answer to known unknowns or unknown unknowns	

… A mechanism that triggers immediate business value	

… A probabilistic window of future events or behavior
Data Jiu-Jitsu	


                                                    Jiu Jitsu Fight 	


                                                                     Data Product	

 Data Scientist	

Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value	

                                                                             (DJ Patil @LinkedIn)
Developing Data Products	





       What Outcome Am I Trying to Achieve?
       What Inputs Can We Control?
       What Data Can We Collect?
       How the Levers Influence the …


Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
Objective-Based Data Products	

What Outcome Am I                                                                                                          Actionable
Trying to Achieve?	






                                                 The Model Assembly Line 	

Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
5 Great Data Products
Customer Lifecycle Value	

             Optimize CLV	

                         Product Recommendations	






                                  1  Products the customer may like	

                                  2  Price Elasticity	

                                  3  Probability of Purchase w/o Recommendation	

                                  4  Purchase Sequence	

                                  5  Causality Model	

                                  6  Patience Model	

Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
Automated Fruits Procurement	

                                Confirm Purchase Orders	

                                In less than 2 hours	

                                Safety Stock levels?	

                                Demand vs Stock?	

                                Price vs. Demand?	

 12,000 stores	


 300 Fruits	

                  Fruit Shortages?	

 Avg. Shelf life: 3 days

   Fruit Write-offs?	

 Adapted from Blueyonder
Strawberries & the Weather

                                         No sales vs X,XXX sales predicted	

Why these huge stock write-offs?	

                                       A Predictive Model that calculates
                                       strawberry purchases based on	


                                          Weather forecast	

   Sudden increase in temperature	

      Store temperature	

                                          Freezer sensor data	

                                          Remaining stock per shelf life

                                          Sales TPoS feeds	

                                          Web searches, social mentions 	

   Adapted from Blueyonder
Personalized Social Recommendations	

 Collaborative Filtering: Matching Skills to People	

             Prediction: Personalized Skills Recommendation	

 Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
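
As a rough illustration of collaborative filtering for skill recommendations, here is a minimal sketch in Python. It is not LinkedIn's method: the member profiles are invented, the helper names `cooccurrence` and `recommend` are hypothetical, and simple co-occurrence counting stands in for whatever model a production system would use. The idea is that skills which often appear together on other people's profiles are good candidates to suggest to someone who lists only one of them.

```python
from collections import Counter
from itertools import permutations

# Hypothetical member -> skills data, invented purely for illustration.
profiles = {
    "ana": {"python", "statistics", "machine learning"},
    "ben": {"python", "machine learning", "hadoop"},
    "chloe": {"statistics", "r", "machine learning"},
    "dev": {"python", "hadoop"},
}

def cooccurrence(profiles):
    """For every skill, count how often each other skill shares a profile with it."""
    counts = {}
    for skills in profiles.values():
        for a, b in permutations(sorted(skills), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def recommend(member_skills, counts, top_n=3):
    """Score skills the member lacks by co-occurrence with the skills they have."""
    scores = Counter()
    for skill in member_skills:
        for other, n in counts.get(skill, {}).items():
            if other not in member_skills:
                scores[other] += n
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, _ in ranked[:top_n]]

counts = cooccurrence(profiles)
suggestions = recommend({"python"}, counts)
print(suggestions)
```

A real recommender would normalise for skill popularity and work at far larger scale; the sketch only shows the matching step.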
Colas: In Which US State Do I Invest Mktg. $?

            What the Business Analyst Sent	

                                                What the Data Scientist did…
The Great Pop vs. Soda Page	

Pop vs. Soda vs. Coke
Raw Data Will Drive Your Car
Interested in Data Science?	

Join our community	

Follow us on Twitter 	


Check out our blog
Thanks for your time

More Related Content

What's hot

Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Sampath Kumar
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
Srinath Perera
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
Jason Geng
Data science presentation
Data science presentationData science presentation
Data science presentation
Data Science
Data ScienceData Science
Data Science
Amit Singh
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
Data science
Data scienceData science
Data science
Ranjit Nambisan
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
Gang Tao
Data science Big Data
Data science Big DataData science Big Data
Data science Big Data
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
Data Science Club
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
Mithlesh Sadh
Data Science
Data ScienceData Science
Data Science
Rabin BK
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyRohit Dubey
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
Md. Salman Ahmed
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
Srinath Perera
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Hadi Fadlallah

What's hot (20)

Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
Data science presentation
Data science presentationData science presentation
Data science presentation
Data Science
Data ScienceData Science
Data Science
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
Data science
Data scienceData science
Data science
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
Data science Big Data
Data science Big DataData science Big Data
Data science Big Data
Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
Data Science
Data ScienceData Science
Data Science
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering

Viewers also liked

Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
Imry Kissos
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
Prof. Dr. Diego Kuonen
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
Daniel Tunkelang
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
Daniel Tunkelang
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
Xavier Amatriain
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013Philip Zheng
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
Sebastian Raschka
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Devashish Shanker
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
Varad Meru
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
Pier Luca Lanzi
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
David Pittman
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
NhatHai Phan
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
Owen Zhang
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
Si Haem
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Lars Marius Garshol
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentationlpaviglianiti
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics

Viewers also liked (20)

Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentation
Impact of big data on analytics
Impact of big data on analyticsImpact of big data on analytics
Impact of big data on analytics

Similar to Big Data [sorry] & Data Science: What Does a Data Scientist Do?

Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
Andrew Gardner
Thinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCThinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DC
TJ Stalcup
Data science
Data scienceData science
Data science
Sreejith c
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
TJ Stalcup
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
TJ Stalcup
How Your Data Can Predict The Future
How Your Data Can Predict The FutureHow Your Data Can Predict The Future
How Your Data Can Predict The Future
Becky Wang
Big data
Big dataBig data
Big data
Gaetan Lion
IIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data ScienceIIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data Science
intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data science
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Dr. Sunil Kr. Pandey
Predictive modelling with azure ml
Predictive modelling with azure mlPredictive modelling with azure ml
Predictive modelling with azure ml
Koray Kocabas
There's no such thing as big data
There's no such thing as big dataThere's no such thing as big data
There's no such thing as big data
Andrew Clegg
From Rocket Science to Data Science
From Rocket Science to Data ScienceFrom Rocket Science to Data Science
From Rocket Science to Data Science
Sanghamitra Deb
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
Semantic Web Company
Data mining
Data miningData mining
Data mining
Akannsha Totewar
Big Data vs. Small Data...what's the difference?
Big Data vs. Small Data...what's the difference?Big Data vs. Small Data...what's the difference?
Big Data vs. Small Data...what's the difference?
Anna Kuhn
How to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamHow to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data Team
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
Natalino Busa

Big Data [sorry] & Data Science: What Does a Data Scientist Do?

  • 1. Big Data [sorry] & Data Science: What Does a Data Scientist Do? Carlos Somohano Founder Data Science London @ds_ldn The Cloud and Big Data: HDInsight on Azure London 25/01/13
  • 2. Man on the Moon – 1969
  • 3. Man on the Moon – Small Data! Computer Program Apollo XI Man on the Moon Date: 1969 Speed: 3,500 km/hour Distance: 356,000 Km 64 Kb, 2Kb RAM, Fortran Weight: 13,500 kg Never been there before Must work 1st time Lots of complex data Must return to Earth
  • 4. Apollo XI, 1969 SkyDive Stratos, 2012 64 Kb Tens of Gigabytes Think About It – We live in Crazy Times!
  • 5. Big Data is not about Data Volume
  • 6. What is Big Data? IT mumbo-jumbo A fashionable term typically used by some IT vendors to remarket old-fashioned software & hardware
  • 7. What is Big Data? The n-Vs Volume … Variety … Velocity … (add your own V here…) So What?
  • 8. Change! Water Cooler Chat We need to parallelize data operations but it’s too costly & complex… The business can’t get access to all the relevant data, we need external data… We can’t match customer master data to live customer interactions… We can’t just force everything into a star-schema… These BI reports and charts don’t tell us anything we didn’t know… We are missing the ETL window, the data we needed didn’t arrive on time… We can’t predict with confidence if we can’t explore data & develop our own models
  • 9. What is Big Data? Force of Change Big Data forces you to change the way you collect, store, manage, analyze and visualize data
  • 11. Big Data = Crude Oil [not New Oil] Think data as ‘crude oil.’ Big Data is about extracting the ‘crude oil,’ transporting it in ‘mega-tankers,’ siphoning it through ‘pipelines,’ and storing it in massive ‘silos’… All ‘this’ is about IT Big Data… fine and well… … BUT
  • 12. You need to refine the ‘crude oil’ Enter Data Science…
  • 13. The Science [and Art] of… Discovering what we don’t know from data Obtaining predictive, actionable insight from data Creating Data Products that have business impact now Communicating relevant business stories from data Building confidence in decisions that drive business value
  • 14. Brief History of Data Science 6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism… 1974 – Peter Naur @UoC Datalogy Data Science 2001 – William S. Cleveland @CSU Data Science: An Action Plan …: 2002 – Committee on Data for Science and Technology (CODATA) 2003 – Journal of Data Science 2009 – Jeff Hammerbacher @Facebook What does a Data Scientist Do? 2010 – Drew Conway @NYU The Data Science Venn Diagram 2010 – Hilary Mason & Chris Wiggins @Dataists “ 2010 – Mike Loukides @O’Reilly “What is Data Science?” 2011 – DJ Patil @LinkedIn data scientist vs. data analyst
  • 15. Jeff Hammerbacher, 2009 “... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
  • 16. Mike Loukides, 2010 Data science enables the creation of data products. Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.
  • 17. Hilary Mason & Chris Wiggins, 2010 Data science is clearly a blend of the hackers’ arts, statistics and machine learning...; and the expertise in mathematics and the domain of the data for the analysis to be interpretable... It requires creative decisions and open-mindedness in a scientific context.
  • 19. DJ Patil, 2011 “We realized that as our organizations grew, we both had to figure out what to call the people on our teams. ‘Business analyst’ and ‘Data analyst’ seemed too limiting. The focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”
  • 20. What is a Data Scientist?
  • 21. The Duck – Billed Platypus The Data Scientist – Billed Platypus
  • 22. The Platypus – Billed Data Scientist Machine Learning Hacking Statistics Math Visualization Science Programming Data Mining The Data Scientist – Billed Platypus
  • 24. Class DataScientist { Is skeptical, curious. Has inquisitive mind Knows Machine Learning, Statistics, Probability Applies Scientific Method. Runs Experiments Is good at Coding & Hacking Able to deal with IT & Data Engineering Knows how to build data products Able to find answers to known unknowns Tells relevant business stories from data Has Domain Knowledge }
  • 25. What Does a Data Scientist Do?
  • 26. 10 Things [most] Data Scientists Do 1  Ask Good Questions. What is What… …we don’t know? …we’d like to know? 2  Define and Test a Hypothesis. Run experiments 3  Scoop, Scrape, Sink, Sample Business Relevant Data 4  Munge and Wrestle Data. Tame Data 5  Explore Data, Discover Data Playfully. Discover unknowns. 6  Model Data. Model Algorithms. 7  Understand Data Relationships 8  Tell the Machine How to Learn from Data 9  Create Data Products that Deliver Actionable Insight 10  Tell Relevant Business Stories from Data
  • 27. [Sort of a] Data Scientist Toolkit §  Java, R, Python… (bonus: Clojure, Haskell, Scala) §  Hadoop, HDFS & MapReduce… (bonus: Spark, Storm) §  HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog) §  ETL, Webscrapers, Flume, Sqoop… (bonus: Hume) §  SQL, RDBMS, DW, OLAP… §  Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas) §  D3.js, Gephi, ggplot2, Tableau, Flare, Shiny… §  SPSS, Matlab, SAS… (the enterprise man) §  NoSQL, MongoDB, Couchbase, Cassandra… §  And Yes! … MS-Excel: the most used, most underrated DS tool
  • 29. [Some] Data Science Principles 1  Socio-Technical Systems (STS) are complex! 2  Data is never at rest 3  Data is dirty, deal with it 4  SVoT = LOL! 5  Data munging & data wrestling: 70% of the time 6  Simplification. Reduction. Distillation 7  Curiosity. Empiricism. Skepticism
  • 30. Knowns & Unknowns “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.” Donald Rumsfeld
  • 31. DIKUW FTW! D I K U W Data Information Knowledge Understanding Wisdom PAST FUTURE Data Engineer Data Analyst Data Miner Data Scientist Raw What How to Why When Numbers Description Experience Cause Effect Prediction Letters Context Tested Proven What’s best Known Unknown Symbols Relationship Instruction Unknowns Unknowns Known Knowns Signals Reports Programs models
  • 32. Data Discovery Data Analyst Data Scientist The new reality for Business Intelligence and Big Data, Applied Data Labs
  • 33. Data Models vs. Algorithmic Models Data Modeling VS. Algorithmic Modeling Y ← F(X, random noise, parameters) Y ← Black Box ← X Random Forests We understand the world We don’t understand the world How well ‘my data model’ works The world produces data in a black-box Statisticians, Data Analysts, Data Miners Data Scientists Linear Regression Machine Learning, AI Neural Nets Logistic Regression Random Forests, SVM, GBT Known Distributions Unknown Multivariate Distributions Confidence Intervals Iterative Predictor Variables Goodness of Fit Predictive Accuracy “Statistical Modeling: The Two Cultures” Leo Breiman, 2001
  • 34. Learning from Data is Tricky Statistical vs. Machine Learning Supervised vs. Unsupervised Learning Induction vs. Deduction Sampling Confidence Intervals Probability Distribution Deviation Variance Correlation vs. Causation Causation Prediction
  • 35. More Data or Better Models? More Data Beats Better Algorithms, Omar Tawakol @BlueKai Better Algorithms Beat More Data, Mark Torrance @RocketFuel More Data or Better Models, Xavier Amatriain @Netflix On Chomsky & the Two Cultures of Statistical Learning, Peter Norvig @Google Specialist Knowledge is Useless & Unhelpful, Jeremy Howard @Kaggle
  • 36. Data Science Process – An approach
  • 37. Data Science Process - 1 1  Known Unknowns? 2  We’d like to know…? 3  Outcomes? 4  What Data? 5  Hypothesis? The World Ingest Raw Data Munch Data The Dataset Product Manufactured Transactions MapReduce Independency? Goods shipped Web-Scraping ETL, ELT Correlation? Product purchased Web-clicks logs Data Wrangle Covariance? Phone Calls Made Sensor Data Data Cleansing Causality? Energy Consumed Mobile Data Data Jujitsu Dimensionality? Fraud Committed Docs, Emails, XLS Dim Reduction Missing Values? Repair Requested Social Feeds, RSS Sample Relevant? System Flume Sink HDFS Select, Join, Bind
  • 38. Data Science Process - II The Dataset Explore Data Represent Data Discover Data Deliver Insight Learn From Data Data Product Visualize Insight Description Inference Objectives Data Algorithm Models Levers Actionable Machine Learning Modeling Predictive Networks Graphs Simulation Immediate Impact Regression Prediction Optimization Business Value Classification Clustering Visualization Easy to explain Experiments Iteration
  • 39. What is a Data Product?
  • 40. A Data Product Is… … Curated and crafted from raw data … A result of exploration and iterations … A machine that learns from data … An answer to known unknowns or unknown unknowns … A mechanism that triggers immediate business value … A probabilistic window of future events or behavior
  • 41. Data Jiu-Jitsu Data Jiu Jitsu Fight $$$$ Data Product Data Scientist Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value (DJ Patil @LinkedIn)
  • 42. Developing Data Products Objectives Levers Data Models What Outcome What Inputs Can What Data Can How the Levers Am I Trying to We Control? We Collect? Influence the Achieve? Objectives Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  • 43. Objective-Based Data Products What Outcome Am I Actionable Trying to Achieve? Outcome Data Modeler Simulator Optimizer The Model Assembly Line Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  • 44. 5 Great Data Products
  • 45. Customer Lifecycle Value Optimize CLV Product Recommendations Visualizer Data Modeler Simulator Optimizer 1  Products the customer may like 2  Price Elasticity 3  Probability of Purchase w/o Recommendation 4  Purchase Sequence 5  Causality Model 6  Patience Model Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  • 46. Automated Fruits Procurement Confirm Purchase Orders In less than 2 hours Safety Stock levels? Demand vs Stock? Price vs. Demand? 12,000 stores Anomalies? 300 Fruits Fruit Shortages? Avg. Shelf life 3 days Fruit Write-offs? Adapted from Blueyonder
  • 47. Strawberries & the Weather No sales vs X,XXX sales predicted Why these huge stock write-offs? A Predictive Model that calculates strawberry purchases based on Weather forecast Sudden increase in temperature Store temperature Freezer sensor data Remaining stock per shelf life Sales TPoS feeds Web searches, social mentions Adapted from Blueyonder
  • 48. Personalized Social Recommendations Collaborative Filtering: Matching Skills to People Prediction: Personalized Skills Recommendation Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
  • 49. Colas - In Which US State Do I Invest Mktg. $? What the Business Analyst Sent What the Data Scientist did…
  • 50. The Great Pop vs. Soda Page
  • 51. Pop vs. Soda vs. Coke
  • 52. Raw Data Will Drive Your Car
  • 53. Interested in Data Science? Join our community Follow us on Twitter @ds_ldn Check out our blog
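
The contrast on slide 33 between data modeling and algorithmic modeling can be sketched in a few lines of Python. This is a minimal illustration and not part of the original deck: the toy data and the names `fit_line` and `knn_predict` are hypothetical, and k-nearest neighbours is used here only as a simple stand-in for the "black box" family the slide lists (random forests, SVM, GBT).

```python
# Hypothetical toy data: y is roughly 2*x + 1 with small hand-picked "noise".
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]


def fit_line(xs, ys):
    """Data modeling: assume y = a*x + b and estimate a, b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def knn_predict(xs, ys, x_new, k=2):
    """Algorithmic modeling: assume no functional form, just average
    the targets of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


a, b = fit_line(xs, ys)
line_pred = a * 2.5 + b               # prediction from the fitted parameters
knn_pred = knn_predict(xs, ys, 2.5)   # prediction from neighbouring points

print(f"data model:        y = {a:.3f}*x + {b:.3f}, prediction at 2.5 = {line_pred:.3f}")
print(f"algorithmic model: prediction at 2.5 = {knn_pred:.3f}")
```

The fitted line yields interpretable parameters (a slope and an intercept) that can be checked for goodness of fit, while the neighbour-based predictor yields only predictions, which is why the slide pairs the second culture with predictive accuracy rather than fit.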