HUMAN CLONING
                           The Data Scientist bottleneck resolved
                                        Dr Alex Farquhar




Friday, 24 February 2012
[Chart: exabytes of data per year, 2008 to 2017; y-axis from 0 to 20,000 exabytes (IDC/EMC report 2008)]




By 2018, the United States alone could face a shortage of 140,000 to 190,000 data people...




WE’RE ALL DOOMED




DATA PEOPLE?




                                     © Drew Conway


MAYBE WE CAN JUST....



    • 1 statistician + 1 developer ≈ 1 data scientist?




HOW ABOUT....



    • 4 statisticians + 4 developers ≈ 4 data scientists?




WHAT CAN WE DO?


    • Train more new data scientists (not fast enough)

    • Cross-train people

    • Cobble together different skills in teams (see above)




WHAT CAN WE DO?



    • Do more work




DOING MORE

    • simplify (fob the work off)

    • automate (fob even more work off)

    • choose/build the right tools

    • parallelise

    • iterate



SIMPLIFY & AUTOMATE



    • Counting stuff is not much fun




SIMPLIFY & AUTOMATE



                                             Hive




                                 TSV files   Hadoop
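
The slide hands the counting off to Hive running over TSV files in Hadoop. As a rough local illustration of the same kind of work (a grouped count, comparable in spirit to a `SELECT col, COUNT(*) ... GROUP BY col` query), here is a minimal Python sketch. The sample data, the column layout, and the helper name `count_by_column` are assumptions made for illustration; they are not taken from the talk.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text, column):
    """Count rows per distinct value in one column of tab-separated text.

    Hypothetical helper: a local stand-in for a grouped count,
    not the Hive/Hadoop setup shown on the slide.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows that are too short to contain the requested column.
    return Counter(row[column] for row in reader if len(row) > column)


# Made-up sample data: user id <TAB> event type.
SAMPLE_TSV = "u1\tclick\nu2\tview\nu1\tview\nu3\tclick\nu1\tclick\n"

if __name__ == "__main__":
    # Count events per user (column 0), most frequent first.
    for user, n in count_by_column(SAMPLE_TSV, 0).most_common():
        print(user, n)
```

Once the count is a single function call (or a single query), nobody has to redo it by hand for each new question.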

AUTOMATE / PARALLELISE
                           magic




                           Hadoop




                             Job



AUTOMATE / PARALLELISE
                                      magic




                                     Hadoop



                               Lots of jobs at once
                           Job 1   Job 2   Job 3   Job 4
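
The "lots of jobs at once" idea can be sketched locally with Python's standard `concurrent.futures` module. This is only an analogy under stated assumptions: `run_job` is a hypothetical stand-in for whatever a real job does, the job names and inputs are made up, and a thread pool on one machine is not how Hadoop schedules work across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(name, numbers):
    """Hypothetical stand-in for one job: sum a list of numbers."""
    return name, sum(numbers)


def run_all(jobs, max_workers=4):
    """Submit every independent job at once and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_job, name, data) for name, data in jobs.items()]
        # result() blocks until each job has finished.
        return dict(f.result() for f in futures)


# Made-up inputs for four independent jobs.
JOBS = {"job1": [1, 2, 3], "job2": [10, 20], "job3": [5], "job4": []}

if __name__ == "__main__":
    print(run_all(JOBS))
```

The jobs share no state, which is what makes it safe to submit them together and wait for all of the results.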

TOOLS



    • something that allows fast iteration, i.e. not Java

    • R, Ruby, Python




PARALLELISE




ITERATE


    • try different things

    • improve what works

    • dump what doesn’t

    • constant improvement & learning → get faster



WE’RE NOT ALL DOOMED




"Human Cloning: The Data Scientist Bottleneck Resolved" Dr. Alex Farquhar @ds_ldn
