BigML Inc, 2012   Geneva, October 12, 2012
Agenda


 • Short intro
 • The Big Data Revolution
 • What is BigML?
 • Behind the scenes
 • Coming down the pike
 • Hacking with the BigML API
Francisco J Martin
Background:
• 5-year degree in Computer Science, UPV
• Ph.D. in Artificial Intelligence, UPC
• Postdoc (Machine Learning), Oregon State University
• Founder and CEO at iSOCO
• Founder and CEO at Strands
• Co-authored 6 patents acquired by Apple Inc
• Directly raised $75+MM in venture capital and cashed out additional $18+MM for early investors
• Directly sold and negotiated $30+MM in licenses

BigML:
• Co-founder and CEO
• Joined: January 2011
• Tasks: Product conceptualization, design, and architecture
• Develops: BigML middle-end and public API
• 1202 (19%) of commits to total BigML code base
Academia vs the Real-world


"Neo, sooner or later you're going to realize, just as I did, that there's a difference between knowing the path, and walking the path"
Walking the data path
[Figure: career timeline along a rising "Data" axis]

1996-1999  Academia      8-queen problem, Multi-agent Learning
1999-2002  iSOCO         Personalization, E-commerce
2002-2004  Academia      Machine Learning, Intrusion Detection
2004-2011  Strands Inc   Recommender Systems (music, video, fitness, finance)
2011-2012  BigML Inc     Large-scale Machine Learning (everything)
BigML Status
• Founded in Jan 2011
• 9 FTE, 1 PT
• 5 Ph.Ds
• 4 patent applications
• Advisors and BA:
   US Patent Application No. 61/555,615
   For: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES
   Filed: November, 2011

   US Patent Application No. 61/557,826
   For: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT
   Filed: November, 2011

   US Patent Application No. 61/557,539
   For: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS
   Filed: November, 2011

   US Patent Application No. 61/710,175
   For: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE
   Filed: October, 2012


From the trenches




                       Beneath Hill 60             BigML Team




Big Data
 • What is Big Data?
 • What is a Data Scientist?
 • How not to start with Big Data?
 • What is Data-driven Decision Making?
Trends




                                              http://strata.oreilly.com/2011/08/building-data-startups.html



What’s Big Data?

Big Data means way too many different things to many different people

“when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” George Dyson
What’s Big Data?
The 3 V’s
 • Volume (big, enormous, huge, vast, immense, very large, etc.)
 • Variety (heterogeneous, diverse, complex, multiple sources, sensors, etc.)
 • Velocity (speed, dynamic, real-time, streamed, etc.)

The 3 I’s
 • Immediate: in the sense that you need to do something about it
 • Intimidating: what if you do not?
 • Ill-defined: what is it, anyway?

Data matters!!!
Machine Learning
Even if we, human beings, are learning machines, we are really bad at processing small amounts of data

Machines are good at quickly processing huge amounts of data. Machine Learning can make them learn from data
It’s all about machine learning

Forget plastics. It’s all about machine learning
http://www.youtube.com/watch?v=PSxihhBzCjk

"It's as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore's Law and the cloud mixed in with Machine Learning finally makes it all possible." Jeff Bussgang
Learning from Data
Unknown Model
   f : X -> Y
   Example: ideal credit approval formula

Training Examples
   (x1, l1), (x2, l2), ..., (xN, lN)
   (a table with rows x1 ... xN and columns f1, f2, ..., fn, label)
   Example: historical records of credit customers

Models
   M
   Example: set of candidate credit approval formulas

Learning Algorithm
   (takes the training examples and the models M, and outputs the final model)

Final Model
   g ~ f
   Example: learned credit approval formula

Based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin
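
The learning setup on this slide can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the slides: the single "income" feature, the four training records, and the candidate threshold rules standing in for the model set M. The learning algorithm is simply "pick the candidate with the lowest training error".

```python
from typing import Callable, List, Tuple

# A training example (x, l): x is a hypothetical applicant income,
# l is True if credit was approved in the (invented) historical records.
Example = Tuple[float, bool]
Hypothesis = Callable[[float], bool]


def make_threshold_rule(threshold: float) -> Hypothesis:
    """Candidate formula: approve when income is at least `threshold`."""
    def rule(income: float) -> bool:
        return income >= threshold
    return rule


def training_error(h: Hypothesis, examples: List[Example]) -> float:
    """Fraction of training examples that h labels incorrectly."""
    wrong = sum(1 for x, label in examples if h(x) != label)
    return wrong / len(examples)


def learn(models: List[Hypothesis], examples: List[Example]) -> Hypothesis:
    """Learning algorithm: return the model g in M with the lowest training error."""
    return min(models, key=lambda h: training_error(h, examples))


# Invented toy data and candidate set, for illustration only.
examples: List[Example] = [(20.0, False), (35.0, False), (50.0, True), (80.0, True)]
models = [make_threshold_rule(t) for t in (10.0, 30.0, 40.0, 70.0)]

g = learn(models, examples)            # final model g, hopefully close to f
print(training_error(g, examples))     # error of g on the training examples
print(g(45.0))                         # prediction for a new applicant
```

A low training error does not by itself show that g approximates the unknown f; that still has to be checked on data the algorithm has not seen.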

What’s Big Machine Learning?

Volume: What to do when data is too big to fit within the system memory of a single computer?
   -> Large-scale machine learning

Variety:
   -> Clean, refine, update, join, merge, aggregate, structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format

Velocity:
   -> Stream Algorithms
Machine Learning
                  ...or you can deal with that!




Does More Data beat Better Algorithms?

 • More examples
 • More features

References:
 • The Unreasonable Effectiveness of Data
 • More Data or Better Models. Xavier Amatriain
What’s Big Data?

"Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future." Francisco J Martin

Data: Past, Present, Future
Is Wikipedia right?




           Really? Seriously?? Are you kidding me???
Data can’t be wrong?




McKinsey can’t be wrong
Critical Shortage Of “Data Scientist” Talent Predicted By 2018
HBR can’t be wrong




Wikipedia is right!




If Data Scientists don’t exist can they be created?
The first Data Scientist

[Figure: three overlapping roles: Computer Scientist, Statistician, Mathematician]

Hans’ brain, the first Data Scientist
The magic formula




A data scientist is “part analyst, part artist.”
Anjul Bhambhri, Vice President of Big Data Products at IBM
Are Data Scientists super heroes?




The most powerful human super hero




   http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html
Are Data Scientists super heroes?

Events          Decathlon World Record   High school World Record   World Record
100 m           10.21                    10.08                      9.58
Long Jump       8.23 m                   8.16 m                     8.95 m
Shot Put        14.20 m                  20.65 m                    23.12 m
High Jump       2.05 m                   2.31 m                     2.45 m
400 m           46.70                    44.69                      43.18
110 m hurdles   13.70                    13.74                      12.80
Discus throw    42.81 m                  61.38 m                    74.08 m
Pole Vault      5.30 m                   5.56 m                     6.14 m
Javelin Throw   58.87 m                  73.74 m                    98.48 m
1500 m          4:14.48                  3:38.26                    3:26.00
The Wikipedia is always right!




BigML’s Data Science Team
[Figure: team members placed around the skill areas they cover]

Skill areas:
 • UI
 • Design
 • Visualization
 • Infrastructure, Cloud-based Computing
 • Product Design
 • Business and Common Sense
 • Architecture, Software Design, Distributed Systems
 • Large-scale and learning algorithm implementation
 • Machine Learning Research

Team members:
 • Oscar Rovira, MSc*
 • Bea Garcia, BSc
 • Justin Donaldson, Ph.D.
 • Francisco J Martin, PhD
 • Jos Verwoerd, MSc
 • Poul Petersen, MSc
 • Jao, PhD
 • Charles Parker, PhD
 • Adam Ashenfelter, MSc
 • Tom Dietterich, PhD
Take Away

[Figure: the BigML team members, as on the previous slide]

So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc. who are fabulous team players
Iris Dataset




                                                http://en.wikipedia.org/wiki/Iris_flower_data_set

Digesting Big Data
[Figure: the stages of digesting data, from bottom to top]

Almost no attention!!!
 • Assimilation (making insights actionable)
 • Absorption (deriving insights)

Too much attention!!!
 • Digestion (processing)
 • Ingestion (capturing and storing)

Egestion (reject bad data, wrong insights)
Big Data meets Hadoop
 • Hadoop has been excessively promoted as the way to make Big Data problems easy.
 • There are quite a few vendors pushing different Hadoop flavors to the market.

However, Hadoop is complex, slow, expensive and batch
Big Data and Hadoop
  Running Hadoop on a cluster - The New IT sport of 2012




Real-Time Hadoop?




                  Really? Seriously?? Are you kidding me???
Why not Hadoop?
Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger

 • Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB).
 • Iterative machine learning algorithms do not map trivially to MapReduce.
 • Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with hundreds of GB of DRAM.
 • In terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
Rowstron, A. et al, Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012

 • Hadoop is bad at iterative algorithms: high job startup costs, and it is awkward to retain state across iterations.
 • High sensitivity to skew: iteration speed is bounded by the slowest task.
 • Potentially poor cluster utilization: must shuffle all data to a single reducer.
Large-Scale Machine Learning at Twitter, Jimmy Lin
Making Big Data Small
Hadoop          Streaming Algorithms
 • Complex       • Simple
 • Slow          • Fast
 • Batch         • Real-time
 • Expensive     • Cheap

Noel Welsh, Strata conference, London, October 2012
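
To make the "Streaming Algorithms" column concrete, here is a minimal sketch of one classic one-pass algorithm: Welford's running mean and variance. It is an illustration chosen for this write-up, not something shown in the talk. The class name `RunningStats` is hypothetical. The point is that each value is seen once and only three numbers are kept in memory, however long the stream is.

```python
class RunningStats:
    """One-pass (streaming) mean and population variance, Welford's method."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the summary; the value itself is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far (0.0 for an empty stream)."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.count, stats.mean, stats.variance)
```

Because the summary is updated per value, an answer is available at any moment, which is what makes this style of algorithm a fit for real-time use.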


Self-imposed Shackles
"Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well."
http://www.selfgrowth.com/articles/Martinez1.html

Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project
Starting with Big Data
How not to start:
 • Buy a few machines and set up a cluster.
 • Install and run any flavor of Hadoop.
 • Figure out how to implement complex map-reduce algorithms to compute a few analytics.

A better way to start:
 • Start with a very small data sample.
 • Use free or cloud-based tools to build a first predictive model that you can understand.
 • Check if the model gives you any practical insight.
 • Use the model to generate predictions and see if it can improve your performance.
 • Check how more data can improve the model.
 • Check if more sophisticated models can beat your model.
 • Iterate.
 • Check if the volume, variety, and velocity of your data require a behind-the-firewall/cloud solution or a batch/stream solution.
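
The first step of the recipe, starting with a very small data sample, does not require loading the full data set. A minimal sketch using reservoir sampling (Algorithm R) follows. The technique and the function name `reservoir_sample` are choices made here for illustration; the slides do not prescribe a sampling method.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream in one pass.

    Only k items are held in memory, so the stream can be far larger than RAM.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical usage: in practice `stream` would be the rows of a large file.
small_sample = reservoir_sample(range(1_000_000), k=100, seed=7)
print(len(small_sample))
```

The resulting sample can then be fed to whichever free or cloud-based modeling tool is used for the first model.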
Data-Driven Decisions
Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people

http://www.nytimes.com/2011/04/24/business/24unboxed.html
The “HiPPO” (Highest Paid Person’s Opinion) is dead




Predictive Analytics




 • Descriptive Analytics: traditional, backward-looking business analytics
 • Predictive Analytics: Machine Learning
Predictive Model


“The goal of a predictive model is not to predict the future but to help you make a better decision in the present”
Taken from Paul Saffo, HBR
Data-Driven Decision Making




Analytics and Predictive Analytics combined with Experience & Intuition
It’s time to switch the attention

[Figure: the stages of digesting data, from bottom to top]

More attention!!!
 • Assimilation (making insights actionable)
 • Absorption (deriving insights)

Less attention!!!
 • Digestion (processing)
 • Ingestion (capturing and storing)

Egestion (reject bad data, wrong insights)

More focus on the models and how to operationalize them than on the infrastructure to generate them
Take aways
 • Big Data is just data
 • It’s all about machine learning
 • Try to excel in one of the data science disciplines
 • Don’t shackle yourself to the wrong platform
 • Trying to predict the future can help you make the right decision in the present
 • Focus on evaluation and actionability of models and not on how they are built
BigML Goal


    Highly Scalable, Cloud-based Machine Learning Service
    Simple, Easy-to-Use and Seamless-to-Integrate
BigML vs ML

 • You can deal with this... (BigML 1-click model)
 • ...or you can deal with that!
BigML vs Big Data

 • You can deal with this... (BigML 1-click model)
 • ...or you can deal with that!
How it Works




Machine Learning Made Easy

                                            True




Simple is not easy



“Any fool can make something complicated. It takes a genius to make it simple.”
- Woody Guthrie
Fully Web based




RESTful API




Agenda


 • Short intro
 • The Big Data Revolution
 • What is BigML? - Demo
 • Behind the scenes
 • Coming down the pike
 • Hacking with the BigML API
BigML’s Software Architecture
 • Front-end: [Neutronia], [Medusa], [CuriousYellow], [Sky]
 • Middle-end: [Apian]
 • Backend: [Wintermute]
 • Infrastructure: [Sauron] (Boto, Fabric)
BigML’s AWS-based Architecture




Why Tree Models?




     • Highly scalable
     • Graphically representable and interactive
     • Easily understandable
     • Easily translatable into rules, PMML, and code
     • Easily upgradable with ensembles: boosting, bagging, random forests, etc.
     • Top performers! http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
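The bullet on translating trees into rules can be illustrated with a minimal sketch. The nested-dict tree layout, the field names, and the thresholds below are all invented for illustration; they are not BigML's model format.

```python
# Hypothetical tree: internal nodes test "field <= threshold", leaves carry "predict".
# Field names and thresholds are made up for this example.
tree = {
    "field": "petal_length", "threshold": 2.5,
    "left": {"predict": "setosa"},
    "right": {
        "field": "petal_width", "threshold": 1.7,
        "left": {"predict": "versicolor"},
        "right": {"predict": "virginica"},
    },
}


def tree_to_rules(node, conditions=()):
    """Return one IF/THEN rule string per leaf, depth first, left to right."""
    if "predict" in node:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN {node['predict']}"]
    field, threshold = node["field"], node["threshold"]
    left = tree_to_rules(node["left"], conditions + (f"{field} <= {threshold}",))
    right = tree_to_rules(node["right"], conditions + (f"{field} > {threshold}",))
    return left + right


rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Each root-to-leaf path becomes one rule, which is why a tree reads naturally as a rule list or as nested if/else code.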
BigML Histograms
 BigML's trees and dataset summaries use histograms with the following traits:

 • Streaming: data is never kept in memory; a single pass over the data is enough to capture the distribution.
 • Memory constrained: the less memory allocated, the lossier the compressed distribution.
 • Dynamic: the histogram bins adjust themselves as they observe the data.
 • Robust to ordered data: so it works even if the data stream is non-stationary.
 • Merge friendly: for parallelization and distribution.
 • More: http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
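The traits listed above can be sketched with a small fixed-size histogram in the spirit of the Ben-Haim and Tom-Tov streaming histogram. This is an illustrative sketch only: the class name, the `max_bins` parameter, and the closest-centroid merge rule are assumptions, not BigML's actual implementation.

```python
import bisect


class StreamingHistogram:
    """Fixed-size histogram of [centroid, count] bins, kept sorted by centroid."""

    def __init__(self, max_bins=32):
        self.max_bins = max_bins  # memory budget: fewer bins means a lossier summary
        self.bins = []

    def update(self, value, count=1):
        """Insert one point as its own bin, then shrink back to max_bins."""
        bisect.insort(self.bins, [float(value), count])
        self._compress()

    def merge(self, other):
        """Fold another histogram into this one (e.g. results from parallel workers)."""
        self.bins = sorted(self.bins + [list(b) for b in other.bins])
        self._compress()

    def _compress(self):
        # Repeatedly merge the two adjacent bins whose centroids are closest.
        while len(self.bins) > self.max_bins:
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def total(self):
        return sum(n for _, n in self.bins)

    def mean(self):
        return sum(c * n for c, n in self.bins) / self.total()


# One pass over the data, four bins of memory.
hist = StreamingHistogram(max_bins=4)
for v in range(100):
    hist.update(v)

# Two workers summarize halves of the stream, then merge.
left, right = StreamingHistogram(max_bins=4), StreamingHistogram(max_bins=4)
for v in range(50):
    left.update(v)
for v in range(50, 100):
    right.update(v)
left.merge(right)
```

Because merged bins keep a count-weighted centroid, the total count and the mean survive compression, while finer detail of the distribution is lost as `max_bins` shrinks.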



BigML Streaming Trees
  BigML's trees are:

  • CART: Classification & Regression Trees.
  • Grown breadth first: so partial trees are meaningful.
  • Built Hoeffding-style: so they consume streaming data and can split "early".
  • Friendly for parallelization: can work over multiple cores or multiple computers.
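As a rough illustration of what "Hoeffding-style" early splitting usually means, the sketch below applies the standard Hoeffding bound to the gap between the two best candidate splits. The function names, the default `delta`, and the use of information gain as the compared quantity are assumptions for illustration; the slides do not say how BigML applies the bound.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the mean of n samples of a
    variable with the given range is within epsilon of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_gain, second_gain, n, n_classes=2, delta=1e-6):
    """True when the best split beats the runner-up by more than epsilon."""
    value_range = math.log2(n_classes)  # information gain lies in [0, log2(classes)]
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# The same observed gap becomes trustworthy once enough rows have streamed by.
print(can_split_early(0.30, 0.25, n=1_000))   # gap 0.05 vs epsilon of about 0.083
print(can_split_early(0.30, 0.25, n=10_000))  # gap 0.05 vs epsilon of about 0.026
```

The bound shrinks with the square root of the number of rows seen, so a clear winner can be committed to long before the whole stream has been read.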

Growing a Streaming Tree

 • Each split breaks the data into subsets.

 • The split should make the subsets as distinct from one another as possible.

 • Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
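The two split criteria named above can be written out directly. This is a minimal sketch using the textbook definitions (Shannon entropy in bits, sum of squared deviations from the mean); the function names are illustrative and both subsets are assumed to be non-empty.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two subsets."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def squared_error(values):
    """Sum of squared deviations from the subset mean (regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


parent = ["a"] * 4 + ["b"] * 4
clean_gain = information_gain(parent, ["a"] * 4, ["b"] * 4)            # subsets fully distinct
useless_gain = information_gain(parent, ["a", "a", "b", "b"], ["a", "a", "b", "b"])

# Regression: a good split leaves little squared error inside each subset.
good_split_error = squared_error([1.0, 1.0, 1.0]) + squared_error([9.0, 9.0, 9.0])
bad_split_error = squared_error([1.0, 9.0, 1.0]) + squared_error([9.0, 1.0, 9.0])
```

A split that separates the classes completely recovers the full parent entropy as gain, while a split that leaves both subsets as mixed as the parent gains nothing.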

Distributed Streaming Trees




                                           




Streaming Trees - Early Splits




Agenda


 • Short intro
 • The Big Data Revolution
 • What is BigML?
 • Behind the scenes
 • Coming down the pike
 • Hacking with the BigML API

Automatic Evaluations




A marketplace for predictive models




Simple is not easy



       “Any fool can make something complicated. It
       takes a genius to make it simple.”
                                          ― Woody Guthrie




Machine Learning Made Easy

                                            True




Agenda


 • Short intro
 • The Big Data Revolution
 • Demo
 • Behind the scenes
 • Coming down the pike
 • Hacking with the BigML API

Back to the trenches




 Gallipoli

Good Reading
 Big Data Trends. David Feinleib
 http://www.slideshare.net/bigdatalandscape/big-data-trends

 Hey Graduates: Forget Plastics - It's All About Machine Learning. Jeff Bussgang
 http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html

 More Data or Better Models. Xavier Amatriain
 http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html

 Making Big Data Small. Noel Welsh
 http://strataconf.com/strataeu/public/schedule/detail/25984

 Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz
 http://gigaom.com/2012/02/18/data-killed-the-hippo-star/

 When There’s No Such Thing as Too Much Information. Steve Lohr
 http://www.nytimes.com/2011/04/24/business/24unboxed.html

 Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas
 http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

 Six Rules for Effective Forecasting. Paul Saffo
 http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf

 Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz
 http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

BigML's take on Big Data

  • 1. BigML Inc, 2012 Geneva, October 12, 2012
  • 2. Agenda ·•Short intro ·•The Big Data Revolution ·•What is BigML? ·•Behind the scenes ·•Coming down the pike ·•Hacking with the BigML API BigML Inc, 2012 Geneva, October 12, 2012 2
  • 3. Francisco J Martin Background: • 5-year degree in Computer Science, UPV • Ph.D. in Artificial Intelligence, UPC • Postdoc (Machine Learning), Oregon State University • Founder and CEO at iSOCO • Founder and CEO at Strands • Co-authored 6 patents acquired by Apple Inc • Directly raised $75+MM in venture capital and cashed out additional $18+MM for early investors • Directly sold and negotiated $30+MM in licenses BigML: • Co-founder and CEO • Joined: January 2011 • Tasks: Product conceptualization, design, and architecture • Develops: BigML middle-end and public API • 1202 (19%) of commits to total BigML code base BigML Inc, 2012 Geneva, October 12, 2012 3
  • 4. Academia vs the Real-world Neo, sooner or later you're going to realize, just as I did, that there's a difference between knowing the path, and walking the path BigML Inc, 2012 Geneva, October 12, 2012 4
  • 5. Walking the data path Large-scale Machine Learning Recommender Systems Everything Machine Learning Personalization Music, video, Multi-agent fitness, finance Learning Intrusion Detection E-commerce ata D 8-queen problem 1996 1999 2002 2004 2011 2012 Academia iSOCO Academia Strands Inc BigML Inc BigML Inc, 2012 Geneva, October 12, 2012 5
  • 6. BigML Status ·•Founded in Jan 2011 ·•9 FTE, 1 PT ·•5 Ph.Ds ·•4 patent applications ·•Advisors and BA: US Patent Application No. 61/555,615 For: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES Filed: November, 2011 US Patent Application No. 61/557,826 For: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT Filed: November, 2011 US Patent Application No. 61/557,539 For: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS Filed: November, 2011 US Patent Application No. 61/710,175 For: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE Filed: October, 2012 BigML Inc, 2012 Geneva, October 12, 2012 6
  • 7. From the trenches Beneath Hill 60 BigML Team BigML Inc, 2012 Geneva, October 12, 2012 7
  • 8. Agenda ·•Short intro ·•The Big Data Revolution ·•What is BigML? ·•Behind the scenes ·•Coming down the pike ·•Hacking with the BigML API BigML Inc, 2012 Geneva, October 12, 2012 8
  • 9. Big Data What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making? BigML Inc, 2012 Geneva, October 12, 2012 9
  • 10. Trends http://strata.oreilly.com/2011/08/building-data-startups.html BigML Inc, 2012 Geneva, October 12, 2012 10
  • 11. What’s Big Data? Big Data means way too many different things to many different people “when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” George Dyson BigML Inc, 2012 Geneva, October 12, 2012 11
  • 12. What’s Big Data? The 3 v’s The 3 I’s Volume Immediate (big, enormous, huge, vast, immense, very In the sense that you need to do something large, etc) about it Variety Intimidating (heterogenous, diverse, complex, multiple What if you do not? sources, sensors, etc) Velocity Ill-defined (speed, dynamic real-time, streamed, etc) What is it? Anyway? BigML Inc, 2012 Data matters!!! Geneva, October 12, 2012 12
  • 13. Machine Learning Even if we, human beings, are learning machines, we are really bad at processing small amounts of data Machines are good at quickly processing huge amounts of data. Machine Learning can make them learn from data BigML Inc, 2012 Geneva, October 12, 2012 13
  • 14. It’s all about machine learning Forget plastics. It’s all about machine learning http://www.youtube.com/watch?v=PSxihhBzCjk It's as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore's Law and the cloud mixed in with Machine Learning finally makes it all possible. --- Jeff Bussgang BigML Inc, 2012 Geneva, October 12, 2012 14
  • 15. Learning from Data Unknown Model f : X -> Y Example: ideal credit approval formula f1 f2 fn label x1 Training Examples (x1, l1), (x2, l2), ..., (xN, lN) xN Example: historical records of credit customers Models Final Model M Learning g~f Example: set of candidate Algorithm Example: learned credit credit approval formulas approval formula Based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin BigML Inc, 2012 Geneva, October 12, 2012 15
  • 16. What’s Big Machine Learning? Volume Large-scale machine What to do when data is too big to fit within the system memory of a single computer? learning Clean, refine, update, join, merge, aggregate, Variety structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format Velocity Stream Algorithms BigML Inc, 2012 Geneva, October 12, 2012 16
  • 17. Machine Learning ...or you can deal with that!
  • 18. Does More Data beat Better Algorithms? More features More examples The Unreasonable Effectiveness of Data More Data or Better Models. Xavier Amatriain
  • 19. What’s Big Data? Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future. --- Francisco J Martin Data Past Present Future
  • 20. Big Data What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
  • 21. Is Wikipedia right? Really? Seriously?? Are you kidding me???
  • 22. Data can’t be wrong?
  • 23. McKinsey can’t be wrong Critical Shortage Of “Data Scientist” Talent Predicted By 2018
  • 24. HBR can’t be wrong
  • 25. Wikipedia is right!
  • 26. If Data Scientists don’t exist can they be created?
  • 27. The first Data Scientist
      Computer Scientist, Statistician, Mathematician
      Hans’ brain, the first Data Scientist
  • 28. The magic formula
      A data scientist is “part analyst, part artist.” Anjul Bhambhri, Vice President of Big Data Products at IBM
  • 29. Are Data Scientists super heroes?
  • 30. The most powerful human super hero http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html
  • 31. Are Data Scientists super heroes?
      Events           Decathlon World Record   High school World Record   World Record
      100 m            10.21                    10.08                      9.58
      Long Jump        8.23 m                   8.16 m                     8.95 m
      Shot Put         14.20 m                  20.65 m                    23.12 m
      High Jump        2.05 m                   2.31 m                     2.45 m
      400 m            46.70                    44.69                      43.18
      110 m hurdles    13.70                    13.74                      12.80
      Discus throw     42.81 m                  61.38 m                    74.08 m
      Pole Vault       5.30 m                   5.56 m                     6.14 m
      Javelin Throw    58.87 m                  73.74 m                    98.48 m
      1500 m           4:14.48                  3:38.26                    3:26.00
  • 32. The Wikipedia is always right!
  • 33. BigML’s Data Science Team
      Skills: UI Design; Visualization; Product Design; Common Sense; Business; Infrastructure, Cloud-based Computing; Architecture, Software Design, Distributed Systems; Large-scale learning algorithm implementation; Machine Learning Research
      Team: Oscar Rovira, MSc*; Bea Garcia, BSc; Justin Donaldson, Ph.D.; Francisco J Martin, PhD; Jos Verwoerd, MSc; Poul Petersen, MSc; Jao, PhD; Charles Parker, PhD; Adam Ashenfelter, MSc; Tom Dietterich, PhD
  • 34. Take Away
      Oscar Rovira, MSc*; Bea Garcia, BSc; Justin Donaldson, Ph.D.; Francisco J Martin, PhD; Jos Verwoerd, MSc; Poul Petersen, MSc; Jao, PhD; Charles Parker, PhD; Adam Ashenfelter, MSc; Tom Dietterich, PhD
      So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc., who are fabulous team players.
  • 35. Big Data What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
  • 36. Iris Dataset http://en.wikipedia.org/wiki/Iris_flower_data_set
  • 37. Digesting Big Data
      Ingestion (capturing and storing) - Too much attention!!!
      Digestion (processing)
      Absorption (deriving insights)
      Assimilation (making insights actionable) - Almost no attention!!!
      Egestion (reject bad data, wrong insights)
  • 38. Big Data meets Hadoop
      • Hadoop has been excessively promoted as the way to make Big Data problems easy.
      • There are quite a few vendors pushing different Hadoop flavors to the market.
      However, Hadoop is complex, slow, expensive, and batch-oriented.
  • 39. Big Data and Hadoop Running Hadoop on a cluster - The New IT sport of 2012
  • 40. Real-Time Hadoop? Really? Seriously?? Are you kidding me???
  • 41. Why not Hadoop?
      Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger.
      • Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB).
      • Iterative machine learning algorithms do not map trivially to MapReduce.
      • Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM.
      • In terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
      Rowstron, A. et al, Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012
      • Hadoop is bad at iterative algorithms: high job startup costs, and it is awkward to retain state across iterations.
      • High sensitivity to skew: iteration speed is bounded by the slowest task.
      • Potentially poor cluster utilization: must shuffle all data to a single reducer.
      Large-Scale Machine Learning at Twitter, Jimmy Lin
  • 42. Making Big Data Small
      Hadoop          Streaming Algorithms
      • Complex       • Simple
      • Slow          • Fast
      • Batch         • Real-time
      • Expensive     • Cheap
      Noel Welsh, Strata conference, London, October 2012
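As an illustration of why streaming algorithms can be simple, fast, and cheap, here is a minimal sketch of one well-known streaming computation: a running mean and variance updated one value at a time (Welford's method). This particular algorithm is an assumption chosen for illustration and is not taken from the talk. It reads each value once and keeps only three numbers in memory, however long the stream is.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one value from the stream."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the values seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```

Because the state is constant-size, the estimate is available at any moment while data keeps arriving, which is the real-time property in the right-hand column.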
  • 43. Self-imposed Shackles
      “Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well.” http://www.selfgrowth.com/articles/Martinez1.html
      Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project
  • 44. Starting with Big Data
      Not like this:
      • Buy a few machines and set up a cluster.
      • Install and run any flavor of Hadoop.
      • Figure out how to implement complex map-reduce algorithms to compute a few analytics.
      Like this instead:
      • Start with a very small data sample.
      • Use free or cloud-based tools to build a first predictive model that you can understand.
      • Check if the model gives you any practical insight.
      • Use the model to generate predictions and see if it can improve your performance.
      • Check how more data can improve the model.
      • Check if more sophisticated models can beat your model.
      • Iterate.
      • Check if the volume, variety, and velocity of your data require a behind-the-firewall/cloud solution or a batch/stream solution.
  • 45. Big Data What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
  • 46. Data-Driven Decisions Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people http://www.nytimes.com/2011/04/24/business/24unboxed.html
  • 47. The “HiPPO” (Highest Paid Person’s Opinion) is dead
  • 48. Predictive Analytics
      Descriptive Analytics: Traditional, backward-looking business analytics
      Predictive Analytics: Machine Learning
  • 49. Predictive Model “The goal of a predictive model is not to predict the future but to help you make a better decision in the present” Taken from Paul Saffo, HBR
  • 50. Data-Driven Decision Making Analytics and Predictive Analytics combined with Experience & Intuition
  • 51. It’s time to switch the attention
      Ingestion (capturing and storing) - Less attention!!!
      Digestion (processing)
      Absorption (deriving insights)
      Assimilation (making insights actionable) - More attention!!!
      Egestion (reject bad data, wrong insights)
      More focus on the models and how to operationalize them than on the infrastructure to generate them
  • 52. Take aways
      • Big Data is just data
      • It’s all about machine learning
      • Try to excel in one of the data science disciplines
      • Don’t shackle yourself to the wrong platform
      • Trying to predict the future can help you make the right decision in the present
      • Focus on evaluation and actionability of models and not on how they are built
  • 53. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  • 54. BigML Goal
      Highly Scalable, Cloud-based Machine Learning Service
      Simple, Easy-to-Use and Seamless-to-Integrate
  • 55. BigML vs ML You can deal with this... ...or you can deal with that! BigML 1-click model
  • 56. BigML vs Big Data You can deal with this... ...or you can deal with that! BigML 1-click model
  • 57. How it Works
  • 58. Machine Learning Made Easy True
  • 59. Simple is not easy “Any fool can make something complicated. It takes a genius to make it simple.” ― Woody Guthrie
  • 60. Fully Web based
  • 61. RESTful API
  • 62. Agenda • Short intro • The Big Data Revolution • What is BigML? - Demo • Behind the scenes • Coming down the pike • Hacking with the BigML API
  • 63. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  • 64. BigML’s Software Architecture
      Front-end: [Neutronia] [Medusa] [CuriousYellow] [Sky]
      Middle-end: [Apian]
      Backend: [Wintermute]
      Infrastructure: [Sauron] Boto, Fabric
  • 65. BigML’s AWS-based Architecture
  • 66. Why Tree Models?
      • Highly scalable
      • Graphically representable and interactive
      • Easily understandable
      • Easily translatable into rules, PMML, and code.
      • Easily upgradable with ensembles: boosting, bagging, random forests, etc.
      • Top performers! http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
  • 67. BigML Histograms
      BigML's trees and dataset summaries use histograms with the following traits:
      Streaming: Data is never kept in memory but needs only one pass over the data to capture the distribution.
      Memory constrained: The less memory allocated, the lossier the compressed distribution.
      Dynamic: The histogram bins adjust themselves as they observe the data.
      Robust to ordered data: So it works even if the data stream is non-stationary.
      Merge friendly: For parallelization and distribution.
      More... http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
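The traits listed on this slide can be illustrated with a minimal sketch of a fixed-size streaming histogram. This is an assumption-laden illustration, not BigML's implementation: the class name, the `max_bins` parameter, and the rule of merging the two closest bins are hypothetical choices made here to show how a histogram can be streaming, memory constrained, dynamic, and merge friendly at once.

```python
import bisect


class StreamingHistogram:
    """Keeps at most max_bins [center, count] pairs, sorted by center."""

    def __init__(self, max_bins=32):
        self.max_bins = max_bins  # memory budget: fewer bins means a lossier summary
        self.bins = []

    def insert(self, value, count=1):
        """Consume one value (single pass; the raw value is not stored)."""
        centers = [b[0] for b in self.bins]
        i = bisect.bisect_left(centers, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [value, count])
        self._compress()

    def _compress(self):
        """Merge the two closest adjacent bins until the budget is met."""
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            self.bins[i:i + 2] = [merged]

    def merge(self, other):
        """Fold another histogram into this one (e.g. built on another machine)."""
        for center, count in list(other.bins):
            self.insert(center, count)

    def total(self):
        return sum(n for _, n in self.bins)


h1 = StreamingHistogram(max_bins=4)
for x in range(100):
    h1.insert(float(x))

h2 = StreamingHistogram(max_bins=4)
for x in range(100, 200):
    h2.insert(float(x))

h1.merge(h2)
print(h1.bins)
```

The bins move toward wherever the data falls, and two histograms built separately can be combined without revisiting the raw data.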
  • 68. BigML Streaming Trees
      BigML's trees are:
      CART: Classification & Regression Trees
      Grown breadth first: So partial trees are meaningful
      Built Hoeffding-style: So they consume streaming data and can split "early"
      Friendly for parallelization: Can work over multiple cores or multiple computers
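The idea behind splitting "early" can be sketched with the standard Hoeffding bound used in Hoeffding trees: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The slide only says the trees are built "Hoeffding-style", so the decision rule and function names below are assumptions for illustration, not BigML's actual criterion.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean is within epsilon of the true
    mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_gain, second_best_gain, value_range, delta, n):
    """Split once the best candidate leads the runner-up by more than epsilon."""
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical numbers: gains measured after 1000 streamed examples.
epsilon = hoeffding_bound(1.0, 1e-7, 1000)
print(epsilon)
print(can_split_early(0.30, 0.15, 1.0, 1e-7, 1000))
print(can_split_early(0.30, 0.25, 1.0, 1e-7, 1000))
```

Since epsilon shrinks as n grows, a clear winner can be committed to after few examples, while a close call waits for more data.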
  • 69. Growing a Streaming Tree
      • Each split breaks the data into subsets.
      • The split should make the subsets as distinct from one another as possible.
      • Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
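The two split criteria named above can be sketched as follows. This is a minimal batch illustration under stated assumptions: the function names and the toy data are hypothetical, and the exhaustive threshold search stands in for whatever streaming approximation the actual trees use.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def squared_error(values):
    """Sum of squared deviations from the mean (regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_classification_split(xs, labels):
    """Try every threshold x < t and keep the one with the highest gain."""
    best_gain, best_t = 0.0, None
    for t in sorted(set(xs))[1:]:
        left = [l for x, l in zip(xs, labels) if x < t]
        right = [l for x, l in zip(xs, labels) if x >= t]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t


# Hypothetical toy data: one numeric field and a two-class label.
xs = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
gain, threshold = best_classification_split(xs, labels)
print(gain, threshold)
```

For regression the same search applies, except that the chosen threshold is the one minimizing `squared_error(left) + squared_error(right)`.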
  • 70. Distributed Streaming Trees
  • 71. Streaming Trees - Early Splits
  • 72. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  • 73. Automatic Evaluations
  • 74. A marketplace for predictive models
  • 75. Simple is not easy “Any fool can make something complicated. It takes a genius to make it simple.” ― Woody Guthrie
  • 76. Machine Learning Made Easy True
  • 77. Agenda • Short intro • The Big Data Revolution • Demo • Behind the scenes • Coming down the pike • Hacking with the BigML API
  • 78. Back to the trenches Gallipoli
  • 79. Good Reading
      Big Data Trends. David Feinleib
        http://www.slideshare.net/bigdatalandscape/big-data-trends
      Hey Graduates: Forget Plastics - It's All About Machine Learning. Jeff Bussgang
        http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html
      More Data or Better Models. Xavier Amatriain
        http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html
      Making Big Data Small. Noel Welsh
        http://strataconf.com/strataeu/public/schedule/detail/25984
      Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz
        http://gigaom.com/2012/02/18/data-killed-the-hippo-star/
      When There’s No Such Thing as Too Much Information. Steve Lohr
        http://www.nytimes.com/2011/04/24/business/24unboxed.html
      Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas
        http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
      Six Rules for Effective Forecasting. Paul Saffo
        http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf
      Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz
        http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf