BigML Inc, 2012   Geneva, October 12, 2012
BigML's take on Big Data

BigML's take on Big Data. University of Geneva, October 12, 2012.

In the "Big Data" era, rapidly and easily getting insights from your data or creating data-driven applications does not have to be painful. BigML shows how business managers, application developers, and data scientists can start building their own predictive models in a matter of minutes.

Transcript of "BigML's take on Big Data"

  1. BigML Inc, 2012 Geneva, October 12, 2012
  2. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  3. Francisco J Martin. Background: • 5-year degree in Computer Science, UPV • Ph.D. in Artificial Intelligence, UPC • Postdoc (Machine Learning), Oregon State University • Founder and CEO at iSOCO • Founder and CEO at Strands • Co-authored 6 patents acquired by Apple Inc • Directly raised $75+MM in venture capital and cashed out additional $18+MM for early investors • Directly sold and negotiated $30+MM in licenses. BigML: • Co-founder and CEO • Joined: January 2011 • Tasks: Product conceptualization, design, and architecture • Develops: BigML middle-end and public API • 1202 (19%) of commits to total BigML code base
  4. Academia vs the Real-world. "Neo, sooner or later you're going to realize, just as I did, that there's a difference between knowing the path and walking the path."
  5. Walking the data path. [Timeline figure. Years: 1996, 1999, 2002, 2004, 2011, 2012. Stages: Academia, iSOCO, Academia, Strands Inc, BigML Inc. Labels: 8-queen problem; Intrusion Detection; Multi-agent Learning; E-commerce; Personalization; Recommender Systems; Music, video, fitness, finance; Machine Learning; Large-scale Machine Learning; Everything. Axis: Data.]
  6. BigML Status • Founded in Jan 2011 • 9 FTE, 1 PT • 5 Ph.Ds • 4 patent applications • Advisors and BA. US Patent Application No. 61/555,615 for: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES, filed November, 2011. US Patent Application No. 61/557,826 for: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT, filed November, 2011. US Patent Application No. 61/557,539 for: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS, filed November, 2011. US Patent Application No. 61/710,175 for: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE, filed October, 2012.
  7. From the trenches. Beneath Hill 60. BigML Team
  8. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  9. Big Data: What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
  10. Trends http://strata.oreilly.com/2011/08/building-data-startups.html
  11. What’s Big Data? Big Data means way too many different things to many different people. “when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” (George Dyson)
  12. What’s Big Data? The 3 V’s: Volume (big, enormous, huge, vast, immense, very large, etc); Variety (heterogeneous, diverse, complex, multiple sources, sensors, etc); Velocity (speed, dynamic, real-time, streamed, etc). The 3 I’s: Immediate (in the sense that you need to do something about it); Intimidating (What if you do not?); Ill-defined (What is it? Anyway?). Data matters!!!
  13. Machine Learning. Even if we, human beings, are learning machines, we are really bad at processing small amounts of data. Machines are good at quickly processing huge amounts of data. Machine Learning can make them learn from data.
  14. It’s all about machine learning. Forget plastics. It’s all about machine learning http://www.youtube.com/watch?v=PSxihhBzCjk “It's as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore's Law and the cloud mixed in with Machine Learning finally makes it all possible.” --- Jeff Bussgang
  15. Learning from Data. Unknown Model f : X -> Y (example: ideal credit approval formula). Training Examples (x1, l1), (x2, l2), ..., (xN, lN), each row x1 ... xN holding features f1, f2, ..., fn and a label (example: historical records of credit customers). Models M (example: set of candidate credit approval formulas). Learning Algorithm. Final Model g ~ f (example: learned credit approval formula). Based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin
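The learning setup on slide 15 can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the single "income" feature, the candidate thresholds standing in for the model set M, and the helper names. It only shows a learning algorithm picking the final model g from M by minimizing error on the training examples.

```python
from typing import Callable, List, Tuple

# Hypothetical training examples: (income, label), label 1 = credit approved.
Example = Tuple[float, int]


def make_threshold_rule(threshold: float) -> Callable[[float], int]:
    """One candidate model from the set M: approve if income >= threshold."""
    return lambda income: 1 if income >= threshold else 0


def training_error(rule: Callable[[float], int], examples: List[Example]) -> float:
    """Fraction of training examples the rule labels incorrectly."""
    wrong = sum(1 for x, label in examples if rule(x) != label)
    return wrong / len(examples)


def learn(examples: List[Example], thresholds: List[float]):
    """Learning algorithm: pick the candidate with the lowest training error."""
    best = min(
        thresholds,
        key=lambda t: training_error(make_threshold_rule(t), examples),
    )
    return best, make_threshold_rule(best)


examples = [(20.0, 0), (35.0, 0), (50.0, 1), (80.0, 1)]
best_threshold, g = learn(examples, thresholds=[10.0, 30.0, 40.0, 60.0])
```

Here `g` plays the role of the final model g ~ f: it agrees with the unknown approval formula on the historical records, which is all the learner can check.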
  16. What’s Big Machine Learning? Volume: Large-scale machine learning. What to do when data is too big to fit within the system memory of a single computer? Variety: Clean, refine, update, join, merge, aggregate, structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format. Velocity: Stream Algorithms
  17. Machine Learning ...or you can deal with that!
  18. Does More Data beat Better Algorithms? More features. More examples. The Unreasonable Effectiveness of Data. More Data or Better Models. Xavier Amatriain
  19. What’s Big Data? Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future. --- Francisco J Martin. Data: Past, Present, Future
  20. Big Data: What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
  21. Is Wikipedia right? Really? Seriously?? Are you kidding me???
  22. Data can’t be wrong?
  23. McKinsey can’t be wrong. Critical Shortage Of “Data Scientist” Talent Predicted By 2018
  24. HBR can’t be wrong
  25. Wikipedia is right!
  26. If Data Scientists don’t exist, can they be created?
  27. The first Data Scientist. Computer Scientist, Statistician, Mathematician. Hans’ brain, the first Data Scientist
  28. The magic formula. A data scientist is “part analyst, part artist.” Anjul Bhambhri, Vice President of Big Data Products at IBM
  29. Are Data Scientists super heroes?
  30. The most powerful human super hero http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html
  31. Are Data Scientists super heroes?
       Event           Decathlon World Record   High school World Record   World Record
       100 m           10.21                    10.08                      9.58
       Long Jump       8.23 m                   8.16 m                     8.95 m
       Shot Put        14.20 m                  20.65 m                    23.12 m
       High Jump       2.05 m                   2.31 m                     2.45 m
       400 m           46.70                    44.69                      43.18
       110 m hurdles   13.70                    13.74                      12.80
       Discus throw    42.81 m                  61.38 m                    74.08 m
       Pole Vault      5.30 m                   5.56 m                     6.14 m
       Javelin Throw   58.87 m                  73.74 m                    98.48 m
       1500 m          4:14.48                  3:38.26                    3:26.00
  32. The Wikipedia is always right!
  33. BigML’s Data Science Team. [Diagram mapping team members to skill areas: UI design, visualization, product design, business, architecture, software design, infrastructure and cloud-based computing, distributed systems, large-scale learning algorithm implementation, machine learning research, common sense.] Team: Oscar Rovira, MSc*; Bea Garcia, BSc; Justin Donaldson, PhD; Francisco J Martin, PhD; Jos Verwoerd, MSc; Poul Petersen, MSc; Jao, PhD; Charles Parker, PhD; Adam Ashenfelter, MSc; Tom Dietterich, PhD
  34. Take Away. Oscar Rovira, MSc*; Bea Garcia, BSc; Justin Donaldson, PhD; Francisco J Martin, PhD; Jos Verwoerd, MSc; Poul Petersen, MSc; Jao, PhD; Charles Parker, PhD; Adam Ashenfelter, MSc; Tom Dietterich, PhD. So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc. who are fabulous team players
  35. Big Data: What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
  36. Iris Dataset http://en.wikipedia.org/wiki/Iris_flower_data_set
  37. Digesting Big Data. Stages: Ingestion (capturing and storing); Digestion (processing); Absorption (deriving insights); Assimilation (making insights actionable); Egestion (reject bad data, wrong insights). Too much attention!!! on ingestion and digestion. Almost no attention!!! on assimilation.
  38. Big Data meets Hadoop • Hadoop has been excessively promoted as the way to make Big Data problems easy. • There are quite a few vendors pushing different Hadoop flavors to the market. However, Hadoop is complex, slow, expensive and batch
  39. Big Data and Hadoop. Running Hadoop on a cluster - The New IT sport of 2012
  40. Real-Time Hadoop? Really? Seriously?? Are you kidding me???
  41. Why not Hadoop? Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. • Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). • Iterative machine learning algorithms do not map trivially to MapReduce. • Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM. • In terms of hardware and programmer time, this may be a better option for the majority of data processing jobs. (Rowstron, A. et al, Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012) • Hadoop is bad at iterative algorithms: high job startup costs, and it is awkward to retain state across iterations. • High sensitivity to skew: iteration speed bounded by slowest task. • Potentially poor cluster utilization: must shuffle all data to a single reducer. (Large-Scale Machine Learning at Twitter, Jimmy Lin)
  42. Making Big Data Small. Hadoop: • Complex • Slow • Batch • Expensive. Streaming Algorithms: • Simple • Fast • Real-time • Cheap. Noel Welsh, Strata conference, London, October 2012
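To make the "streaming algorithms" column of slide 42 concrete, here is a minimal sketch of one well-known streaming algorithm, Welford's running mean and variance. It is not taken from the talk; the class name and the sample values are assumptions. It shows the property the slide is contrasting with batch processing: one pass over the data, constant memory, and an answer available after every record.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self) -> None:
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Consume one value; nothing from the stream is stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # a usable estimate exists after every update
```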
  43. Self-imposed Shackles. "Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well." http://www.selfgrowth.com/articles/Martinez1.html Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project
  44. Starting with Big Data • Buy a few machines and set up a cluster. • Install and run any flavor of Hadoop. • Figure out how to implement complex map-reduce algorithms to compute a few analytics. • Start with a very small data sample. • Use free or cloud-based tools to build a first predictive model that you can understand. • Check if the model gives you any practical insight. • Use the model to generate predictions and see if it can improve your performance. • Check how more data can improve the model. • Check if more sophisticated models can beat your model. • Iterate. • Check if the volume, variety, and velocity of your data require a behind-the-firewall/cloud solution or a batch/stream solution.
  45. Big Data: What is Big Data? What is a Data Scientist? How not to deal with Big Data? What is Data-driven Decision Making?
  46. Data-Driven Decisions. Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people http://www.nytimes.com/2011/04/24/business/24unboxed.html
  47. The “HiPPO” (Highest Paid Person’s Opinion) is dead
  48. Predictive Analytics. Descriptive Analytics: traditional, backward-looking business analytics. Predictive Analytics: Machine Learning
  49. Predictive Model. “The goal of a predictive model is not to predict the future but to help you make a better decision in the present” Taken from Paul Saffo, HBR
  50. Data-Driven Decision Making. Analytics and Predictive Analytics combined with Experience & Intuition
  51. It’s time to switch the attention. Stages: Ingestion (capturing and storing); Digestion (processing); Absorption (deriving insights); Assimilation (making insights actionable); Egestion (reject bad data, wrong insights). Less attention!!! on ingestion and digestion. More attention!!! on assimilation. More focus on the models and how to operationalize them than on the infrastructure to generate them
  52. Take aways • Big Data is just data • It’s all about machine learning • Try to excel in one of the data science disciplines • Don’t shackle yourself to the wrong platform • Trying to predict the future can help you make the right decision in the present • Focus on evaluation and actionability of models and not on how they are built
  53. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  54. BigML Goal. Highly Scalable, Cloud-based Machine Learning Service. Simple, Easy-to-Use and Seamless-to-Integrate
  55. BigML vs ML. You can deal with this... ...or you can deal with that! BigML 1-click model
  56. BigML vs Big Data. You can deal with this... ...or you can deal with that! BigML 1-click model
  57. How it Works
  58. Machine Learning Made Easy True
  59. Simple is not easy. “Any fool can make something complicated. It takes a genius to make it simple.” ― Woody Guthrie
  60. Fully Web based
  61. RESTful API
  62. Agenda • Short intro • The Big Data Revolution • What is BigML? - Demo • Behind the scenes • Coming down the pike • Hacking with the BigML API
  63. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  64. BigML’s Software Architecture. Front-end [Neutronia] [Medusa] [CuriousYellow] [Sky] Middle-end [Apian] Backend [Wintermute] Infrastructure [Sauron] Boto, Fabric
  65. BigML’s AWS-based Architecture
  66. Why Tree Models? • Highly scalable • Graphically representable and interactive • Easily understandable • Easily translatable into rules, PMML, and code. • Easily upgradable with ensembles: boosting, bagging, random forests, etc. • Top performers! http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
  67. BigML Histograms. BigML's trees and dataset summaries use histograms with the following traits. Streaming: data is never kept in memory, and only one pass over the data is needed to capture the distribution. Memory constrained: the less memory allocated, the lossier the compressed distribution. Dynamic: the histogram bins adjust themselves as they observe the data. Robust to ordered data: so it works even if the data stream is non-stationary. Merge friendly: for parallelization and distribution. More... http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
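The histogram traits listed on slide 67 can be sketched with a small fixed-size streaming histogram that merges its two closest bins whenever it grows past a bin budget. This is an illustrative sketch of that general technique, not BigML's implementation; the class name, the `max_bins` parameter, and the merge rule (count-weighted mean of the two closest centers) are assumptions.

```python
import bisect


class StreamingHistogram:
    """Fixed-memory histogram: at most max_bins (center, count) pairs."""

    def __init__(self, max_bins: int = 8) -> None:
        self.max_bins = max_bins
        self.bins = []  # sorted by center; each bin is [center, count]

    def insert(self, value: float, count: int = 1) -> None:
        """One-pass update: add the value as a bin, then shrink if needed."""
        centers = [b[0] for b in self.bins]
        i = bisect.bisect_left(centers, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [value, count])
        self._shrink()

    def _shrink(self) -> None:
        """Merge the two closest adjacent bins until the budget is met."""
        while len(self.bins) > self.max_bins:
            gaps = [
                self.bins[i + 1][0] - self.bins[i][0]
                for i in range(len(self.bins) - 1)
            ]
            i = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            self.bins[i:i + 2] = [merged]

    def merge(self, other: "StreamingHistogram") -> None:
        """Merge friendly: fold another histogram's bins into this one."""
        for center, count in other.bins:
            self.insert(center, count)

    def total(self) -> int:
        return sum(count for _, count in self.bins)


hist = StreamingHistogram(max_bins=4)
for v in range(1, 101):      # ordered input, seen exactly once
    hist.insert(float(v))
```

A smaller `max_bins` uses less memory and gives a lossier summary, which is the trade-off the slide calls "memory constrained". Two histograms built on separate machines can be combined with `merge`.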
  68. BigML Streaming Trees. BigML's trees are: CART (Classification & Regression Trees). Grown breadth first, so partial trees are meaningful. Built Hoeffding-style, so they consume streaming data and can split "early". Friendly for parallelization: can work over multiple cores or multiple computers
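The slide says only that the trees are built "Hoeffding-style". As background, a minimal sketch of the standard Hoeffding bound used by Hoeffding trees to decide when enough streamed examples have been seen to split early follows. The function names and the example numbers are assumptions, and nothing here is claimed to match BigML's actual split rule.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the observed mean of
    n samples is within epsilon of the true mean of a variable whose values
    span value_range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_gain: float, second_gain: float,
                    value_range: float, delta: float, n: int) -> bool:
    """Split once the best candidate beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# With few examples the bound is wide, so the tree keeps reading the stream;
# with more examples it tightens and the same gain gap justifies a split.
few = can_split_early(0.30, 0.10, value_range=1.0, delta=1e-7, n=50)
many = can_split_early(0.30, 0.10, value_range=1.0, delta=1e-7, n=1000)
```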
  69. Growing a Streaming Tree • Each split breaks the data into subsets. • The split should make the subsets as distinct from one another as possible. • Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
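The two split criteria named on slide 69 can be written down directly. This is a generic textbook sketch with made-up labels and values, not code from BigML: information gain is the drop in entropy from a parent set to its two subsets, and squared error is the sum of squared deviations from a subset's mean.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent: Sequence[str],
                     left: Sequence[str],
                     right: Sequence[str]) -> float:
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def squared_error(values: List[float]) -> float:
    """Sum of squared deviations from the mean (regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


parent = ["a", "a", "b", "b"]
clean_split = information_gain(parent, ["a", "a"], ["b", "b"])    # subsets are pure
useless_split = information_gain(parent, ["a", "b"], ["a", "b"])  # nothing learned
```

A classification split is chosen to make `information_gain` as large as possible; a regression split is chosen to make the summed `squared_error` of the two subsets as small as possible.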
  70. Distributed Streaming Trees
  71. Streaming Trees - Early Splits
  72. Agenda • Short intro • The Big Data Revolution • What is BigML? • Behind the scenes • Coming down the pike • Hacking with the BigML API
  73. Automatic Evaluations
  74. A marketplace for predictive models
  75. Simple is not easy. “Any fool can make something complicated. It takes a genius to make it simple.” ― Woody Guthrie
  76. Machine Learning Made Easy True
  77. Agenda • Short intro • The Big Data Revolution • Demo • Behind the scenes • Coming down the pike • Hacking with the BigML API
  78. Back to the trenches. Gallipoli
  79. Good Reading • Big Data Trends. David Feinleib http://www.slideshare.net/bigdatalandscape/big-data-trends • Hey Graduates: Forget Plastics - It's All About Machine Learning. Jeff Bussgang http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html • More Data or Better Models. Xavier Amatriain http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html • Making Big Data Small. Noel Welsh http://strataconf.com/strataeu/public/schedule/detail/25984 • Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz http://gigaom.com/2012/02/18/data-killed-the-hippo-star/ • When There’s No Such Thing as Too Much Information. Steve Lohr http://www.nytimes.com/2011/04/24/business/24unboxed.html • Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf • Six Rules for Effective Forecasting. Paul Saffo http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf • Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
  80. BigML Inc, 2012 Geneva, October 12, 2012