BigML's take on Big Data
 


BigML's take on Big Data. University of Geneva, October 12, 2012.

In the "Big Data" era, rapidly and easily getting insights from your data or creating data-driven applications does not have to be painful. BigML shows how business managers, application developers, and data scientists can start building their own predictive models in a matter of minutes.

Usage Rights

CC Attribution-ShareAlike License

    BigML's take on Big Data: Presentation Transcript

    • BigML Inc, 2012 Geneva, October 12, 2012
    • Agenda: Short intro; The Big Data Revolution; What is BigML?; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • Francisco J Martin. Background: 5-year degree in Computer Science, UPV; Ph.D. in Artificial Intelligence, UPC; Postdoc (Machine Learning), Oregon State University; Founder and CEO at iSOCO; Founder and CEO at Strands; co-authored 6 patents acquired by Apple Inc; directly raised $75+MM in venture capital and cashed out an additional $18+MM for early investors; directly sold and negotiated $30+MM in licenses. BigML: Co-founder and CEO; joined January 2011; tasks: product conceptualization, design, and architecture; develops the BigML middle-end and public API; 1202 (19%) of commits to the total BigML code base
    • Academia vs the Real-world: “Neo, sooner or later you’re going to realize, just as I did, that there’s a difference between knowing the path and walking the path”
    • Walking the data path [timeline figure: 1996, 1999, 2002, 2004, 2011, 2012; stages: Academia, iSOCO, Academia, Strands Inc, BigML Inc; topics shown: 8-queen problem, Machine Learning, Multi-agent Learning, Intrusion Detection, E-commerce, Personalization, Recommender Systems, Music, video, fitness, finance, Large-scale Machine Learning, Everything, Data]
    • BigML Status: Founded in Jan 2011; 9 FTE, 1 PT; 5 Ph.Ds; 4 patent applications; Advisors and BA. Patent applications:
      • US Patent Application No. 61/555,615. For: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES. Filed: November, 2011
      • US Patent Application No. 61/557,826. For: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT. Filed: November, 2011
      • US Patent Application No. 61/557,539. For: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS. Filed: November, 2011
      • US Patent Application No. 61/710,175. For: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE. Filed: October, 2012
    • From the trenches: Beneath Hill 60; BigML Team
    • Agenda: Short intro; The Big Data Revolution; What is BigML?; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • Big Data: What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
    • Trends: http://strata.oreilly.com/2011/08/building-data-startups.html
    • What’s Big Data? Big Data means way too many different things to many different people. “when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” (George Dyson)
    • What’s Big Data? Data matters!!!
      • The 3 V’s: Volume (big, enormous, huge, vast, immense, very large, etc); Variety (heterogeneous, diverse, complex, multiple sources, sensors, etc); Velocity (speed, dynamic, real-time, streamed, etc)
      • The 3 I’s: Immediate (in the sense that you need to do something about it); Intimidating (what if you do not?); Ill-defined (what is it, anyway?)
    • Machine Learning: Even if we, human beings, are learning machines, we are really bad at processing small amounts of data. Machines are good at quickly processing huge amounts of data. Machine Learning can make them learn from data.
    • It’s all about machine learning: Forget plastics. It’s all about machine learning. http://www.youtube.com/watch?v=PSxihhBzCjk “It’s as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore’s Law and the cloud mixed in with Machine Learning finally makes it all possible.” (Jeff Bussgang)
    • Learning from Data (based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin)
      • Unknown Model f : X -> Y. Example: ideal credit approval formula
      • Training Examples (x1, l1), (x2, l2), ..., (xN, lN), each with features f1, f2, ..., fn and a label. Example: historical records of credit customers
      • Models M. Example: set of candidate credit approval formulas
      • Learning Algorithm
      • Final Model g ~ f. Example: learned credit approval formula
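The learning setup on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the income threshold rules, the toy records, and the names `training_error`, `learn`, and `make_threshold_rule` are hypothetical and do not come from the talk. The "learning algorithm" here is simply picking, from a small set of candidate models, the one with the lowest error on the training examples.

```python
from typing import Callable, List, Tuple

# One training example: (feature value, label). Here the single feature is a
# hypothetical "income" and the label is 1 (approve) or 0 (reject).
Example = Tuple[float, int]
Model = Callable[[float], int]


def training_error(h: Model, examples: List[Example]) -> float:
    """Fraction of training examples that model h labels incorrectly."""
    wrong = sum(1 for x, label in examples if h(x) != label)
    return wrong / len(examples)


def learn(models: List[Model], examples: List[Example]) -> Model:
    """Learning algorithm: return the candidate with the lowest training error."""
    return min(models, key=lambda h: training_error(h, examples))


def make_threshold_rule(threshold: float) -> Model:
    """Candidate credit approval formula: approve when income >= threshold."""
    return lambda x: 1 if x >= threshold else 0


# Historical records of credit customers (toy data, assumed for illustration).
examples = [(20.0, 0), (35.0, 0), (50.0, 1), (80.0, 1)]
# Set of candidate credit approval formulas.
models = [make_threshold_rule(t) for t in (10.0, 40.0, 90.0)]
# Final model g, chosen to approximate the unknown f on the training data.
g = learn(models, examples)
```

The chosen `g` is only as good as the candidate set and the training examples allow, which is why the later slides stress evaluating models on more data.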
    • What’s Big Machine Learning?
      • Volume: Large-scale machine learning. What to do when data is too big to fit within the system memory of a single computer?
      • Variety: Clean, refine, update, join, merge, aggregate, structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format
      • Velocity: Stream Algorithms
    • Machine Learning ...or you can deal with that!
    • Does More Data beat Better Algorithms? More features. More examples. The Unreasonable Effectiveness of Data. More Data or Better Models, Xavier Amatriain
    • What’s Big Data? “Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future.” (Francisco J Martin) [figure: Data; Past, Present, Future]
    • Big Data: What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
    • Is Wikipedia right? Really? Seriously?? Are you kidding me???
    • Data can’t be wrong?
    • McKinsey can’t be wrong: Critical Shortage Of “Data Scientist” Talent Predicted By 2018
    • HBR can’t be wrong
    • Wikipedia is right!
    • If Data Scientists don’t exist can they be created?
    • The first Data Scientist: Computer Scientist, Statistician, Mathematician. Hans’ brain, the first Data Scientist
    • The magic formula: A data scientist is “part analyst, part artist.” (Anjul Bhambhri, Vice President of Big Data Products at IBM)
    • Are Data Scientists super heroes?
    • The most powerful human super hero: http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html
    • Are Data Scientists super heroes?
      Events          Decathlon World Record   High school World Record   World Record
      100 m           10.21                    10.08                      9.58
      Long Jump       8.23 m                   8.16 m                     8.95 m
      Shot Put        14.20 m                  20.65 m                    23.12 m
      High Jump       2.05 m                   2.31 m                     2.45 m
      400 m           46.70                    44.69                      43.18
      110 m hurdles   13.70                    13.74                      12.80
      Discus throw    42.81 m                  61.38 m                    74.08 m
      Pole Vault      5.30 m                   5.56 m                     6.14 m
      Javelin Throw   58.87 m                  73.74 m                    98.48 m
      1500 m          4:14.48                  3:38.26                    3:26.00
    • The Wikipedia is always right!
    • BigML’s Data Science Team [team diagram]
      • Areas shown: UI Design; Visualization; Infrastructure, Cloud-based Computing; Product Design; Common Sense; Business and Architecture; Software Design; Distributed Systems; Large-scale and learning algorithm implementation; Machine Learning Research
      • Team: Oscar Rovira, MSc*; Bea Garcia, BSc; Justin Donaldson, Ph.D.; Francisco J Martin, PhD; Jos Verwoerd, MSc; Poul Petersen, MSc; Jao, PhD; Charles Parker, PhD; Adam Ashenfelter, MSc; Tom Dietterich, PhD
    • Take Away (team: Oscar Rovira, MSc*; Bea Garcia, BSc; Justin Donaldson, Ph.D.; Francisco J Martin, PhD; Jos Verwoerd, MSc; Poul Petersen, MSc; Jao, PhD; Charles Parker, PhD; Adam Ashenfelter, MSc; Tom Dietterich, PhD): So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc., who are fabulous team players
    • Big Data: What is Big Data? What is a Data Scientist? How not to start with Big Data? What is Data-driven Decision Making?
    • Iris Dataset: http://en.wikipedia.org/wiki/Iris_flower_data_set
    • Digesting Big Data. Stages: Ingestion (capturing and storing); Digestion (processing); Absorption (deriving insights); Assimilation (making insights actionable); Egestion (reject bad data, wrong insights). Too much attention!!! at the ingestion end; almost no attention!!! at the assimilation end
    • Big Data meets Hadoop
      • Hadoop has been excessively promoted as the way to make Big Data problems easy.
      • There are quite a few vendors pushing different Hadoop flavors to the market. However, Hadoop is complex, slow, expensive and batch
    • Big Data and Hadoop: Running Hadoop on a cluster - The New IT sport of 2012
    • Real-Time Hadoop? Really? Seriously?? Are you kidding me???
    • Why not Hadoop? Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger
      • Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB)
      • Iterative machine learning algorithms do not map trivially to MapReduce.
      • Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM
      • In terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
      (Rowstron, A. et al, Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012)
      • Hadoop is bad at iterative algorithms: high job startup costs and awkward to retain state across iterations
      • High sensitivity to skew: iteration speed bounded by slowest task.
      • Potentially poor cluster utilization: must shuffle all data to a single reducer.
      (Large-Scale Machine Learning at Twitter, Jimmy Lin)
    • Making Big Data Small (Noel Welsh, Strata conference, London, October 2012)
      • Hadoop: Complex; Slow; Batch; Expensive
      • Streaming Algorithms: Simple; Fast; Real-time; Cheap
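To make the "simple, fast, real-time" column concrete, here is a minimal sketch of one well-known streaming algorithm: Welford's single-pass method for a running mean and variance. It is not taken from the talk; the class name `RunningStats` and the sample values are assumptions chosen for illustration. Each value is seen once and then discarded, so memory use stays constant no matter how long the stream is.

```python
class RunningStats:
    """Single-pass (streaming) mean and sample variance via Welford's method."""

    def __init__(self) -> None:
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the summary; the value is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of the values seen so far (0.0 with fewer than two)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # results are available after every single update
```

Because the summary is always up to date, it can be queried at any point in the stream rather than only at the end of a batch job.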
    • Self-imposed Shackles: “Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well.” http://www.selfgrowth.com/articles/Martinez1.html Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project
    • Starting with Big Data
      • Buy a few machines and set up a cluster.
      • Install and run any flavor of Hadoop.
      • Figure out how to implement complex map-reduce algorithms to compute a few analytics.
      • Start with a very small data sample.
      • Use free or cloud-based tools to build a first predictive model that you can understand.
      • Check if the model gives you any practical insight.
      • Use the model to generate predictions and see if it can improve your performance.
      • Check how more data can improve the model.
      • Check if more sophisticated models can beat your model.
      • Iterate.
      • Check if the volume, variety, and velocity of your data require a behind-the-firewall/cloud solution or a batch/stream solution.
    • Big Data: What is Big Data? What is a Data Scientist? How not to deal with Big Data? What is Data-driven Decision Making?
    • Data-Driven Decisions: Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people. http://www.nytimes.com/2011/04/24/business/24unboxed.html
    • The “HiPPO” (Highest Paid Person’s Opinion) is dead
    • Predictive Analytics. Descriptive Analytics: traditional, backward-looking business analytics. Predictive Analytics: Machine Learning
    • Predictive Model: “The goal of a predictive model is not to predict the future but to help you make a better decision in the present” (taken from Paul Saffo, HBR)
    • Data-Driven Decision Making: Analytics and Predictive Analytics combined with Experience & Intuition
    • It’s time to switch the attention: more focus on the models and how to operationalize them than on the infrastructure to generate them. Stages: Ingestion (capturing and storing); Digestion (processing); Absorption (deriving insights); Assimilation (making insights actionable); Egestion (reject bad data, wrong insights). More attention!!! at the assimilation end; less attention!!! at the ingestion end
    • Take aways
      • Big Data is just data
      • It’s all about machine learning
      • Try to excel in one of the data science disciplines
      • Don’t shackle yourself to the wrong platform
      • Trying to predict the future can help you make the right decision in the present
      • Focus on evaluation and actionability of models and not on how they are built
    • Agenda: Short intro; The Big Data Revolution; What is BigML?; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • BigML Goal: Highly Scalable, Cloud-based Machine Learning Service. Simple, Easy-to-Use and Seamless-to-Integrate
    • BigML vs ML: You can deal with this... or you can deal with that! BigML 1-click model
    • BigML vs Big Data: You can deal with this... or you can deal with that! BigML 1-click model
    • How it Works
    • Machine Learning Made Easy True
    • Simple is not easy: “Any fool can make something complicated. It takes a genius to make it simple.” (Woody Guthrie)
    • Fully Web based
    • RESTful API
    • Agenda: Short intro; The Big Data Revolution; What is BigML? - Demo; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • Agenda: Short intro; The Big Data Revolution; What is BigML?; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • BigML’s Software Architecture: Front-end [Neutronia] [Medusa] [CuriousYellow] [Sky]; Middle-end [Apian]; Backend [Wintermute]; Infrastructure [Sauron], Boto, Fabric
    • BigML’s AWS-based Architecture
    • Why Tree Models? http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
      • Highly scalable
      • Graphically representable and interactive
      • Easily understandable
      • Easily translatable into rules, PMML, and code.
      • Easily upgradable with ensembles: boosting, bagging, random forests, etc
      • Top performers!
    • BigML Histograms: BigML’s trees and dataset summaries use histograms with the following traits:
      • Streaming: Data is never kept in memory; only one pass over the data is needed to capture the distribution.
      • Memory constrained: The less memory allocated, the lossier the compressed distribution.
      • Dynamic: The histogram bins adjust themselves as they observe the data.
      • Robust to ordered data: So it works even if the data stream is non-stationary
      • Merge friendly: For parallelization and distribution.
      • More...: http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
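The traits listed for the histograms can be illustrated with a small sketch in the spirit of the Ben-Haim and Tom-Tov streaming histogram: keep at most a fixed number of bins and, whenever the budget is exceeded, merge the two closest bins into their weighted average. This is an assumption-laden illustration, not BigML's implementation; the class `StreamingHistogram` and its methods are hypothetical names.

```python
from bisect import insort
from typing import List


class StreamingHistogram:
    """Fixed-budget histogram: one pass over the data, bins adapt as values arrive."""

    def __init__(self, max_bins: int) -> None:
        self.max_bins = max_bins
        self.bins: List[list] = []  # sorted list of [center, count]

    def insert(self, x: float) -> None:
        """Add one value as its own bin, then compress back to the budget."""
        insort(self.bins, [float(x), 1])
        self._compress()

    def merge(self, other: "StreamingHistogram") -> None:
        """Absorb another histogram, e.g. one built on a different machine."""
        for center, count in other.bins:
            insort(self.bins, [center, count])
        self._compress()

    def total(self) -> int:
        """Number of values summarized so far."""
        return sum(count for _, count in self.bins)

    def _compress(self) -> None:
        # Merge the two closest adjacent bins until the memory budget is met.
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            self.bins[i:i + 2] = [merged]


hist = StreamingHistogram(max_bins=3)
for value in [1, 2, 3, 10, 11, 12, 20, 21, 22]:
    hist.insert(value)
```

A smaller `max_bins` uses less memory but gives a lossier summary, and `merge` is what makes the structure convenient for parallel or distributed use.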
    • BigML Streaming Trees: BigML’s trees are:
      • CART: Classification & Regression Trees
      • Grown breadth first: So partial trees are meaningful
      • Built Hoeffding-style: So they consume streaming data and can split “early”
      • Friendly for parallelization: Can work over multiple cores or multiple computers
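"Hoeffding-style" presumably refers to the idea behind Hoeffding trees: after seeing n examples, the Hoeffding bound limits how far an observed average can be from its true value, so a node can commit to a split early once the best candidate leads the runner-up by more than that margin. The sketch below shows only this generic test; the slide does not give BigML's actual criterion, and the function names and the parameters `value_range` and `delta` are assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Margin epsilon such that, with probability 1 - delta, the true mean of a
    quantity with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_score: float, second_score: float,
                    value_range: float, delta: float, n: int) -> bool:
    """True when the best split beats the runner-up by more than the bound,
    so waiting for more data is unlikely to change the choice."""
    return (best_score - second_score) > hoeffding_bound(value_range, delta, n)


# With 1000 examples seen, a lead of 0.10 in the split score is enough
# (value_range=1.0 would fit information gain on a two-class problem).
early = can_split_early(0.30, 0.20, value_range=1.0, delta=0.05, n=1000)
# With only 10 examples the same lead is not yet convincing.
too_soon = can_split_early(0.30, 0.20, value_range=1.0, delta=0.05, n=10)
```

The bound shrinks as n grows, which is why clear-cut splits can be made after few examples while close calls wait for more of the stream.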
    • Growing a Streaming Tree
      • Each split breaks the data into subsets.
      • The split should make the subsets as distinct from one another as possible.
      • Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
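The two split criteria named above can be written down directly. The sketch below uses the standard textbook definitions of entropy-based information gain and within-subset squared error; the function names and toy labels are chosen for illustration and are not taken from BigML's code.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent: Sequence[str],
                     left: Sequence[str],
                     right: Sequence[str]) -> float:
    """Entropy of the parent minus the size-weighted entropy of the two subsets."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def squared_error(values: List[float]) -> float:
    """Sum of squared deviations from the subset mean (regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# A split that separates the two classes perfectly recovers the full 1 bit.
parent = ["setosa", "setosa", "virginica", "virginica"]
perfect_gain = information_gain(parent, parent[:2], parent[2:])
# A split that leaves both subsets as mixed as the parent gains nothing.
useless_gain = information_gain(parent, ["setosa", "virginica"],
                                ["setosa", "virginica"])
```

For classification a tree builder would pick the candidate split with the highest gain; for regression it would pick the split whose subsets have the lowest combined squared error.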
    • Distributed Streaming Trees
    • Streaming Trees - Early Splits
    • Agenda: Short intro; The Big Data Revolution; What is BigML?; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • Automatic Evaluations
    • A marketplace for predictive models
    • Simple is not easy: “Any fool can make something complicated. It takes a genius to make it simple.” (Woody Guthrie)
    • Machine Learning Made Easy True
    • Agenda: Short intro; The Big Data Revolution; Demo; Behind the scenes; Coming down the pike; Hacking with the BigML API
    • Back to the trenches: Gallipoli
    • Good Reading
      • Big Data Trends. David Feinleib. http://www.slideshare.net/bigdatalandscape/big-data-trends
      • Hey Graduates: Forget Plastics - It’s All About Machine Learning. Jeff Bussgang. http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html
      • More Data or Better Models. Xavier Amatriain. http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html
      • Making Big Data Small. Noel Welsh. http://strataconf.com/strataeu/public/schedule/detail/25984
      • Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz. http://gigaom.com/2012/02/18/data-killed-the-hippo-star/
      • When There’s No Such Thing as Too Much Information. Steve Lohr. http://www.nytimes.com/2011/04/24/business/24unboxed.html
      • Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas. http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
      • Six Rules for Effective Forecasting. Paul Saffo. http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf
      • Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz. http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf