BigML's take on Big Data. University of Geneva, October 12, 2012.
In the "Big Data" era, rapidly and easily getting insights from your data or creating data-driven applications does not have to be painful. BigML shows how business managers, application developers, and data scientists can start building their own predictive models in a matter of minutes.
2. Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
BigML Inc, 2012 Geneva, October 12, 2012 2
3. Francisco J Martin
Background:
• 5-year degree in Computer Science, UPV
• Ph.D. in Artificial Intelligence, UPC
• Postdoc (Machine Learning), Oregon State University
• Founder and CEO at iSOCO
• Founder and CEO at Strands
• Co-authored 6 patents acquired by Apple Inc
• Directly raised $75+MM in venture capital and cashed out an additional $18+MM for early investors
• Directly sold and negotiated $30+MM in licenses
BigML:
• Co-founder and CEO
• Joined: January 2011
• Tasks: Product conceptualization, design, and architecture
• Develops: BigML middle-end and public API
• 1202 (19%) of commits to total BigML code base
4. Academia vs the Real-world
Neo, sooner or later you're going to realize, just as I did, that there's a difference between knowing the path, and walking the path
5. Walking the data path
[Figure: career timeline plotted against a "Data" axis]
Topics along the path: 8-queen problem; Multi-agent Learning; Intrusion Detection; E-commerce; Personalization; Recommender Systems; Music, video, fitness, finance; Large-scale Machine Learning; Everything Machine Learning
Years: 1996, 1999, 2002, 2004, 2011, 2012
Stages: Academia, iSOCO, Academia, Strands Inc, BigML Inc
6. BigML Status
• Founded in Jan 2011
• 9 FTE, 1 PT
• 5 Ph.Ds
• 4 patent applications:
  US Patent Application No. 61/555,615, filed November 2011: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES
  US Patent Application No. 61/557,826, filed November 2011: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT
  US Patent Application No. 61/557,539, filed November 2011: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS
  US Patent Application No. 61/710,175, filed October 2012: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE
• Advisors and BA:
7. From the trenches
Beneath Hill 60 BigML Team
8. Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
9. Big Data
What is Big Data?
What is a Data Scientist?
How not to start with Big Data?
What is Data-driven Decision Making?
11. What’s Big Data?
Big Data means way too many different things to many different people
“when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” George Dyson
12. What’s Big Data?
The 3 V’s
Volume: big, enormous, huge, vast, immense, very large, etc.
Variety: heterogeneous, diverse, complex, multiple sources, sensors, etc.
Velocity: speed, dynamic, real-time, streamed, etc.
The 3 I’s
Immediate: in the sense that you need to do something about it
Intimidating: what if you do not?
Ill-defined: what is it, anyway?
Data matters!!!
13. Machine Learning
Even if we, human beings, are learning machines, we are really bad at processing small amounts of data.
Machines are good at quickly processing huge amounts of data.
Machine Learning can make them learn from data.
14. It’s all about machine learning
Forget plastics. It’s all about machine learning
http://www.youtube.com/watch?v=PSxihhBzCjk
It's as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore's Law and the cloud mixed in with Machine Learning finally makes it all possible. --- Jeff Bussgang
15. Learning from Data
Unknown Model: f : X -> Y
  Example: ideal credit approval formula
Training Examples: (x1, l1), (x2, l2), ..., (xN, lN)
  Each example x1 ... xN is a row of features f1, f2, ..., fn plus a label
  Example: historical records of credit customers
Models: M
  Example: set of candidate credit approval formulas
Learning Algorithm: selects the Final Model from M using the Training Examples
Final Model: g ~ f
  Example: learned credit approval formula
Based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin
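The learning setup on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the single "income" feature, the threshold rules used as the candidate set M, and the toy records are all hypothetical and do not come from the talk.

```python
from typing import Callable, List, Tuple

# One training example: (annual income in k$, label), where 1 = approve, 0 = reject.
# The feature and the records are hypothetical, chosen only for illustration.
Example = Tuple[float, int]


def make_threshold_rule(threshold: float) -> Callable[[float], int]:
    """One candidate formula in M: approve when income is at least `threshold`."""
    return lambda income: 1 if income >= threshold else 0


def training_error(h: Callable[[float], int], examples: List[Example]) -> float:
    """Fraction of training examples that hypothesis h labels incorrectly."""
    return sum(1 for x, label in examples if h(x) != label) / len(examples)


def learn(examples: List[Example], thresholds: List[float]):
    """Learning algorithm: pick the candidate in M with the lowest training error."""
    best = min(
        thresholds,
        key=lambda t: training_error(make_threshold_rule(t), examples),
    )
    return best, make_threshold_rule(best)


# Historical records of credit customers (toy data).
examples = [(20.0, 0), (35.0, 0), (50.0, 1), (65.0, 1), (80.0, 1)]
# The set of candidate models M, here five threshold rules.
thresholds = [10.0, 30.0, 40.0, 60.0, 90.0]

best_t, g = learn(examples, thresholds)
print(best_t, training_error(g, examples))
```

The final model g is whichever candidate best matches the historical records; whether it also approximates the unknown f on new customers is a separate question that the training error alone does not answer.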
16. What’s Big Machine Learning?
Volume: Large-scale machine learning. What to do when data is too big to fit within the system memory of a single computer?
Variety: Clean, refine, update, join, merge, aggregate, structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format
Velocity: Stream Algorithms
17. Machine Learning
...or you can deal with that!
18. Does More Data beat Better Algorithms?
More features
More examples
The Unreasonable Effectiveness of Data
More Data or Better Models.
Xavier Amatriain
19. What’s Big Data?
Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future. --- Francisco J Martin
Data Past Present Future
20. Big Data
What is Big Data?
What is a Data Scientist?
How not to start with Big Data?
What is Data-driven Decision Making?
21. Is Wikipedia right?
Really? Seriously?? Are you kidding me???
22. Data can’t be wrong?
23. McKinsey can’t be wrong
Critical Shortage Of “Data Scientist” Talent Predicted By 2018
24. HBR can’t be wrong
26. If Data Scientists don’t exist, can they be created?
27. The first Data Scientist
Computer Scientist
Statistician
Mathematician
Hans’ brain, the first Data Scientist
28. The magic formula
A data scientist is “part analyst, part artist.”
Anjul Bhambhri, Vice President of Big Data Products at IBM
29. Are Data Scientists super heroes?
30. The most powerful human super hero
http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html
31. Are Data Scientists super heroes?
Events           Decathlon World Record   High School World Record   World Record
100 m            10.21                    10.08                      9.58
Long Jump        8.23 m                   8.16 m                     8.95 m
Shot Put         14.20 m                  20.65 m                    23.12 m
High Jump        2.05 m                   2.31 m                     2.45 m
400 m            46.70                    44.69                      43.18
110 m hurdles    13.70                    13.74                      12.80
Discus Throw     42.81 m                  61.38 m                    74.08 m
Pole Vault       5.30 m                   5.56 m                     6.14 m
Javelin Throw    58.87 m                  73.74 m                    98.48 m
1500 m           4:14.48                  3:38.26                    3:26.00
32. The Wikipedia is always right!
33. BigML’s Data Science Team
[Diagram: team members placed around overlapping skill areas]
Skill areas: UI Design, Visualization; Infrastructure, Cloud-based Computing; Product Design, Common Sense, Business; Architecture, Software Design, Distributed Systems; Large-scale and learning algorithm implementation; Machine Learning Research
Team:
Oscar Rovira, MSc*
Bea Garcia, BSc
Justin Donaldson, PhD
Francisco J Martin, PhD
Jos Verwoerd, MSc
Poul Petersen, MSc
Jao, PhD
Charles Parker, PhD
Adam Ashenfelter, MSc
Tom Dietterich, PhD
34. Take Away
Oscar Rovira, MSc*
Bea Garcia, BSc
Justin Donaldson Ph.D.
Francisco J Martin, PhD
Jos Verwoerd, MSc
Poul Petersen, MSc
Jao, PhD
Charles Parker, PhD
Adam Ashenfelter, MSc
Tom Dietterich, PhD
So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc. who are fabulous team players.
35. Big Data
What is Big Data?
What is a Data Scientist?
How not to start with Big Data?
What is Data-driven Decision Making?
37. Digesting Big Data
Assimilation (making insights actionable)
Almost no attention!!!
Egestion (reject bad data, wrong insights)
Absorption (deriving insights)
Digestion (processing)
Too much attention!!!
Ingestion (capturing and storing)
38. Big Data meets Hadoop
• Hadoop has been excessively promoted as the way to make Big Data problems easy.
• There are quite a few vendors pushing different Hadoop flavors to the market.
However, Hadoop is complex, slow, expensive, and batch-oriented.
39. Big Data and Hadoop
Running Hadoop on a cluster - The New IT sport of 2012
40. Real-Time Hadoop?
Really? Seriously?? Are you kidding me???
41. Why not Hadoop?
Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger.
• Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB).
• Iterative machine learning algorithms do not map trivially to MapReduce.
• Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM.
• In terms of hardware and programmer time, such a server may be a better option for the majority of data processing jobs.
Rowstron, A. et al., Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012
• Hadoop is bad at iterative algorithms: high job startup costs, and it is awkward to retain state across iterations.
• High sensitivity to skew: iteration speed is bounded by the slowest task.
• Potentially poor cluster utilization: must shuffle all data to a single reducer.
Large-Scale Machine Learning at Twitter, Jimmy Lin
42. Making Big Data Small
Hadoop         Streaming Algorithms
• Complex      • Simple
• Slow         • Fast
• Batch        • Real-time
• Expensive    • Cheap
Noel Welsh, Strata conference, London, October 2012
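To make the "simple, fast, real-time" column concrete, here is a minimal sketch of one well-known streaming algorithm: Welford's one-pass method for the running mean and variance. It is not taken from the talk; the class name and the sample values are assumptions made for illustration. Each value is seen once and then discarded, so memory use stays constant however long the stream is.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's method)."""

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the summary; the value itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of the values seen so far (0.0 with fewer than two)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # the summary is up to date after every value

print(stats.n, stats.mean, stats.variance)
```

The same pattern (a small summary updated per record) is what lets a single machine answer questions about data that never fits in memory at once.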
43. Self-imposed Shackles
Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well.
http://www.selfgrowth.com/articles/Martinez1.html
Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project
44. Starting with Big Data
How not to start:
• Buy a few machines and set up a cluster.
• Install and run any flavor of Hadoop.
• Figure out how to implement complex map-reduce algorithms to compute a few analytics.
A better way to start:
• Start with a very small data sample.
• Use free or cloud-based tools to build a first predictive model that you can understand.
• Check if the model gives you any practical insight.
• Use the model to generate predictions and see if it can improve your performance.
• Check how more data can improve the model.
• Check if more sophisticated models can beat your model.
• Iterate.
• Check if the volume, variety, and velocity of your data require a behind-the-firewall or cloud solution, or a batch or stream solution.
45. Big Data
What is Big Data?
What is a Data Scientist?
How not to deal with Big Data?
What is Data-driven Decision Making?
46. Data-Driven Decisions
Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people
http://www.nytimes.com/2011/04/24/business/24unboxed.html
47. The “HiPPO” (Highest Paid Person’s Opinion) is dead
49. Predictive Model
“The goal of a predictive model is not to predict the future but to help you make a better decision in the present”
Taken from Paul Saffo, HBR
50. Data-Driven Decision Making
Analytics and Predictive Analytics combined with Experience & Intuition
51. It’s time to switch the attention
Assimilation (making insights actionable)
More attention!!!
Egestion (reject bad data, wrong insights)
Absorption (deriving insights)
Digestion (processing)
Less attention!!!
Ingestion (capturing and storing)
More focus on the models and how to operationalize them than on the infrastructure to generate them
52. Take aways
•Big Data is just data
•It’s all about machine learning
•Try to excel in one of the data science disciplines
•Don’t shackle yourself to the wrong platform
•Trying to predict the future can help you make the right decision in the present
•Focus on evaluation and actionability of models and not on how they are built
53. Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
54. BigML Goal
Highly Scalable, Cloud-based Machine Learning Service
Simple, Easy-to-Use and Seamless-to-Integrate
55. BigML vs ML
You can deal with this...
...or you can deal with that!
BigML 1-click model
56. BigML vs Big Data
You can deal with this...
...or you can deal with that!
BigML 1-click model
59. Simple is not easy
“Any fool can make something complicated. It takes a genius to make it simple.”
― Woody Guthrie
62. Agenda
• Short intro
• The Big Data Revolution
• What is BigML? - Demo
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
63. Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
66. Why Tree Models?
• Highly scalable
• Graphically representable and interactive
• Easily understandable
• Easily translatable into rules, PMML, and code.
• Easily upgradable with ensembles: boosting, bagging, random forests, etc.
• Top performers! http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
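The "easily translatable into rules" point can be illustrated with a short sketch. The nested-dictionary tree layout below is hypothetical and chosen only for this example (it is not BigML's model format), as are the field names and thresholds. Each root-to-leaf path becomes one IF ... THEN rule.

```python
def tree_to_rules(node, conditions=()):
    """Walk a binary decision tree and emit one rule per leaf.

    `node` is either a leaf, {"prediction": ...}, or an internal node,
    {"field": ..., "threshold": ..., "left": ..., "right": ...}, where the
    left branch means field <= threshold. This layout is an assumption.
    """
    if "prediction" in node:
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node['prediction']}"]
    field, threshold = node["field"], node["threshold"]
    left_rules = tree_to_rules(
        node["left"], conditions + (f"{field} <= {threshold}",)
    )
    right_rules = tree_to_rules(
        node["right"], conditions + (f"{field} > {threshold}",)
    )
    return left_rules + right_rules


# A toy credit-approval tree (hypothetical fields and thresholds).
tree = {
    "field": "income",
    "threshold": 40,
    "left": {"prediction": "reject"},
    "right": {
        "field": "debt",
        "threshold": 10,
        "left": {"prediction": "approve"},
        "right": {"prediction": "reject"},
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

The same traversal could emit code in another language or a PMML fragment instead of text rules; only the string built at each leaf would change.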
67. BigML Histograms
BigML's trees and dataset summaries use histograms with the following traits:
Streaming: Data is never kept in memory but needs only one pass over the data to capture the distribution.
Memory constrained: The less memory allocated, the lossier the compressed distribution.
Dynamic: The histogram bins adjust themselves as they observe the data.
Robust to ordered data: So it works even if the data stream is non-stationary.
Merge friendly: For parallelization and distribution.
More...: http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
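The slide lists the traits without the algorithm, so the following is only a sketch of one common way to get them: a fixed-size histogram that keeps (mean, count) bins and merges the two closest bins whenever it exceeds its budget, in the style of the Ben-Haim and Tom-Tov streaming histogram. The class and method names are hypothetical, and this is not presented as BigML's implementation.

```python
import bisect


class StreamingHistogram:
    """Fixed-memory histogram: at most `max_bins` bins of [mean, count]."""

    def __init__(self, max_bins=4):
        self.max_bins = max_bins
        self.bins = []  # kept sorted by mean

    def insert(self, value, count=1):
        """Add a value (one pass, no raw data kept), then enforce the budget."""
        means = [b[0] for b in self.bins]
        i = bisect.bisect_left(means, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [float(value), count])
        self._shrink()

    def _shrink(self):
        """Merge the two closest adjacent bins until the bin budget is met."""
        while len(self.bins) > self.max_bins:
            gaps = [
                self.bins[i + 1][0] - self.bins[i][0]
                for i in range(len(self.bins) - 1)
            ]
            i = gaps.index(min(gaps))
            (m1, c1), (m2, c2) = self.bins[i], self.bins[i + 1]
            merged_mean = (m1 * c1 + m2 * c2) / (c1 + c2)
            self.bins[i:i + 2] = [[merged_mean, c1 + c2]]

    def merge(self, other):
        """Combine two histograms, e.g. ones built on different machines."""
        combined = StreamingHistogram(self.max_bins)
        for mean, count in self.bins + other.bins:
            combined.insert(mean, count)
        return combined

    def total(self):
        return sum(count for _, count in self.bins)


left = StreamingHistogram(max_bins=3)
for x in [1, 2, 3, 10, 11, 12, 50]:
    left.insert(x)

right = StreamingHistogram(max_bins=3)
for x in [4, 5, 60]:
    right.insert(x)

both = left.merge(right)
print(left.bins, both.bins)
```

Fewer bins mean less memory and a lossier summary, while the merge step is what makes the structure friendly to parallel and distributed processing.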
68. BigML Streaming Trees
BigML's trees are:
CART: Classification & Regression Trees
Grown breadth first: So partial trees are meaningful
Built Hoeffding-style: So they consume streaming data and can split "early"
Friendly for parallelization: Can work over multiple cores or multiple computers
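The slide does not spell out the test behind "Hoeffding-style" early splits. As a hedged illustration, the standard Hoeffding-tree criterion says that after n examples the observed mean of a quantity with range R is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of its true mean with probability 1 - delta, so a node may split once the best candidate split beats the runner-up by more than epsilon. The function names and numbers below are assumptions for this sketch.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_gain: float, second_gain: float,
                    value_range: float, delta: float, n: int) -> bool:
    """Split now if the best split leads the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Information gain for a two-class problem lies in [0, 1], so the range is 1.
for n in (50, 1000):
    eps = hoeffding_bound(1.0, 1e-6, n)
    print(n, round(eps, 4), can_split_early(0.30, 0.10, 1.0, 1e-6, n))
```

With few examples the bound is wide and the node keeps reading the stream; as n grows the bound shrinks and a clear winner can be committed without seeing the rest of the data.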
69. Growing a Streaming Tree
• Each split breaks the data into subsets.
• The split should make the subsets as distinct from one another as possible.
• Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
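The two split criteria named above can be written out directly. This is a small batch sketch on toy data (the feature values and labels are hypothetical), not a streaming implementation: it scores every midpoint between consecutive distinct values and keeps the one with the highest information gain, and it shows the squared-error measure that a regression split would minimize.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


def squared_error(values):
    """Sum of squared deviations from the mean (the regression criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, labels):
    """Return (gain, threshold) for the best binary split on one numeric field."""
    pairs = sorted(zip(xs, labels))
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = information_gain(labels, left, right)
        if gain > best[0]:
            best = (gain, threshold)
    return best


incomes = [20, 35, 50, 65, 80]
approved = ["no", "no", "yes", "yes", "yes"]
gain, threshold = best_threshold(incomes, approved)
print(threshold, round(gain, 3))
```

For regression the same search would pick the threshold that minimizes `squared_error(left) + squared_error(right)` instead of maximizing the gain.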
71. Streaming Trees - Early Splits
72. Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
74. A marketplace for predictive models
75. Simple is not easy
“Any fool can make something complicated. It takes a genius to make it simple.”
― Woody Guthrie
77. Agenda
• Short intro
• The Big Data Revolution
• Demo
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
78. Back to the trenches
Gallipoli
79. Good Reading
Big Data Trends - David Feinleib
http://www.slideshare.net/bigdatalandscape/big-data-trends
Hey Graduates: Forget Plastics - It's All About Machine Learning. Jeff Bussgang
http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html
More Data or Better Models. Xavier Amatriain
http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html
Making Big Data Small. Noel Welsh
http://strataconf.com/strataeu/public/schedule/detail/25984
Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz
http://gigaom.com/2012/02/18/data-killed-the-hippo-star/
When There’s No Such Thing as Too Much Information. Steve Lohr
http://www.nytimes.com/2011/04/24/business/24unboxed.html
Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas
http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
Six Rules for Effective Forecasting. Paul Saffo
http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf
Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf