Unexpected Challenges in Large Scale Machine Learning by Charles Parker

Talk by Charles Parker (BigML) at BigMine12 at KDD12.

In machine learning, scale adds complexity. The most obvious consequence of scale is that data takes longer to process. At certain points, however, scale makes trivial operations costly, forcing us to re-evaluate algorithms in light of the complexity of those operations. Here, we discuss one important way a general large-scale machine learning setting may differ from the standard supervised classification setting and show the results of some preliminary experiments highlighting this difference. The results suggest that there is potential for significant improvement beyond obvious solutions.




Presentation Transcript

  • Unexpected Challenges in Large Scale Machine Learning Charles Parker, BigML, Inc.
  • Who Am I?
    Ph.D. from Oregon State University, 2007
    Four years with Eastman Kodak Research Labs
      − Data mining
      − Computer vision/image processing
    Currently with BigML
      − Developing a scalable, available, and beautiful platform for machine learning
      − Launched private beta in March
      − Still early days (nine employees in Europe/U.S.)
  • Brief Summary
    Introduce you to BigML
    Review some of the recent research in the large-scale ML community
    Pose some research questions that may not be on the Big Data radar
    This is all very, very preliminary (comments appreciated)
  • A Little Bit More about BigML
    Right now, only decision trees (more to come)
    Going for a wide range of users
    Goals
      − All resources can be created and retrieved via our REST API
          − Programmatic model creation
          − Downloadable, white-box models
      − A compelling front-end interface
      − Ease of use: as few clicks as possible; easy-to-understand visualizations
    A brief demo
  • Benefits Its all in the cloud − Easy to share with others − Can “deploy” the model to anywhere − Can trigger learning from anywhere (couch- based machine learning) Learns at scale − Up to 64 GB (and counting) − No specialized hardware or software required
  • Where's The Big?
    Among our users, we find that very few have data greater than 100 MB
    Why is this?
      − Takes too long?
      − Inadequate infrastructure?
      − Don't have the right algorithms?
    Maybe it's something else . . .
  • Research Direction #1: Algorithms
    Speed, speed, speed
      − Langford's Vowpal Wabbit
      − PEGaSoS
    Parallelism
      − Domingos, 2001
      − Bekkerman's tutorial at KDD '11
  • Research Direction #2: Tools
    Setting up clusters for large-scale, parallel execution of jobs
      − Hadoop
      − Storm
    Languages allowing for hardware-independent specification of parallel algorithms
      − Spark
      − Scalops
    Using the GPU
  • The Benefits of Big Data
    Tools for processing of massive data are crucial
    Often, worse learners can be “fixed” by more data
    If the hypothesis space is large enough, accuracy improvement can be log-linear even to billions of examples
    Banko and Brill, 2001
  • Is This What is Needed?
    Processing big data is crucial, but many interesting ML algorithms are trivially parallel
    Is the focus on parallelism and architecture really necessary, or just popular?
    For most jobs, multi-machine architectures are probably not necessary
    “No one ever got fired for using Hadoop on a cluster”
      − http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
    If not parallel architectures, then what?
  • Some Old Assumptions
    Much of the current work in large-scale learning makes the standard assumptions about the data:
      − That it is drawn i.i.d. from a stationary distribution
      − That linear-time algorithms are cheap
      − That super-linear-time algorithms are expensive
  • Big Data, Assumption Breaker
    Could easily be non-i.i.d.
      − Even shuffling is expensive
      − What if it's not all there?
      − For many common large datasets, the distribution is almost certainly not stationary, as the world itself isn't stationary
    The easy solutions . . .
      − Make a pass over the data to shuffle it
      − Wait for it all to be there
    . . . both break responsiveness.
  • The New Complexity
    Network latency and disk read times may dominate the cost of some learning algorithms
      − One pass over the data is expensive
      − Multiple passes may be out of the question
    Because reading the data dominates costs, we can do intensive computation in a given locality without significantly impacting cost
      − Read the data once into memory, do several hundred passes, read the next block, . . .
      − Super-linear algorithms aren't so bad?
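
The "read once, do many in-memory passes" idea on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the talk: `read_blocks` is a hypothetical stand-in for an expensive network or disk read, the data is synthetic (y = 3x + 1 plus noise), and the learner is plain stochastic gradient descent on a one-variable linear model.

```python
import random


def read_blocks(n_blocks, block_size, seed=0):
    """Hypothetical stand-in for an expensive read: yields one in-memory block at a time."""
    rng = random.Random(seed)
    for _ in range(n_blocks):
        block = []
        for _ in range(block_size):
            x = rng.uniform(-1.0, 1.0)
            y = 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)  # synthetic target, illustration only
            block.append((x, y))
        yield block


def train_blockwise(blocks, passes_per_block=200, lr=0.05):
    """Fit y ~ w*x + b with SGD, reading each block once and reusing it many times."""
    w, b = 0.0, 0.0
    for block in blocks:                    # the costly step: happens once per block
        for _ in range(passes_per_block):   # the cheap step: repeated passes in memory
            for x, y in block:
                err = (w * x + b) - y
                w -= lr * err * x
                b -= lr * err
    return w, b


w, b = train_blockwise(read_blocks(n_blocks=5, block_size=100))
print(f"w={w:.3f}, b={b:.3f}")
```

The design point is that `passes_per_block` multiplies only the in-memory work; the number of reads stays at one per block, which is the regime the slide describes.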
  • Example: The “Slow Arrival” Problem
    A lot of big data doesn't arrive all at once
      − Transactional data
      − Sensor data
      − Economic data
    We only get a chunk of the data every so often
    The distribution may be non-stationary
  • Some Simple Solutions
    Streaming algorithm, incremental updates
      − Good, but limits our options somewhat
      − Typically have to make choices about how long it takes for data to “expire” (e.g., learning rate)
    Lazy accumulators / Reservoir sampling
      − Lazy algorithms limit options
      − Reservoir sampling isn't using all data
      − Implicit expiry of data is “never”
    Window-based retraining
      − Completely forgets past data
      − Window size is an explicit choice
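
Two of the simple solutions named on this slide can be sketched in a few lines of Python. The reservoir sampler below is the standard "Algorithm R"; the window helper is one plausible reading of window-based retraining. The function names and the block representation (a list of rows) are assumptions for illustration, not taken from the talk.

```python
import random
from collections import deque


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item     # item i survives with probability k / (i + 1)
    return reservoir


def last_n_blocks(blocks, n):
    """After each arriving block, yield the training rows from the most recent n blocks."""
    window = deque(maxlen=n)            # older blocks fall off automatically
    for block in blocks:
        window.append(block)
        yield [row for blk in window for row in blk]
```

The contrast matches the slide: the reservoir never expires old data (every past row keeps the same chance of being in the sample), while the window forgets everything older than `n` blocks and makes `n` an explicit choice.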
  • Related Research #1: Theory
    Strong “Mixing Conditions”: analysis of time-series data that is asymptotically independent when it is sufficiently far apart in time
    Block-wise Stationarity: the data is drawn from the same distribution for some period of time before the distribution changes
    Concept Drift: when the concept learned by a classifier becomes invalid due to changes in the generating distributions of either the input or the output
  • Some Slow Arrival Data
    Simulated traffic data (closely mirrors some of our user data)
      − Cars per minute on a busy street
      − Predict: Number of cars that will be on the street in a given minute on a given date
    Varies by time of day
      − Rush hours have more traffic
      − Night time has little
    Varies by month of year
      − Less weekend travel in the winter
      − Less weekday travel in the summer
    Gaussian noise added to make it interesting
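
A generator with the qualitative shape described on this slide might look like the following sketch. The talk gives no formula or parameters, so every number here is hypothetical: the Gaussian-bump rush hours, the scaling factors, the noise level, and the Northern-Hemisphere mapping of winter and summer to months are all assumptions made for illustration.

```python
import math
import random
from datetime import datetime


def expected_cars(ts):
    """Noise-free cars per minute at timestamp ts (all constants are hypothetical)."""
    hour = ts.hour + ts.minute / 60.0
    base = 5.0                                                        # quiet night-time floor
    morning = 40.0 * math.exp(-((hour - 8.0) ** 2) / (2 * 1.5 ** 2))  # morning rush hour
    evening = 45.0 * math.exp(-((hour - 17.5) ** 2) / (2 * 1.5 ** 2)) # evening rush hour
    rate = base + morning + evening
    is_weekend = ts.weekday() >= 5                                    # Saturday or Sunday
    if is_weekend and ts.month in (12, 1, 2):
        rate *= 0.7                                                   # less weekend travel in winter
    if not is_weekend and ts.month in (6, 7, 8):
        rate *= 0.8                                                   # less weekday travel in summer
    return rate


def simulate_cars(ts, rng, noise_sd=3.0):
    """One noisy observation: expected rate plus Gaussian noise, clipped at zero."""
    return max(0.0, expected_cars(ts) + rng.gauss(0.0, noise_sd))


rng = random.Random(0)
print(simulate_cars(datetime(2012, 3, 14, 8, 0), rng))
```

Feeding such a generator one day or week at a time gives a stream with the seasonal non-stationarity the slide describes.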
  • Algorithm and Strawmen
    Basic algorithm:
      − Given: Classifier at time n and the data
      − When a new block arrives at n + 1, train a classifier on half of the data
      − Use the other half to estimate performance of the new classifier vs. the old
      − Resample according to the amount of “drift” detected, train new classifier
      − Repeat
    Compare with
      − Reservoir sampling
      − Training only on last block
      − Training on last four blocks
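
The slide gives only the outline of the basic algorithm, so the following Python sketch should be read as one possible interpretation rather than the method used in the talk. In particular, the drift score (relative error reduction of the new model over the old one), the rule for how much history to keep, the mean-per-key model, and all function names are assumptions introduced here for illustration.

```python
import random
from collections import defaultdict


def fit_mean_by_key(rows):
    """Toy model (an assumption): predict the mean target seen for each key."""
    sums = defaultdict(lambda: [0.0, 0])
    total, count = 0.0, 0
    for key, y in rows:
        sums[key][0] += y
        sums[key][1] += 1
        total += y
        count += 1
    fallback = total / count if count else 0.0
    means = {k: s / n for k, (s, n) in sums.items()}
    return lambda key: means.get(key, fallback)


def mean_abs_error(model, rows):
    return sum(abs(model(k) - y) for k, y in rows) / len(rows)


def adaptive_update(old_model, history, block, rng, fit=fit_mean_by_key):
    """One step: score drift on a held-out half of the new block, resample history, refit."""
    rows = list(block)
    rng.shuffle(rows)
    half = len(rows) // 2
    train_half, eval_half = rows[:half], rows[half:]
    candidate = fit(train_half)                    # classifier trained on half the block
    err_old = mean_abs_error(old_model, eval_half)
    err_new = mean_abs_error(candidate, eval_half)
    # Assumed drift score in [0, 1]: 0 if the old model is no worse than the
    # candidate, close to 1 if the old model is far worse on the new block.
    drift = max(0.0, (err_old - err_new) / err_old) if err_old > 0 else 0.0
    keep = int(round((1.0 - drift) * len(history)))
    new_history = rng.sample(history, keep) + rows  # keep less history as drift grows
    return fit(new_history), new_history, drift
```

Under this interpretation, a block that looks like the past leaves the history nearly intact (behaving like sampling over all data), while a block the old model predicts badly causes most of the history to be dropped (behaving like training on the last block only).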
  • Some Results #1: Regular Seasonal Effects
    Training on the last n blocks does well in the present but not in general
    Reservoir sampling trades a little present performance for better performance in the general case
    Adaptive resampling does more or less the same
  • Some Results #2: Dramatic Changes
    Sampling fails completely, as history outside of the current block doesn't matter
    Adaptive resampling is able to detect the uselessness of the history and maintain performance
  • Summary Processing big data quickly is important But it isnt everything! − Big data brings new problems − Some of these might be new learning settings that are scientifically interesting “Slow Arrival” data is one of these − Seems general enough to be generally interesting − Benefits from something more than the naïve approach
  • Try BigML!
    We're still in private beta, but go to www.bigml.com and request an invitation!