KnittingBoar Toronto Hadoop User Group Nov 27 2012
Speaker Notes

  • Examples of key information: selecting embryos based on 60 features. You may be asking, “why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that underlie all of these systems, not just Mahout. Some of the wording may be different, but it’s the same.
  • Yeah? OK, let’s look at doing ETL in Hadoop and then running the model-construction phase in another tool like R. No? Then we need to think of a way to either refactor the algorithm into MapReduce, or partition the data such that a reducer can work on each subset.
  • Frequent itemset mining: what appears together.
  • “What do other people with similar tastes like?” “Strength of associations.”
  • “say hello to my leeeeetle friend….”
  • Vorpal: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
  • “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
  • At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
  • Bottou’s 2010 paper is similar to Xu, 2010.
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • Some of these are in progress towards being ready on YARN, some are not; we wanted to focus on OLR, not the framework, for now.
  • POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used that as a base point.
  • Three major costs of BSP-style computations: maximum unit compute time, cost of global communication, and cost of the barrier sync at the end of each superstep.
  • Multi-dimensional: you need to constantly think about the Client, the Master, and the Worker, how they interact, and the implications of failures, etc.
  • Basecamp: use the story of how we get to base camp to see how to climb some more.

Presentation Transcript

  • KNITTING BOAR Machine Learning, Mahout, and Parallel Iterative Algorithms Josh Patterson, Principal Solutions Architect
  • ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA) > Twitter: @jpatanooga > Email: josh@cloudera.com
  • ✛ Introduction to Machine Learning ✛ Mahout ✛ Knitting Boar and YARN ✛ Parting Thoughts
  • Introduction to MACHINE LEARNING
  • ✛ What is Data Mining? > “the process of extracting patterns from data” ✛ Why are we interested in Data Mining? > Raw data essentially useless ∗ Data is simply recorded facts ∗ Information is the patterns underlying the data ✛ Machine Learning > Algorithms for acquiring structural descriptions from data “examples” ∗ Process of learning “concepts”
  • ✛ Information Retrieval > information science, information architecture, cognitive psychology, linguistics, and statistics ✛ Natural Language Processing > grounded in machine learning, especially statistical machine learning ✛ Statistics > Math and stuff ✛ Machine Learning > Considered a branch of artificial intelligence
  • ✛ ETL ✛ Joining multiple disparate data sources ✛ Filtering data ✛ Aggregation ✛ Cube materialization (“Descriptive Statistics”)
  • ✛ Don’t always assume you need “scale” and parallelization > Try it out on a single machine first > See if it becomes a bottleneck! ✛ Will the data fit in memory on a beefy machine? ✛ We can always use the constructed model back in MapReduce to score a ton of new data (a sketch of this pattern follows below)
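To make that last point concrete, the sketch below shows the pattern: build the model on one beefy machine, then score new records at scale with a map-only Hadoop job. The Model class here is a hypothetical stand-in for a serialized, pre-trained model, not a real library API.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only scoring job: the model was trained offline on a single machine.
public class ScoringMapper extends Mapper<LongWritable, Text, Text, Text> {

  /** Hypothetical stand-in for a model trained offline; not a real library class. */
  static class Model {
    static Model load(String path) throws IOException {
      // Deserialize the model from HDFS or the local filesystem here.
      return new Model();
    }
    double classify(String record) {
      return 0.0; // placeholder scoring logic
    }
  }

  private Model model;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Load the serialized model from a path shipped in the job configuration.
    model = Model.load(context.getConfiguration().get("model.path"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Score each input record with the pre-built model; no reducer needed.
    double score = model.classify(value.toString());
    context.write(value, new Text(Double.toString(score)));
  }
}
```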
  • ✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf > Looks to study data with descriptive statistics in the hopes of building models for predictive analytics ✛ Does the majority of ML work via custom Pig integrations > Pipeline is very “Pig-centric” > Example: https://github.com/tdunning/pig-vector > They mostly use SGD and ensemble methods, which are conducive to large-scale data mining ✛ Questions they try to answer > Is this tweet spam? > What star rating might this user give this movie?
  • ✛ Data collection performed with Flume ✛ Data cleansing / ETL performed with Hive or Pig ✛ ML work performed with > SAS > SPSS > R > Mahout
  • Introduction to MAHOUT
  • ✛ Classification > “Fraud detection” ✛ Recommendation > “Collaborative Filtering” ✛ Clustering > “Segmentation” ✛ Frequent Itemset Mining
  • ✛ Stochastic Gradient Descent > Single process > Logistic Regression Model Construction ✛ Naïve Bayes > MapReduce-based > Text Classification ✛ Random Forests > MapReduce-based
  • ✛ An algorithm that looks at a user’s past actions and suggests > Products > Services > People ✛ Advertisement > Cloudera has a great Data Science training course on this topic > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
  • ✛ Cluster words across docs to identify topics ✛ Latent Dirichlet Allocation
  • ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time-Consuming > The “need for speed” > “More data beats a cleverer algorithm”
  • Introducing KNITTING BOAR
  • ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop/YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  • ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ sigmoid of the dot product of the parameter vector and the example (see the sketch below) [Slide diagram: Training Data → SGD → Model parameters]
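As a concrete reading of that slide, here is a minimal plain-Java sketch of the sigmoid prediction and one SGD update for logistic regression. This is illustrative only and is not Mahout’s actual OnlineLogisticRegression API.

```java
// Plain-Java sketch of logistic regression trained by SGD (not Mahout's API).
public class SgdSketch {

  // Prediction: sigmoid of the dot product of the parameter vector and the example.
  static double predict(double[] beta, double[] x) {
    double dot = 0.0;
    for (int i = 0; i < x.length; i++) {
      dot += beta[i] * x[i];
    }
    return 1.0 / (1.0 + Math.exp(-dot));
  }

  // One stochastic gradient step on a single (x, y) example, with y in {0, 1}.
  // For logistic regression the log-loss gradient reduces to (y - p) * x.
  static void update(double[] beta, double[] x, double y, double learningRate) {
    double error = y - predict(beta, x);
    for (int i = 0; i < beta.length; i++) {
      beta[i] += learningRate * error * x[i];
    }
  }
}
```

The real Mahout implementation layers refinements such as learning-rate annealing and regularization on top of this core update; the sketch keeps only the math the slide describes.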
  • Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  • Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald, 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel, 2010 > Optimal Distributed Online Prediction Using Mini-Batches
  • [Slide diagram: input flows through rows of parallel processors in supersteps (Superstep 1 ≈ Map, Superstep 2 ≈ Reduce, …) until output is produced]
  • “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” – Lin, 2012
  • ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of the merged parameter vector
  • ✛ Each worker is given a split of the total dataset > Similar to a map task ✛ Uses a modified OLR > processes N samples in a batch (a subset of its split) ✛ Batched gradient-accumulation updates are sent to the master node > The gradient influences future model vectors towards better predictions (see the worker sketch below)
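A rough sketch of that worker loop, reusing the SgdSketch class above. The MasterClient interface and the record layout (label first, then features) are assumptions for illustration, not KnittingBoar’s actual classes.

```java
import java.util.Arrays;

// Sketch of a worker processing its split in batches (hypothetical names).
public class WorkerSketch {

  /** Hypothetical channel to the master node. */
  interface MasterClient {
    void sendUpdate(double[] localVector);  // ship the locally updated vector
    double[] awaitGlobalVector();           // block until the merged vector arrives
  }

  static void processSplit(Iterable<double[]> split, double[] beta,
                           int batchSize, double learningRate, MasterClient master) {
    int seen = 0;
    for (double[] record : split) {
      double y = record[0];                                      // assumed layout: label first,
      double[] x = Arrays.copyOfRange(record, 1, record.length); // then features
      SgdSketch.update(beta, x, y, learningRate);                // local OLR-style update
      if (++seen % batchSize == 0) {
        master.sendUpdate(beta);                  // after each batch, report to the master...
        double[] global = master.awaitGlobalVector();
        System.arraycopy(global, 0, beta, 0, beta.length); // ...and adopt the global vector
      }
    }
  }
}
```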
  • ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces a new global parameter vector > By averaging the workers’ vectors ✛ Sends the update to all workers > Workers replace their local parameter vector with the new global parameter vector (see the merge sketch below)
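The master’s merge step described above can be sketched as simple parameter averaging (hypothetical names again; see the KnittingBoar source for the real implementation):

```java
import java.util.List;

// Sketch of the master's merge: average the workers' parameter vectors.
public class MasterSketch {

  static double[] merge(List<double[]> workerVectors) {
    int dim = workerVectors.get(0).length;
    double[] global = new double[dim];
    for (double[] w : workerVectors) {
      for (int i = 0; i < dim; i++) {
        global[i] += w[i];       // accumulate each worker's contribution
      }
    }
    for (int i = 0; i < dim; i++) {
      global[i] /= workerVectors.size(); // new global parameter vector
    }
    return global; // sent back to every worker, replacing their local vectors
  }
}
```

Averaging, rather than summing, keeps the merged vector on the same scale as each worker’s local vector.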
  • [Slide diagram: Knitting Boar’s POLR. Training Data is divided into Split 1 … Split N; Workers 1 … N each run OnlineLogisticRegression on a split to produce a Partial Model; the Master merges the partial models into the Global Model]
  • [Slide chart: Input Size vs. Processing Time, comparing single-process OLR with parallel POLR]
  • Knitting Boar PARTING THOUGHTS
  • ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
  • ✛ Knitting Boar > https://github.com/jpatanooga/KnittingBoar > 100% Java > ASF 2.0 Licensed > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > https://github.com/emsixteeen/IterativeReduce > 100% Java > ASF 2.0 Licensed
  • ✛ Machine Learning is hard > Don’t believe the hype > Do the work ✛ Model development takes time > Lots of iterations > Speed is key here Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
  • ✛ Strata / Hadoop World 2012 Slides > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html ✛ Mahout’s SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  • ✛ Langford > http://hunch.net/~vw/ ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068
  • ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg