KnittingBoar Toronto Hadoop User Group Nov 27 2012


Published in: Technology
  • Examples of key information: selecting embryos based on 60 features. You may be asking, “Why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that underlie all of these systems, not just Mahout. Some of the wording may be different, but it’s the same.
  • Yeah? OK, let’s look at doing ETL in Hadoop and then running the model construction phase in another tool like R. No? Then we need to think of a way to either refactor the algorithm into MapReduce or partition the data such that a reducer can work on each subset.
  • Frequent itemset mining – what appears together
  • “What do other people w/ similar tastes like?” “Strength of associations.”
  • “say hello to my leeeeetle friend….”
  • Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
  • “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
  • At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
  • Bottou similar to Xu2010 in the 2010 paper
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
  • POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used that as a base point.
  • Three major costs of BSP-style computations: max unit compute time, cost of global communication, and cost of barrier sync at the end of a superstep.
  • Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  • Basecamp: use story of how we get to basecamp to see how to climb some more
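The 6-hour figure in the disk-bandwidth note above can be checked with simple arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function name is made up for illustration.

```python
def full_scan_hours(capacity_bytes, throughput_bytes_per_sec):
    """Hours needed to read an entire disk sequentially at a fixed throughput."""
    seconds = capacity_bytes / throughput_bytes_per_sec
    return seconds / 3600.0


# Assumed figures from the note: a 2 TB drive read at 100 MB/s.
hours = full_scan_hours(2e12, 100e6)
print(f"{hours:.2f} hours")
```

Under these assumptions the scan takes roughly 5.6 hours, which the note rounds up to 6.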

    1. 1. KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms. Josh Patterson, Principal Solutions Architect
    2. 2. ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) > Twitter: @jpatanooga > Email:
    3. 3. ✛ Introduction to Machine Learning ✛ Mahout ✛ Knitting Boar and YARN ✛ Parting Thoughts
    4. 4. Introduction to MACHINE LEARNING
    5. 5. ✛ What is Data Mining? > “the process of extracting patterns from data” ✛ Why are we interested in Data Mining? > Raw data essentially useless ∗ Data is simply recorded facts ∗ Information is the patterns underlying the data ✛ Machine Learning > Algorithms for acquiring structural descriptions from data “examples” ∗ Process of learning “concepts”
    6. 6. ✛ Information Retrieval > information science, information architecture, cognitive psychology, linguistics, and statistics ✛ Natural Language Processing > grounded in machine learning, especially statistical machine learning ✛ Statistics > Math and stuff ✛ Machine Learning > Considered a branch of artificial intelligence
    7. 7. ✛ ETL ✛ Joining multiple disparate data sources ✛ Filtering data ✛ Aggregation ✛ Cube materialization “Descriptive Statistics”
    8. 8. ✛ Don’t always assume you need “scale” and parallelization > Try it out on a single machine first > See if it becomes a bottleneck! ✛ Will the data fit in memory on a beefy machine? ✛ We can always use the constructed model back in MapReduce to score a ton of new data
    9. 9. ✛ MOD2012.pdf > Looks to study data with descriptive statistics in the hopes of building models for predictive analytics ✛ Does the majority of ML work via custom Pig integrations > Pipeline is very “Pig-centric” > Example: > They mostly use SGD and ensemble methods, which are conducive to large-scale data mining ✛ Questions they try to answer > Is this tweet spam? > What star rating might this user give this movie?
    10. 10. ✛ Data collection performed with Flume ✛ Data cleansing / ETL performed with Hive or Pig ✛ ML work performed with > SAS > SPSS > R > Mahout
    11. 11. Introduction to MAHOUT
    12. 12. ✛ Classification > “Fraud detection” ✛ Recommendation > “Collaborative Filtering” ✛ Clustering > “Segmentation” ✛ Frequent Itemset Mining Copyright 2010 Cloudera Inc. All rights reserved
    13. 13. ✛ Stochastic Gradient Descent > Single process > Logistic Regression Model Construction ✛ Naïve Bayes > MapReduce-based > Text Classification ✛ Random Forests > MapReduce-based
    14. 14. ✛ An algorithm that looks at a user’s past actions and suggests > Products > Services > People ✛ Advertisement > Cloudera has a great Data Science training course on this topic > troduction_to_data_science_- _building_recommender_systems.html
    15. 15. ✛ Cluster words across docs to identify topics ✛ Latent Dirichlet Allocation
    16. 16. ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
    17. 17. Introducing KNITTING BOAR
    18. 18. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
    19. 19. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential parameter [Diagram labels: Training Data, SGD, Model]
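The training and prediction steps on this slide can be sketched in a few lines. This is a minimal single-process illustration, not Mahout's OnlineLogisticRegression: the function names, the fixed learning rate, and the toy data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """Prediction: sigmoid of the parameter vector (dot) the example."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def sgd_step(weights, features, label, learning_rate):
    """One stochastic gradient step on the (convex) logistic loss."""
    error = label - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]


def train(examples, num_features, epochs=200, learning_rate=0.1):
    """Training: repeatedly sweep the data, updating after every example."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for features, label in examples:
            weights = sgd_step(weights, features, label, learning_rate)
    return weights


# Toy data (hypothetical): first feature is a constant bias term.
examples = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
weights = train(examples, num_features=2)
print(predict(weights, [1.0, 0.0]), predict(weights, [1.0, 4.0]))
```

The update touches one example at a time, which is why the procedure is naturally sequential and why parallelizing it takes extra design work.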
    20. 20. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
    21. 21. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel 2010 > Optimal Distributed Online Prediction Using Mini-Batches
    22. 22. [Diagram: Input → Superstep 1 (Processors; Map tasks) → Superstep 2 (Processors; Reduce tasks) → … → Output]
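The speaker notes name three costs of BSP-style computation: the slowest unit's compute time, global communication, and the barrier sync at the end of a superstep. A rough sketch of how they combine, assuming the three terms simply add per superstep as in the usual BSP cost model; the function names and the numbers are hypothetical.

```python
def superstep_cost(worker_compute_times, communication_cost, barrier_cost):
    """Cost of one superstep: the slowest worker gates everyone else,
    then communication and barrier synchronization are paid on top."""
    return max(worker_compute_times) + communication_cost + barrier_cost


def total_cost(supersteps):
    """Total cost of a run, given (compute_times, comm, barrier) per superstep."""
    return sum(superstep_cost(times, comm, barrier)
               for times, comm, barrier in supersteps)


# Hypothetical timings in seconds for a two-superstep job with three workers.
run = [
    ([2.0, 5.0, 3.0], 1.0, 0.5),
    ([4.0, 4.0, 4.0], 1.0, 0.5),
]
print(total_cost(run))
```

The first superstep costs more than the second even though its workers do less work in total, because one straggler holds up the barrier.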
    23. 23. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” –– Lin, 2012
    24. 24. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of merged parameter vector
    25. 25. ✛ Each given a split of the total dataset > Similar to a map task ✛ Using a modified OLR > Process N samples in a batch (subset of split) ✛ Batched gradient accumulation updates sent to master node > Gradient influences future model vectors towards better predictions
    26. 26. ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector
    27. 27. [Diagram: OnlineLogisticRegression (Training Data → OnlineLogisticRegression → Model) compared with Knitting Boar’s POLR (Splits 1 to 3 → Worker 1, Worker 2, … Worker N, each producing a Partial Model → Master → Global Model)]
    28. 28. [Chart: Input Size vs Processing Time, comparing OLR and POLR; x-axis ticks from 4.1 to 41, y-axis from 0 to 300]
    29. 29. Knitting Boar PARTING THOUGHTS
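The worker and master roles described on the preceding slides can be simulated in a single process. This is a sketch under stated assumptions, not Knitting Boar's code: every name here is hypothetical, workers run one after another instead of in parallel, and each worker sends its updated local parameter vector, which the master averages.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def worker_batch(global_weights, batch, learning_rate):
    """Worker: start from the global vector, run SGD over one batch of its split."""
    local = list(global_weights)
    for features, label in batch:
        error = label - sigmoid(sum(w * x for w, x in zip(local, features)))
        local = [w + learning_rate * error * x for w, x in zip(local, features)]
    return local


def master_average(worker_vectors):
    """Master: merge the partial models by averaging them element-wise."""
    count = len(worker_vectors)
    return [sum(column) / count for column in zip(*worker_vectors)]


def train_parallel(splits, num_features, batch_size, supersteps, learning_rate=0.1):
    """Each superstep: every worker processes a batch, the master averages the
    results, and the new global vector replaces every worker's local copy."""
    global_weights = [0.0] * num_features
    for step in range(supersteps):
        partial_models = []
        for split in splits:
            start = (step * batch_size) % len(split)
            batch = split[start:start + batch_size]
            partial_models.append(worker_batch(global_weights, batch, learning_rate))
        global_weights = master_average(partial_models)
    return global_weights


# Toy data (hypothetical), divided into two splits; first feature is a bias term.
splits = [
    [([1.0, 0.0], 0), ([1.0, 3.0], 1)],
    [([1.0, 1.0], 0), ([1.0, 4.0], 1)],
]
model = train_parallel(splits, num_features=2, batch_size=2, supersteps=300)
print(model)
```

Averaging only at batch boundaries keeps communication with the master infrequent, at the cost of workers briefly training on slightly stale parameters.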
    30. 30. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
    31. 31. ✛ Knitting Boar > > 100% Java > ASF 2.0 Licensed > Quick Start ∗ ✛ IterativeReduce > > 100% Java > ASF 2.0 Licensed
    32. 32. ✛ Machine Learning is hard > Don’t believe the hype > Do the work ✛ Model development takes time > Lots of iterations > Speed is key here Picture:
    33. 33. ✛ Strata / Hadoop World 2012 Slides > es/library/hadoopworld/strata-hadoop-world-2012- knitting-boar_slide_deck.html ✛ Mahout’s SGD implementation > gression.pdf ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! >
    34. 34. ✛ Langford > ✛ McDonald, 2010 >
    35. 35. ✛ photos-of-mount-everest-pictures.jpg ✛ medium-large/-say-hello-to-my-little-friend--luis- ludzska.jpg ✛ 2001_space_odyssey_-_5.jpg