Examples of key information: selecting embryos based on 60 features.
You may be asking, “Why aren’t we talking about Mahout?”
What we want to do here is look at the fundamentals that underlie all of these systems, not just Mahout.
Some of the wording may be different, but the ideas are the same.
Yeah? OK, let’s look at doing ETL in Hadoop and then running the model construction phase in another tool like R.
No? Then we need to think of a way to either refactor the algorithm into MapReduce, or partition the data such that a reducer can work on each subset.
Frequent itemset mining – what appears together
“What do other people with similar tastes like?”
“Strength of associations”
“say hello to my leeeeetle friend….”
Vowpal Wabbit: doesn’t natively run on Hadoop
Spark: Scala, overhead, integration issues
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010)
SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases.
SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes roughly 6 hours to read the contents of a single hard disk.
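The arithmetic behind that figure can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly sequential read at full throughput; the function name is made up for illustration.

```python
def full_scan_hours(capacity_bytes, throughput_bytes_per_s):
    """Hours needed to read a whole disk once at a constant throughput."""
    seconds = capacity_bytes / throughput_bytes_per_s
    return seconds / 3600

# Assumption: decimal units, 2 TB disk, 100 MB/s sustained sequential read.
hours = full_scan_hours(2 * 10**12, 100 * 10**6)
print(f"{hours:.2f} hours")  # prints "5.56 hours"
```

That is 20,000 seconds, which rounds up to the "6 hours" quoted in the note.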
Bottou, in the 2010 paper, is similar to Xu 2010.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning).
No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Some of these are on their way to being ready on YARN and some are not; we wanted to focus on OLR (online logistic regression) rather than on a framework for now.
POLR: Parallel Online Logistic Regression
Talking points:
We wanted to start with a tool known to the Hadoop community, with expected characteristics.
Mahout’s SGD is well known, so we used it as a base point.
3 major costs of BSP-style computations:
Max unit compute time
Cost of global communication
Cost of barrier sync at end of superstep
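One way to read those three costs is as terms that add up per superstep. The sketch below is a toy cost model under that assumption, not a measurement of any real system; the function name, parameter names, and millisecond figures are all hypothetical.

```python
def superstep_cost(worker_compute_ms, communication_ms, barrier_ms):
    """Toy BSP cost model for one superstep.

    The slowest worker sets the compute cost, because every worker
    has to wait at the barrier until the last one finishes.
    """
    return max(worker_compute_ms) + communication_ms + barrier_ms

# Hypothetical numbers: three workers, one of them a straggler.
cost = superstep_cost([400, 550, 480], communication_ms=120, barrier_ms=30)
print(cost)  # prints 700
```

In this model, speeding up the two faster workers changes nothing; only the straggler, the communication step, or the barrier affect the total.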
Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
Basecamp: use story of how we get to basecamp to see how to climb some more
Transcript
1.
KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson
Principal Solutions Architect
2.
✛ Josh Patterson
  > Master’s Thesis: self-organizing mesh networks
    ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
  > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
  > Twitter: @jpatanooga
  > Email: josh@cloudera.com
3.
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
5.
✛ What is Data Mining?
  > “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
  > Raw data essentially useless
    ∗ Data is simply recorded facts
    ∗ Information is the patterns underlying the data
✛ Machine Learning
  > Algorithms for acquiring structural descriptions from data “examples”
    ∗ Process of learning “concepts”
6.
✛ Information Retrieval
  > Information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
  > Grounded in machine learning, especially statistical machine learning
✛ Statistics
  > Math and stuff
✛ Machine Learning
  > Considered a branch of artificial intelligence
8.
✛ Don’t always assume you need “scale” and parallelization
  > Try it out on a single machine first
  > See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data
9.
✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
  > Looks to study data with descriptive statistics in the hopes of building models for predictive analytics
✛ Does majority of ML work via custom Pig integrations
  > Pipeline is very “Pig-centric”
  > Example: https://github.com/tdunning/pig-vector
  > They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
  > Is this tweet spam?
  > What star rating might this user give this movie?
10.
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
  > SAS
  > SPSS
  > R
  > Mahout
12.
✛ Classification
  > “Fraud detection”
✛ Recommendation
  > “Collaborative Filtering”
✛ Clustering
  > “Segmentation”
✛ Frequent Itemset Mining
Copyright 2010 Cloudera Inc. All rights reserved
13.
✛ Stochastic Gradient Descent
  > Single process
  > Logistic Regression Model Construction
✛ Naïve Bayes
  > MapReduce-based
  > Text Classification
✛ Random Forests
  > MapReduce-based
14.
✛ An algorithm that looks at a user’s past actions and suggests
  > Products
  > Services
  > People
✛ Advertisement
  > Cloudera has a great Data Science training course on this topic
  > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
15.
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
16.
✛ Why Machine Learning?
  > Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
  > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time Consuming
  > The “Need for speed”
  > “More data beats a cleverer algorithm”
18.
✛ Parallelize Mahout’s Stochastic Gradient Descent
  > With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
  > Wanted a first-class Hadoop YARN citizen
  > Work through dev progressions towards a stable state
  > Worry about “frameworks” later
19.
✛ Training
  > Simple gradient descent procedure
  > Loss function needs to be convex
✛ Prediction
  > Logistic Regression:
    ∗ Sigmoid function using parameter vector (dot) example as exponential parameter
Diagram labels: Training Data, SGD, Model
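The training and prediction steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration of logistic regression trained with SGD, not Mahout’s OnlineLogisticRegression; the function names, learning rate, epoch count, and toy data are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, example):
    """Prediction: sigmoid of (parameter vector dot example)."""
    return sigmoid(sum(w * x for w, x in zip(weights, example)))

def sgd_step(weights, example, label, learning_rate):
    """One SGD update on the (convex) logistic loss for a single example."""
    error = label - predict(weights, example)
    return [w + learning_rate * error * x for w, x in zip(weights, example)]

def train(data, num_features, epochs=200, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * num_features
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)  # visit examples in a random order each epoch
        for example, label in shuffled:
            weights = sgd_step(weights, example, label, learning_rate)
    return weights

# Hypothetical toy data: first feature is a constant bias term,
# second feature is small for label 0 and large for label 1.
toy = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
weights = train(toy, num_features=2)
print(predict(weights, [1.0, 4.0]), predict(weights, [1.0, 0.0]))
```

Each update touches only one example, which is why the procedure runs as a single process over a stream of training data.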
20.
Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
  > Presents algorithmic challenges when combined with large data sets
  > Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
21.
Distributed Learning Strategies
✛ Langford, 2007
  > Vowpal Wabbit
✛ McDonald 2010
  > Distributed Training Strategies for the Structured Perceptron
✛ Dekel 2010
  > Optimal Distributed Online Prediction Using Mini-Batches
23.
“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” (Lin, 2012)
24.
✛ Parallel Iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps global copy of merged parameter vector
25.
✛ Each given a split of the total dataset
  > Similar to a map task
✛ Using a modified OLR
  > Process N samples in a batch (subset of split)
✛ Batched gradient accumulation updates sent to master node
  > Gradient influences future model vectors toward better predictions
26.
✛ Accumulates gradient updates
  > From batches of worker OLR runs
✛ Produces new global parameter vector
  > By averaging workers’ vectors
✛ Sends update to all workers
  > Workers replace local parameter vector with new global parameter vector
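The worker/master exchange described on the last two slides can be sketched as a single-process simulation. This is an illustration under stated assumptions, not Knitting Boar’s code: workers run one after another instead of in parallel, each worker sends its whole parameter vector to the master, and every name, the batch size, and the toy splits are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_batch(weights, batch, learning_rate):
    """Worker side: run SGD over one batch, starting from the global vector."""
    local = list(weights)
    for example, label in batch:
        error = label - sigmoid(sum(w * x for w, x in zip(local, example)))
        local = [w + learning_rate * error * x for w, x in zip(local, example)]
    return local

def average(vectors):
    """Master side: element-wise mean of the workers' parameter vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def parallel_train(splits, num_features, batch_size, learning_rate=0.1, passes=200):
    global_weights = [0.0] * num_features
    longest = max(len(split) for split in splits)
    for _ in range(passes):
        for start in range(0, longest, batch_size):
            # Each worker processes the next batch of its own split.
            worker_vectors = [
                train_batch(global_weights, split[start:start + batch_size], learning_rate)
                for split in splits
            ]
            # Master averages, then every worker restarts from the new global vector.
            global_weights = average(worker_vectors)
    return global_weights

# Hypothetical toy data: two splits, bias feature first.
splits = [
    [([1.0, 0.0], 0), ([1.0, 3.0], 1)],
    [([1.0, 1.0], 0), ([1.0, 4.0], 1)],
]
model = parallel_train(splits, num_features=2, batch_size=2)
```

The averaging step is the barrier: no worker starts its next batch until the master has merged every worker’s vector and sent the result back.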
27.
[Figure: OnlineLogisticRegression compared with Knitting Boar’s POLR. Labels, OnlineLogisticRegression side: Training Data, OnlineLogisticRegression, Model. Labels, POLR side: Split 1, Split 2, Split 3; Worker 1, Worker 2, … Worker N; Partial Model (one per worker); Master; Global Model.]
30.
✛ Parallel SGD
  > The Boar is temperamental, experimental
    ∗ Linear speedup (roughly)
✛ Developing YARN Applications
  > More complex than just MapReduce
  > Requires lots of “plumbing”
✛ IterativeReduce
  > Great native-Hadoop way to implement algorithms
  > Easy to use and well integrated
32.
✛ Machine Learning is hard
  > Don’t believe the hype
  > Do the work
✛ Model development takes time
  > Lots of iterations
  > Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
33.
✛ Strata / Hadoop World 2012 Slides
  > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
  > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
  > http://arxiv.org/pdf/1209.2191v1.pdf