Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab


This presentation characterizes machine learning computations and brings out the deficiencies of Hadoop 1.0. It gives the motivation for Hadoop YARN and a brief view of the YARN architecture. It illustrates the power of specialized processing frameworks over YARN, such as Spark and GraphLab. In short, Hadoop YARN lets your data stay in HDFS while specialized processing frameworks process it in various ways.

Published in: Technology, Education

  1. Beyond Hadoop Map-Reduce. Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus
  2. Contents
     • Big Data Computations
     • Hadoop 2.0 (Hadoop YARN)
     • Berkeley data analytics stack
       • BDAS Spark
       • BDAS Discretized Streams
     • Real-time analytics with Storm
     • PMML Scoring for Naïve Bayes
       • PMML Primer
       • Naïve Bayes Primer
  3. Big Data Computations
     Computations/Operations:
     • Giant 1 (simple stats) is perfect for Hadoop 1.0.
     • Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient. Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. An example is the social group-first approach for consumer churn analysis [2].
     • Interactive/on-the-fly data processing: Storm.
     • OLAP (data cube operations): Dremel/Drill.
     • Data sets that are not embarrassingly parallel? Deep learning with artificial neural networks: machine vision from Google, speech analysis from Microsoft.
     • Giant 5 (graph processing): GraphLab, Pregel, Giraph.
     [1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
     [2] Yossi Richter, Elad Yom-Tov, and Noam Slonim. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, pp. 732-741.
  4. Hadoop YARN Requirements or 1.0 shortcomings
     • R1: Scalability
     • R2: Multi-tenancy
       • Single cluster limitation
       • Addressed by Hadoop-on-Demand
       • Security, quotas
     • R3: Locality awareness
     • R4: Shared cluster utilization
       • Shuffle of records
       • Hogging by users
       • Typed slots
     • R5: Reliability/Availability
       • Job Tracker bugs
     • R6: Iterative machine learning
     Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator", ACM Symposium on Cloud Computing, Oct 2013, ACM Press.
  5. Hadoop YARN Architecture
  6. YARN Internals
     • Application Master
       • Sends ResourceRequests to the YARN Resource Manager (RM).
       • A ResourceRequest captures the number of containers, the resources per container and the locality preferences.
     • YARN RM
       • Generates tokens and containers.
       • Has a global view of the cluster: monolithic scheduling.
     • Node Manager
       • Monitors node health and advertises available resources to the RM through heartbeats.
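The interaction on the YARN Internals slide can be sketched as a toy model. This is a minimal illustration in plain Python, not the YARN API: the class names, fields and the "preferred node first" placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceRequest:
    """Toy version of what an Application Master sends to the RM (hypothetical fields)."""
    num_containers: int   # how many containers are wanted
    memory_mb: int        # resources per container
    preferred_node: str   # locality preference


@dataclass
class ResourceManager:
    """Toy RM with a global view of free memory per node."""
    free_mb: dict = field(default_factory=dict)

    def heartbeat(self, node: str, available_mb: int) -> None:
        # A Node Manager advertises its available resources.
        self.free_mb[node] = available_mb

    def allocate(self, req: ResourceRequest) -> list:
        # Assumed rule: try the preferred node first, then fall back to the others.
        granted = []
        nodes = sorted(self.free_mb, key=lambda n: n != req.preferred_node)
        for node in nodes:
            while (len(granted) < req.num_containers
                   and self.free_mb[node] >= req.memory_mb):
                self.free_mb[node] -= req.memory_mb
                granted.append(node)
        return granted


rm = ResourceManager()
rm.heartbeat("node-a", 4096)
rm.heartbeat("node-b", 8192)
containers = rm.allocate(
    ResourceRequest(num_containers=5, memory_mb=2048, preferred_node="node-b"))
print(containers)
```

Four containers fit on the preferred node and the fifth spills over to the other node, which is the kind of decision a scheduler with a global cluster view can make.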
  7. Berkeley Big-data Analytics Stack (BDAS)
  8. BDAS: Spark
     Transformations/Actions and their descriptions [MZ12]:
     • Map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
     • Filter(function f2): Select elements of the RDD that return true when passed through f2.
     • flatMap(function f3): Similar to Map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
     • Union(RDD r1): Returns the union of the RDD r1 with the self RDD.
     • Sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
     • groupByKey(noTasks): Can only be invoked on key-value paired data. Returns data grouped by key. The number of parallel tasks is given as an argument (default is 8).
     • reduceByKey(function f4, noTasks): Aggregates the result of applying f4 on elements with the same key. The number of parallel tasks is the second argument.
     • Join(RDD r2, noTasks): Joins RDD r2 with self. Computes all possible pairs for a given key.
     • groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
     • sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
     • Reduce(function f5): Aggregates the result of applying function f5 on all elements of the self RDD.
     • Collect(): Returns all elements of the RDD as an array.
     • Count(): Counts the number of elements in the RDD.
     • take(n): Gets the first n elements of the RDD.
     • First(): Equivalent to take(1).
     • saveAsTextFile(path): Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
     • saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
     • foreach(function f6): Runs f6 in parallel on elements of the self RDD.
     [MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
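To make the semantics of a few of these operations concrete, here is a small sketch in plain Python over ordinary lists. It is not Spark code and does not use the Spark API: the helper names `flat_map`, `group_by_key` and `reduce_by_key` are hypothetical stand-ins, and nothing here runs in parallel.

```python
from collections import defaultdict
from functools import reduce


def flat_map(f, data):
    """Like flatMap: f returns a sequence, and the sequences are concatenated."""
    return [y for x in data for y in f(x)]


def group_by_key(pairs):
    """Like groupByKey: collect the values of all pairs that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(f, pairs):
    """Like reduceByKey: fold the values of each key with f."""
    return {key: reduce(f, values) for key, values in group_by_key(pairs).items()}


lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)                  # one line maps to many words
pairs = [(w, 1) for w in words]                     # Map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # word count
long_words = [w for w in words if len(w) > 2]       # Filter
print(counts, long_words)
```

The word count is the usual chain of flatMap, Map and reduceByKey; in Spark the same chain would be applied to an RDD and evaluated across the cluster.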
  9. BDAS: Use Cases
     • Ooyala
       • Uses Cassandra for video data personalization.
       • Pre-compute aggregates vs. on-the-fly queries.
       • Moved to Spark for ML and computing views.
       • Moved to Shark for on-the-fly queries: C* OLAP aggregate queries on Cassandra 130 secs, 60 ms in Spark.
     • Conviva
       • Uses Hive for repeatedly running ad-hoc queries on video data.
       • Optimized ad-hoc queries using Spark RDDs: found Spark is 30 times faster than Hive.
       • ML for connection analysis and video streaming optimization.
     • Yahoo
       • Advertisement targeting: 30K nodes on Hadoop YARN.
         • Hadoop: batch processing
         • Spark: iterative processing
         • Storm: on-the-fly processing
       • Content recommendation: collaborative filtering.
  10. [No text on this slide]
  11. Real-time Analytics: R over Storm
  12. Real-time Analytics UC 1: Internet Traffic Analysis
  13. Real-time Analytics UC 2: Arrhythmia Detection
  14. GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
      • Goals: targeted at machine learning.
        • Model graph dependencies, be asynchronous, iterative, dynamic.
      • Data is associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
      • Update functions: live on each vertex.
        • Transform data in the scope of the vertex.
        • Can choose to trigger neighbours (for example, only if the rank changes drastically).
        • Run asynchronously till convergence: no global barrier.
      • Consistency is important in ML algorithms (some do not even converge when there are inconsistent updates, e.g. collaborative filtering).
        • GraphLab provides varying levels of consistency. Parallelism vs. consistency.
      • Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
        • Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR. On distributed GraphLab, only 0.3% of Hadoop execution time.
      [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
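The update-function idea on this slide (transform the vertex's own data, trigger neighbours only on a large change, run until nothing is scheduled) can be sketched with a single-threaded PageRank. This is a plain-Python illustration, not the GraphLab API; the function name, the work queue and the tolerance `tol` are assumptions for the example, and it assumes every vertex has at least one outgoing edge.

```python
from collections import deque


def async_pagerank(out_edges, damping=0.85, tol=1e-4):
    """Run a per-vertex update function until no vertex is scheduled."""
    vertices = list(out_edges)
    in_edges = {v: [] for v in vertices}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)       # scheduler: vertices waiting for an update
    scheduled = set(vertices)

    while queue:                  # no global barrier between "iterations"
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: read the scope (in-neighbours), write own data.
        new_rank = (1 - damping) + damping * sum(
            rank[u] / len(out_edges[u]) for u in in_edges[v])
        change = abs(new_rank - rank[v])
        rank[v] = new_rank
        # Trigger neighbours only if the rank changed noticeably.
        if change > tol:
            for w in out_edges[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = async_pagerank(graph)
print(ranks)
```

Because a vertex is rescheduled only when one of its in-neighbours changes by more than `tol`, work concentrates on the parts of the graph that have not yet converged.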
  15. GraphLab 2: PowerGraph, Modeling Natural Graphs [1]
      • GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
        • Most graph-parallel abstractions assume small neighbourhoods, i.e. low-degree vertices.
        • But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
        • Power-law graphs are hard to partition; high-degree vertices limit parallelism.
      • PowerGraph provides a new way of partitioning power-law graphs.
        • Edges are tied to machines; vertices (esp. high-degree ones) span machines.
        • Execution is split into 3 phases: gather, apply and scatter.
      • Triangle counting on the Twitter graph
        • Hadoop MR took 423 minutes on 1536 machines.
        • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).
      [1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
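A minimal sketch of one gather-apply-scatter step over an edge-partitioned graph, using PageRank as the vertex program. It is plain Python, not the PowerGraph API: the function `gas_step`, the "edge index modulo machine count" placement rule and the change threshold are assumptions chosen for illustration.

```python
def gas_step(edges, rank, out_degree, num_machines=2, damping=0.85):
    """One gather-apply-scatter round of PageRank on a vertex-cut graph."""
    # Vertex-cut: every edge lives on exactly one machine, so a high-degree
    # vertex ends up with edges (and partial state) on several machines.
    partitions = [edges[m::num_machines] for m in range(num_machines)]

    # Gather: each machine sums contributions over its local edges only.
    partials = []
    for local_edges in partitions:
        acc = {}
        for src, dst in local_edges:
            acc[dst] = acc.get(dst, 0.0) + rank[src] / out_degree[src]
        partials.append(acc)

    # Apply: the partial sums of a vertex are combined and its value is updated.
    new_rank = {}
    for v in rank:
        total = sum(p.get(v, 0.0) for p in partials)
        new_rank[v] = (1 - damping) + damping * total

    # Scatter: vertices whose value changed activate their out-neighbours.
    changed = {v for v in rank if abs(new_rank[v] - rank[v]) > 1e-3}
    activated = {dst for src, dst in edges if src in changed}
    return new_rank, activated


edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
rank = {"a": 1.0, "b": 1.0, "c": 1.0, "hub": 1.0}
out_degree = {"a": 1, "b": 1, "c": 1, "hub": 1}
new_rank, activated = gas_step(edges, rank, out_degree)
print(new_rank, activated)
```

The in-edges of `hub` are split across both machines, so its gather runs as two partial sums that are merged in the apply phase. That is how a high-degree vertex stops being a single-machine bottleneck.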
  16. PMML Primer
      • Predictive Model Markup Language, developed by DMG (Data Mining Group).
      • PMML offers a standard to define a model, so that a model generated in tool A can be directly used in tool B.
      • XML representation of a model.
      • May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
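  17. Naïve Bayes Primer
      • A simple probabilistic classifier based on Bayes' theorem.
      • Given features X1,X2,…,Xn, predict a label Y by calculating the probability for every possible value of Y:
        P(Y | X1,…,Xn) = P(X1,…,Xn | Y) · P(Y) / P(X1,…,Xn)
        where P(X1,…,Xn | Y) is the likelihood, P(Y) is the prior and P(X1,…,Xn) is the normalization constant.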
  18. PMML Scoring for Naïve Bayes
      • Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
      • In principle, it can be used from any data processing framework by invoking its API.
      • Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark and Samza frameworks.
      • Real-time predictions with the above APIs.
  19. • Header
        • Version and timestamp
        • Model development environment information
      • Data Dictionary
        • Variable types; missing, valid and invalid values
      • Data Munging/Transformation
        • Normalization, mapping, discretization
      • Model
        • Model-specific attributes
        • Mining Schema
          • Treatment for missing and outlier values
        • Targets
          • Prior probability and default
        • Outputs
          • List of computed output fields
          • Post-processing
        • Definition of model architecture/parameters
  20. PMML Scoring for Naïve Bayes
      <DataDictionary numberOfFields="4">
        <DataField name="Class" optype="categorical" dataType="string">
          <Value value="democrat"/>
          <Value value="republican"/>
        </DataField>
        <DataField name="V1" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
        <DataField name="V2" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
        <DataField name="V3" optype="categorical" dataType="string">
          <Value value="n"/>
          <Value value="y"/>
        </DataField>
      </DataDictionary>
      (continued on the next slide)
  21. PMML Scoring for Naïve Bayes
      <NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
        <MiningSchema>
          <MiningField name="Class" usageType="predicted"/>
          <MiningField name="V1" usageType="active"/>
          <MiningField name="V2" usageType="active"/>
          <MiningField name="V3" usageType="active"/>
        </MiningSchema>
        <Output>
          <OutputField name="Predicted_Class" feature="predictedValue"/>
          <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
          <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
        </Output>
        <BayesInputs>
      (continued on the next slide)
  22. PMML Scoring for Naïve Bayes
        <BayesInputs>
          <BayesInput fieldName="V1">
            <PairCounts value="n">
              <TargetValueCounts>
                <TargetValueCount value="democrat" count="51"/>
                <TargetValueCount value="republican" count="85"/>
              </TargetValueCounts>
            </PairCounts>
            <PairCounts value="y">
              <TargetValueCounts>
                <TargetValueCount value="democrat" count="73"/>
                <TargetValueCount value="republican" count="23"/>
              </TargetValueCounts>
            </PairCounts>
          </BayesInput>
          <BayesInput fieldName="V2"> *
          <BayesInput fieldName="V3"> *
        </BayesInputs>
        <BayesOutput fieldName="Class">
          <TargetValueCounts>
            <TargetValueCount value="democrat" count="124"/>
            <TargetValueCount value="republican" count="108"/>
          </TargetValueCounts>
        </BayesOutput>
  23. PMML Scoring for Naïve Bayes: Definition of Elements
      • DataDictionary: definitions for fields as used in mining models (Class, V1, V2, V3).
      • NaiveBayesModel: indicates that this is a Naïve Bayes PMML.
      • MiningSchema: lists the fields as used in that model. Class is the "predicted" field; V1, V2, V3 are "active" predictor fields.
      • Output: describes a set of result values that can be returned from a model.
  24. PMML Scoring for Naïve Bayes: Definition of Elements (continued)
      • BayesInputs: for each value of each input field, contains the counts of each target (output) value.
      • BayesOutput: contains the counts associated with the values of the target field.
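To show how BayesInputs and BayesOutput counts turn into a prediction, here is a minimal scoring sketch in Python. It is not the scoring engine described in these slides. The XML fragment keeps only the V1 counts, since the V2 and V3 counts are elided on the slide, and the helper names `load_counts` and `score` are hypothetical. Replacing a zero conditional probability with the model's `threshold` is my reading of the PMML convention and should be treated as an assumption.

```python
import xml.etree.ElementTree as ET

# Reduced fragment: only V1, because V2 and V3 are elided on the slide.
PMML_FRAGMENT = """
<NaiveBayesModel threshold="0.003">
  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>
</NaiveBayesModel>
"""


def load_counts(xml_text):
    """Extract the threshold, target counts and per-input pair counts."""
    root = ET.fromstring(xml_text)
    threshold = float(root.get("threshold"))
    priors = {t.get("value"): float(t.get("count"))
              for t in root.find("BayesOutput").iter("TargetValueCount")}
    inputs = {}
    for bayes_input in root.iter("BayesInput"):
        inputs[bayes_input.get("fieldName")] = {
            pc.get("value"): {t.get("value"): float(t.get("count"))
                              for t in pc.iter("TargetValueCount")}
            for pc in bayes_input.findall("PairCounts")}
    return threshold, priors, inputs


def score(record, threshold, priors, inputs):
    """Return P(target | record) for every target value."""
    likelihood = {}
    for target, target_count in priors.items():
        value = target_count                      # proportional to the prior
        for field_name, field_value in record.items():
            p = inputs[field_name][field_value][target] / target_count
            value *= p if p > 0 else threshold    # assumed zero-count handling
        likelihood[target] = value
    total = sum(likelihood.values())              # normalization constant
    return {t: v / total for t, v in likelihood.items()}


threshold, priors, inputs = load_counts(PMML_FRAGMENT)
probs = score({"V1": "y"}, threshold, priors, inputs)
print(probs)
```

With only V1 = "y" as evidence, the democrat score is 124 × (73/124) = 73 against 23 for republican, so the predicted class is democrat with probability 73/96.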
  25. PMML Scoring for Naïve Bayes: Sample Input
      Eg1: n y y n y y n n n n n n y y y y
      Eg2: n y n y y y n n n n n y y y n y
      • 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField).
      • Using these we predict whether the output is democrat or republican (PMML element BayesOutput).
  26. PMML Scoring for Naïve Bayes
      • Storm cluster of 3 Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space, 1 Nimbus, 2 Supervisors)
      Number of records (millions)    Time taken (seconds)
      0.1                             4
      0.4                             7
      1.0                             12
      2.0                             21
      10                              129
      25                              310
  27. PMML Scoring for Naïve Bayes
      • Spark cluster of 3 Xeon machines (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)
      Number of records (millions)    Time taken
      0.1                             1 min 47 sec
      0.2                             3 min 35 sec
      0.4                             6 min 40 sec
      1.0                             35 min 17 sec
      10                              More than 3 hrs
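The two timing tables are easier to compare as throughput. The sketch below only does arithmetic on the numbers reported on the slides, with the Spark times converted to seconds. The 10-million-record Spark run is left out because only a lower bound ("more than 3 hrs") is given.

```python
# (records, seconds) pairs copied from the two benchmark slides.
storm_seconds = {100_000: 4, 400_000: 7, 1_000_000: 12,
                 2_000_000: 21, 10_000_000: 129, 25_000_000: 310}
spark_seconds = {100_000: 1 * 60 + 47, 200_000: 3 * 60 + 35,
                 400_000: 6 * 60 + 40, 1_000_000: 35 * 60 + 17}


def throughput(table):
    """Records scored per second, rounded to the nearest whole record."""
    return {records: round(records / seconds)
            for records, seconds in table.items()}


storm_rate = throughput(storm_seconds)
spark_rate = throughput(spark_seconds)
for records, rate in storm_rate.items():
    print(f"Storm {records:>10,d} records: {rate:>7,d} records/sec")
for records, rate in spark_rate.items():
    print(f"Spark {records:>10,d} records: {rate:>7,d} records/sec")
```

On these figures the Storm deployment sustains roughly 80,000 records per second at 25 million records, while the Spark deployment stays at or below about 1,000 records per second.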
  28. Conclusion
      • Beyond Hadoop Map-Reduce philosophy
      • Optimization and other problems.
      • Real-time computation
      • Processing specialized data structures
      • PMML scoring
      • Spark for batch computations
      • Spark streaming and Storm for real-time.
      • Allows traditional analytical tools/algorithms to be re-used.
  29. Thank You! Twitter: @a_vijaysrinivas