
Distributed computing abstractions_data_science_6_june_2016_ver_0.4


These slides outline the common distributed computing abstractions necessary to implement data science at scale. They start with a characterization of the computations required to realize common machine learning algorithms at scale, then introduce Hadoop MapReduce, Spark and GraphLab. Future revisions will add Flink, Titan and TensorFlow, show how to realize machine learning/deep learning algorithms on top of these frameworks, and discuss the trade-offs between them.

Published in: Software


  1. Distributed Computing Abstractions for Big Data Cloud Analytics. Dr. Vijay Srinivas Agneeswaran, Director of Technology, Data Sciences Group, SapientNitro, Bangalore.
  2. Big Data Computations. © Tally Solutions Pvt. Ltd. All Rights Reserved.
     Computations/operations (the "giants" of massive data analysis [1]):
     • Giant 1 (simple statistics) is perfect for Hadoop 1.0.
     • Giants 2 (linear algebra), 3 (N-body) and 4 (optimization): Spark from UC Berkeley is efficient. Examples: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. One application is the social-group-first approach to consumer churn analysis [2].
     • Interactive/on-the-fly data processing – Storm.
     • OLAP (data cube operations) – Dremel/Drill.
     • Data sets that are not embarrassingly parallel?
     • Deep learning (artificial neural networks/deep belief networks) – machine vision from Google [3], speech analysis from Microsoft.
     • Giant 5 (graph processing) – GraphLab, Pregel, Giraph.
     [1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
     [2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
     [3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240.
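The deck does not include code, but the "Giant 4" workloads above share one trait: they are iterative optimizations that make many passes over the same data, which is exactly where in-memory frameworks like Spark beat disk-based Hadoop MR. A minimal, framework-free sketch of one such workload (logistic regression fitted by stochastic gradient descent; the data set and hyperparameters are invented for illustration):

```python
import math

def sgd_logistic_regression(points, lr=0.1, epochs=200):
    """Fit weights w for P(y=1|x) = sigmoid(w . x) by stochastic
    gradient descent. Each epoch is a full pass over the data; the
    repeated passes are what make this a "Giant 4" iterative job."""
    dim = len(points[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in points:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the log-loss for one example is (p - y) * x.
            for i in range(dim):
                w[i] -= lr * (p - y) * x[i]
    return w

# Tiny separable data set (second feature is a bias term): y = 1 iff x0 > 0.
data = [([-2.0, 1.0], 0), ([-1.0, 1.0], 0), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w = sgd_logistic_regression(data)
```

On a cluster, each epoch would rescan the training set; Spark caches it in memory across iterations, while Hadoop MR re-reads it from disk every pass.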
  3. What is Hadoop? [Diagram]
  4. Data Flow in Spark and Hadoop. [Diagram]
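The original slide is a diagram, but the Hadoop MR data flow it contrasts with Spark can be sketched in plain Python as the classic three stages of word count (map → shuffle → reduce). The example data is invented; in real Hadoop the shuffle stage spills to disk between map and reduce, whereas Spark pipelines stages through memory:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key. In Hadoop this stage writes
    # intermediate data to disk; Spark keeps it in memory when it can.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's grouped values.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big compute", "big graphs"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 3, "data": 1, "compute": 1, "graphs": 1}
```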
  5. Comparison of Graph Processing Paradigms.
     • GraphLab 1 — Computation: asynchronous. Determinism: deterministic (serializable computation), but uses inefficient locking protocols. Power-law graphs: inefficient; the locking protocol is unfair to high-degree vertices.
     • Pregel — Computation: synchronous (BSP-based). Determinism: deterministic (serial computation). Efficiency vs. asynchrony: N/A. Power-law graphs: inefficient; suffers the curse of slow jobs.
     • Piccolo — Computation: asynchronous. Determinism: non-deterministic (non-serializable computation); efficient, but may lead to incorrectness. Power-law graphs: may be efficient.
     • GraphLab 2 (PowerGraph) — Computation: asynchronous. Determinism: deterministic (serializable computation); uses parallel locking for efficient, serializable, asynchronous computation. Power-law graphs: efficient; parallel locking and other optimizations for processing natural graphs.
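Pregel's BSP model and its "curse of slow jobs" follow from its superstep structure: every active vertex computes from the previous superstep's state, then a global barrier synchronizes before the next round, so each round waits for the slowest worker. A minimal single-process sketch (invented graph; maximum-value propagation, a standard Pregel example pattern):

```python
def bsp_max(neighbors, values):
    """Pregel-style synchronous supersteps: messages sent in superstep t
    are only read in superstep t+1, mimicking the global barrier.
    Propagates the maximum value to every vertex in a component."""
    active = set(neighbors)
    superstep = 0
    while active:
        # Barrier semantics: all messages are computed from old values.
        messages = {v: [] for v in neighbors}
        for v in active:
            for u in neighbors[v]:
                messages[u].append(values[v])
        new_active = set()
        for v, msgs in messages.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                new_active.add(v)  # vertex votes to stay active
        active = new_active
        superstep += 1
    return values, superstep

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
vals = {"a": 1, "b": 5, "c": 3}
final, steps = bsp_max(graph, vals)
# final == {"a": 5, "b": 5, "c": 5}
```

In a distributed run, the `while` loop's implicit barrier is exactly where one straggling worker stalls the entire superstep — the inefficiency the table attributes to Pregel.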
  6. GraphLab: Ideal Engine for Processing Graphs [YL12] – Dato.
     Goals – targeted at machine learning:
     • Model graph dependencies; be asynchronous, iterative and dynamic.
     • Data is associated with edges (weights, for instance) and with vertices (user profile data, current interests, etc.).
     Update functions – one lives on each vertex:
     • Transforms the data in the scope of its vertex.
     • Can choose to trigger neighbours (for example, only if the rank changes drastically).
     • Run asynchronously until convergence – no global barrier.
     Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge under inconsistent updates):
     • GraphLab provides varying levels of consistency – a parallelism vs. consistency trade-off.
     Several algorithms implemented, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, Co-EM, etc.:
     • The Co-EM (Expectation-Maximization) algorithm ran 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
     [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
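The update-function model above can be sketched without GraphLab itself: an update function runs on one vertex at a time, and re-schedules only the neighbours whose values might still change, with no global barrier. A minimal single-process sketch using PageRank (the graph and tolerance are invented; real GraphLab adds consistency models and distributed locking):

```python
from collections import deque

def async_pagerank(out_links, in_links, damping=0.85, tol=1e-6):
    """GraphLab-style dynamic, asynchronous scheduling: update one
    vertex, and trigger its neighbours only if its rank changed by
    more than tol. No superstep barrier is ever taken."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    queue = deque(out_links)       # vertices scheduled for an update
    scheduled = set(out_links)
    while queue:
        v = queue.popleft()
        scheduled.discard(v)
        incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
        new_rank = (1 - damping) / n + damping * incoming
        if abs(new_rank - rank[v]) > tol:
            rank[v] = new_rank
            # Dynamic computation: re-trigger only dependent neighbours.
            for u in out_links[v]:
                if u not in scheduled:
                    scheduled.add(u)
                    queue.append(u)
        else:
            rank[v] = new_rank
    return rank

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_links = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
ranks = async_pagerank(out_links, in_links)
```

The `scheduled` set is a stand-in for GraphLab's scheduler; updates read whatever the current neighbour ranks are, which is exactly why the consistency levels on this slide matter.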
  7. Other Abstractions.
     • Functional data science.
     • Exploration of recent algorithms:
       • EMPCA (Expectation-Maximization Principal Component Analysis).
       • ADMM (Alternating Direction Method of Multipliers).
       • Deep learning.
     • Distributed computing abstractions for data science – useful for engineering RFPs and next-generation architectures for clients:
       • Iterative computations – Apache Spark, Flink, MPI or Concord.
       • Graph computations – TensorFlow, GraphLab, Titan.
       • High-performance computations – CUDA.
  8. Other Abstractions.
     • Ad hoc queries on large data sets – next-gen SQL:
       • Approximate queries – BlinkDB.
       • Ad hoc queries – Apache Drill.
       • Distributed SQL – Spark SQL, Apache Tajo, HAWQ.
     • Data governance – important for all big data architectures:
       • Security – policies for role-based, fine-grained access control.
       • Encryption – for data at rest as well as data in motion.
       • Tokenization/obfuscation – data masking.
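The approximate-query idea behind BlinkDB is simple to illustrate without the system itself: answer an aggregate on a small random sample and report an error bound, trading a little accuracy for a large speedup. A framework-free sketch (the table, column name and sampling fraction are invented for illustration; BlinkDB's actual stratified-sampling machinery is more sophisticated):

```python
import random
import statistics

def approximate_avg(table, column, sample_frac=0.01, seed=42):
    """Estimate AVG(column) from a uniform random sample, returning the
    estimate and a 95% confidence half-width from the standard error."""
    rng = random.Random(seed)
    sample = [row[column] for row in table if rng.random() < sample_frac]
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, 1.96 * stderr

# 100,000 rows whose amounts cycle 0..99, so the true mean is 49.5.
rows = [{"amount": float(i % 100)} for i in range(100_000)]
est, err = approximate_avg(rows, "amount")
```

Scanning ~1% of the rows gives an answer within a tight, quantified error bar of the true mean – the accuracy/latency trade-off that approximate query engines expose to the user.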
  9. Future Abstractions.
     • Internet of Things (IoT) analytics:
       • Fog/edge computing – strong devices, weakly connected.
     • Data lake management – next-gen data management:
       • Hadoop-based data lakes.
       • NewSQL data stores.
  10. Thank You! Mail • LinkedIn • Blogs • Twitter • @a_vijaysrinivas.