Short introduction to ML frameworks on Hadoop

About Hadoop/Spark/Petuum

Published in: Data & Analytics

  1. Short introduction to ML frameworks on Hadoop. Yuya Takashina, 2016.
  2. Hadoop (2011-)
     • De facto standard for distributed storage and parallel processing of big data in production applications.
       • Google, Yahoo, Facebook, IBM, Twitter, …
     • The largest Hadoop cluster in the world has 4,500 nodes (Yahoo).
     • Consists of two parts:
       • Hadoop Distributed File System (HDFS)
       • MapReduce
     • There are some replacements for MapReduce.
     • Figure (image not preserved): how MapReduce works, with a barrier labeled. Source: https://dzone.com/articles/how-hadoop-mapreduce-works
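
The MapReduce model named on the Hadoop slide can be illustrated with a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration. The point it shows is that grouping by key needs the output of every mapper, so the reduce phase cannot begin until all map tasks have finished. That wait is the barrier.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Group values by key across ALL mapper outputs.

    This step needs every mapper to have finished, which is why it
    acts as a barrier between the map and reduce phases.
    """
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    mapped = [map_phase(split) for split in splits]  # independent per split
    grouped = shuffle(mapped)                        # barrier: needs all map output
    return reduce_phase(grouped)


# Hypothetical input: two splits, as if stored on two different nodes.
splits = [
    ["Hadoop stores data", "Hadoop runs jobs"],
    ["Spark runs on Hadoop"],
]
counts = word_count(splits)
print(counts)
```

In a real cluster the map tasks run on different nodes, so the slowest one determines when the reduce phase can start.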
  3. Spark (2014-)
     • Framework for data analytics on Hadoop.
     • Uses memory to cache data.
     • Up to 10x faster than MapReduce for certain applications:
       • Machine learning
       • Graph computation
       • Stream processing
     • APIs for Scala, Java, Python, and R.
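
Why in-memory caching helps iterative workloads such as machine learning can be shown with a toy simulation in plain Python. This is not the Spark API: the `Dataset` class, its `load` and `cache` methods, and `fit_mean` are hypothetical names invented for this sketch. An iterative algorithm scans the same data on every pass, so an uncached dataset is read from storage once per iteration, while a cached one is read once in total.

```python
class Dataset:
    """Toy stand-in for a dataset kept on disk (hypothetical, not a Spark class)."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = None
        self.disk_reads = 0  # how many times the data was "read from disk"

    def load(self):
        """Return the data, reading from 'disk' unless it is cached in memory."""
        if self._cache is not None:
            return self._cache
        self.disk_reads += 1
        return list(self._values)

    def cache(self):
        """Read the data once and keep it in memory for later scans."""
        self._cache = self.load()
        return self


def fit_mean(dataset, iterations=10, learning_rate=0.5):
    """Estimate the mean by gradient descent on half the mean squared error.

    Each iteration scans the whole dataset once, as iterative ML jobs do.
    """
    m = 0.0
    for _ in range(iterations):
        data = dataset.load()
        gradient = sum(m - x for x in data) / len(data)
        m -= learning_rate * gradient
    return m


uncached = Dataset([1.0, 2.0, 3.0, 4.0])
cached = Dataset([1.0, 2.0, 3.0, 4.0]).cache()

estimate_uncached = fit_mean(uncached)
estimate_cached = fit_mean(cached)

print(uncached.disk_reads, cached.disk_reads)
```

Both runs produce the same estimate; only the number of simulated disk reads differs.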
  4. Petuum (2015-)
     • Framework for machine learning on Hadoop.
     • Faster than Spark.
     • Barrier synchronization is a bottleneck in MapReduce and Spark.
     • Adopts a P2P, async-like communication strategy to reduce network communication costs.
     • Guarantees theoretical convergence to the optimal value by exploiting characteristics unique to ML programs:
       • optimization-centric
       • iterative-convergent
     • Implemented in C++.
     • Provides a deep learning API.
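
The slide does not name the "async-like" strategy. The Petuum papers in the reference list describe a bounded-staleness scheme (stale synchronous parallel), and the following plain-Python simulation sketches that idea under simplifying assumptions. It is not Petuum code: `run_bounded_staleness`, its parameters, and the tick-based worker model are hypothetical. A worker may start its next iteration only if it has completed at most `staleness` more iterations than the slowest worker, so `staleness=0` behaves like a barrier after every iteration.

```python
def run_bounded_staleness(periods, staleness, ticks):
    """Simulate workers advancing iteration clocks under a staleness bound.

    periods[i]: worker i is ready to run one iteration every periods[i] ticks
                (assumed toy model of worker speed).
    staleness:  how many iterations a worker may be ahead of the slowest
                worker and still start another one; 0 acts as a full barrier.
    Returns (clocks, blocked): completed iterations per worker, and how many
    times a ready worker had to wait.
    """
    clocks = [0] * len(periods)
    blocked = 0
    for tick in range(1, ticks + 1):
        slowest = min(clocks)  # snapshot taken at the start of the tick
        for i, period in enumerate(periods):
            if tick % period != 0:
                continue  # worker i is not ready on this tick
            if clocks[i] - slowest <= staleness:
                clocks[i] += 1  # allowed to run one more iteration
            else:
                blocked += 1    # too far ahead: wait for the slowest worker
    return clocks, blocked


# Two fast workers and one worker that is three times slower (assumed values).
periods = [1, 1, 3]

barrier_clocks, barrier_blocked = run_bounded_staleness(periods, staleness=0, ticks=30)
stale_clocks, stale_blocked = run_bounded_staleness(periods, staleness=3, ticks=30)

print(barrier_clocks, barrier_blocked)
print(stale_clocks, stale_blocked)
```

In this toy model the fast workers complete more iterations and wait less often when some staleness is allowed, while the gap between workers stays bounded. With a permanently slow worker the bound is eventually reached, so the gain is mostly a matter of absorbing temporary stragglers; the convergence guarantee mentioned on the slide is a separate theoretical result that this sketch does not demonstrate.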
  5. Reference
     • Powered by Apache Hadoop: https://wiki.apache.org/hadoop/PoweredBy
     • The Hadoop Ecosystem Table: https://hadoopecosystemtable.github.io/
     • A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning: https://petuum.github.io/papers/SysAlgTheoryKDD2015.pdf
     • Strategies and Principles of Distributed Machine Learning on Big Data: https://arxiv.org/abs/1512.09295
