Short introduction to
ML frameworks on Hadoop
Yuya Takashina 2016
• De facto standard for storage distribution and parallel processing on
big data in application.
• Google, Yahoo, Facebook, IBM, Twitter, …
• The largest Hadoop cluster in the world has 4,500 nodes (Yahoo)
• Consists of two parts.
• Hadoop Distributed File System
• There are some replacements
• Framework for data analytics on Hadoop.
• Use memory to cache data.
• Up to 10x faster than MapReduce for certain applications.
• Machine learning
• Graph computation
• Stream processing
• API for Scala/Java/Python/R.
• Framework for machine learning on Hadoop.
• Faster than Spark
• Barrier synchronization as bottleneck in MapReduce and Spark.
• Adopt P2P and async-like communication strategy to reduce
network communication costs.
• Guarantee the theoretical convergence to the optimal value
using the unique characters of ML programs.
• iterative convergent
• Implemented in C++.
• Providing Deep learning API.
• Powered by Apache Hadoop:
• The Hadoop Ecosystem Table:
• A New Look at the System, Algorithm and Theory Foundations of
Distributed Machine Learning:
• Strategies and Principles of Distributed Machine Learning on Big Data: