Apache Hamaa Bulk Synchronous Parallel Computing Edward J. Yoon <email@example.com>
Who Am I• Edward J. Yoon – @eddieyoon• Founder of Apache Hama• PMC member of Apache BigTop• Oracle Employee
What’s Hama?• Open Source – Under Apache 2.0 License• Written In Java• Apache Top Level Project
Characteristics• a General BSP computing engine – M/R like Input/Output Formatter • SequenceFile, Text, Accumulo, Hbase, …, etc. – Job Manager – Checkpoint Recovery• Streaming and Pipes – Python, C++, …, etc.• Graph and Machine Learning Packages – K-means, Gradient Descent, Collaborative Filtering
Bulk Synchronous Parallel?• Originally introduced by Valiant• a Sequence of supersteps
Compare to M/R and MPI• Supports message-passing paradigm style of application development• Provides a flexible, simple, and easy-to-use small APIs• Enables to perform better than MPI for communication-intensive applications• Guarantees impossibility of deadlocks or collisions in the communication mechanisms
So, fit for what?• Processing Big Data w/ complicated relationships – e.g., graph or network.• Iterative or Recursive scientific applications• Continuous Event Processing
Which is the Big Data?
Could be applied to• Analyze user actions and patterns• Social Target Marketing• Observe evolution of Social networks• Detect anomaly rapidly in Real-time• Business Intelligence
Internals• Pluggable RPC Architecture for message transfer – e.g., Hadoop RPC, Avro RPC, …, etc.• Message Collector, Bundler, and Compressor to reduce network overheads and contentions – e.g., Snappy, Bzip2, …, etc.