Hadoop: Beyond MapReduce


Overview of what lies beyond MapReduce in Hadoop, for the HPC/science community. Key point: move up the stack and reuse what is already there. But: some of these people are capable of writing their own YARN applications, so they should be encouraged to do so if they see a need.


  1. 1. © Hortonworks Inc. 2013 Hadoop: Beyond MapReduce
     Steve Loughran, Hortonworks
     stevel@hortonworks.com / @steveloughran
     Big Data workshop, June 2013
  2. 2. © Hortonworks Inc. Hadoop MapReduce
     1. Map: events → <k,v>* pairs
     2. Reduce: <k,[v1, v2, ... vn]> → <k,v'>
     • Map is trivially parallelisable on blocks in a file
     • Reduce parallelises on keys
     • The MapReduce engine can execute Map and Reduce sequences against data
     • HDFS provides data location for work placement
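     To make the model concrete, here is a minimal word count against the classic Hadoop MapReduce Java API. An illustrative sketch, not code from the talk; the class names and whitespace tokenisation are my own choices.

     import java.io.IOException;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class WordCount {
       // Map: one line of text -> <word, 1> pairs
       public static class TokenMapper
           extends Mapper<Object, Text, Text, IntWritable> {
         private static final IntWritable ONE = new IntWritable(1);
         private final Text word = new Text();
         @Override
         protected void map(Object key, Text value, Context context)
             throws IOException, InterruptedException {
           for (String token : value.toString().split("\\s+")) {
             if (!token.isEmpty()) {
               word.set(token);
               context.write(word, ONE);
             }
           }
         }
       }

       // Reduce: <word, [1, 1, ...]> -> <word, count>
       public static class SumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {
         @Override
         protected void reduce(Text key, Iterable<IntWritable> values, Context context)
             throws IOException, InterruptedException {
           int sum = 0;
           for (IntWritable v : values) {
             sum += v.get();
           }
           context.write(key, new IntWritable(sum));
         }
       }

       public static void main(String[] args) throws Exception {
         Job job = Job.getInstance(new Configuration(), "word count");
         job.setJarByClass(WordCount.class);
         job.setMapperClass(TokenMapper.class);
         job.setCombinerClass(SumReducer.class);
         job.setReducerClass(SumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
     }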
  3. 3. © Hortonworks Inc. MapReduce democratised big data
     • Conceptual model easy to grasp
     • Can write and test locally, superlinear scaleup
     • Tools and stack
     You don't need to understand parallel coding to run apps across 1000 machines.
  4. 4. © Hortonworks Inc. 2012 The stack is key to use
     [Diagram of the Hadoop stack and surrounding tools, including Kafka]
  5. 5. © Hortonworks Inc. 2012 Example: Pig
     -- load a CSV file, without a stored schema, declaring the columns inline
     generated = LOAD '$src/$srcfile' USING PigStorage(',', '-noschema')
         AS (line: int, gaussian: double, b: boolean, c: chararray);
     -- sort on the chararray column, then keep only the non-negative gaussians
     sorted = ORDER generated BY c ASC;
     result = FILTER sorted BY gaussian >= 0;
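     A usage note: if the script above were saved as, say, filter.pig (a hypothetical filename), the $src and $srcfile parameters could be filled in with Pig's parameter substitution; the values here are placeholders:

     pig -param src=/data -param srcfile=generated.csv filter.pig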
  6. 6. © Hortonworks Inc. Example: Apache Giraph
     • Graph nodes held in RAM
     • Exchange data with peers at barriers
     • Use cases: PageRank, friend-of-friend
     • But also: modelling cells in a heart
     This is Bulk Synchronous Parallel: read the Pregel paper.
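     The BSP model in code: a minimal PageRank-style compute function against the Giraph 1.x Java API, modelled on the PageRank example that ships with Giraph. A sketch for illustration, not from the slides; the superstep limit and damping factor are assumed values.

     import java.io.IOException;
     import org.apache.giraph.graph.BasicComputation;
     import org.apache.giraph.graph.Vertex;
     import org.apache.hadoop.io.DoubleWritable;
     import org.apache.hadoop.io.FloatWritable;
     import org.apache.hadoop.io.LongWritable;

     // Each superstep: receive messages sent before the last barrier,
     // update the vertex value, send messages for the next superstep.
     public class SimplePageRank extends BasicComputation<
         LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

       private static final int MAX_SUPERSTEPS = 30;  // assumed limit

       @Override
       public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                           Iterable<DoubleWritable> messages) throws IOException {
         if (getSuperstep() >= 1) {
           double sum = 0;
           for (DoubleWritable msg : messages) {
             sum += msg.get();
           }
           // standard PageRank update with damping factor 0.85
           vertex.setValue(new DoubleWritable(
               0.15 / getTotalNumVertices() + 0.85 * sum));
         }
         if (getSuperstep() < MAX_SUPERSTEPS) {
           // share this vertex's rank with its neighbours; delivered after the barrier
           sendMessageToAllEdges(vertex,
               new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
         } else {
           vertex.voteToHalt();
         }
       }
     }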
  7. 7. © Hortonworks Inc. But there is a lot more we can do
  8. 8. © Hortonworks Inc. New algorithms and runtimes
     • Giraph for graph work
     • Stream processing: Storm
     • Iterative and chained processing: Dryad-style
     • Long-lived processes
  9. 9. © Hortonworks Inc. Production-side issues
     • Scale to 10K nodes
     • Eliminate SPOFs and bottlenecks
     • Improve versioning by moving the MR engine user-side
     • Avoid having dedicated servers for other roles
  10. 10. © Hortonworks Inc. 2012 YARN: Yet Another Resource Negotiator
      [Architecture diagram: a Client submits a job to the Resource Manager;
      Node Managers report node status and host the App Master and containers;
      resource requests and MapReduce status flow between AM and RM]
      The App Master manages the app; the AM can request containers and run code in them.
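      What "the AM can request containers and run code in them" looks like in code: a minimal sketch against the YARN AMRMClient/NMClient Java APIs of Hadoop 2.x. The memory size, the sleep command, and the skipped error handling are all illustrative, not from the talk.

      import java.util.Collections;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.yarn.api.records.Container;
      import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
      import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
      import org.apache.hadoop.yarn.api.records.Priority;
      import org.apache.hadoop.yarn.api.records.Resource;
      import org.apache.hadoop.yarn.client.api.AMRMClient;
      import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
      import org.apache.hadoop.yarn.client.api.NMClient;
      import org.apache.hadoop.yarn.conf.YarnConfiguration;
      import org.apache.hadoop.yarn.util.Records;

      public class MinimalAppMaster {
        public static void main(String[] args) throws Exception {
          Configuration conf = new YarnConfiguration();

          // Register this process as the Application Master with the RM
          AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
          rmClient.init(conf);
          rmClient.start();
          rmClient.registerApplicationMaster("", 0, "");

          // Ask the RM for one container: 1 GB of RAM, 1 vcore;
          // null node/rack lists mean "place it anywhere"
          Priority priority = Priority.newInstance(0);
          Resource capability = Resource.newInstance(1024, 1);
          rmClient.addContainerRequest(
              new ContainerRequest(capability, null, null, priority));

          // Poll the RM until the container arrives, then launch a command in it
          NMClient nmClient = NMClient.createNMClient();
          nmClient.init(conf);
          nmClient.start();
          boolean launched = false;
          while (!launched) {
            for (Container container : rmClient.allocate(0.1f).getAllocatedContainers()) {
              ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
              ctx.setCommands(Collections.singletonList("sleep 60"));
              nmClient.startContainer(container, ctx);
              launched = true;
            }
            Thread.sleep(1000);
          }
          rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        }
      }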
  11. 11. © Hortonworks Inc. YARN vs other resource negotiators
      • MapReduce was the initial use case
      • Failures: the AM handles worker failures, YARN handles AM failures
      • Scheduling locality: sources of data, destinations; the AM provides location requests along with (CPU, RAM) requirements
  12. 12. © Hortonworks Inc. Pig/Hive-MR versus Pig/Hive-Tez
      [Diagram: Pig/Hive on MR hits an I/O synchronization barrier between jobs;
      Pig/Hive on Tez pipelines I/O across the whole plan]
      SELECT a.state, COUNT(*)
      FROM a JOIN b ON (a.id = b.id)
      GROUP BY a.state
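      A note for anyone trying this today: in Hive releases that shipped Tez support (0.13 and later, which postdate this talk), moving a query like the one above from MR to Tez is a single setting:

      set hive.execution.engine=tez;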
  13. 13. © Hortonworks Inc. FastQuery: beyond batch with YARN
      • Tez generalises MapReduce: simplified execution plans process data more efficiently
      • Always-on Tez service: low-latency processing for all Hadoop data processing
  14. 14. © Hortonworks Inc. You too can write a distributed execution framework, if you need to
  15. 15. © Hortonworks Inc. Start with the work in progress:
      • Hamster: MPI
      • Storm-YARN from Yahoo!
      • Hoya: HBase on YARN (mine)
      And start with other people's code:
      • Continuuity Weave: looks like the best place to start
  16. 16. © Hortonworks Inc. What are the services and algorithms we are going to need?
  17. 17. © Hortonworks Inc. P.S.: we are hiring
      http://hortonworks.com/careers/