Your SlideShare is downloading. ×
Extending Hadoop for Fun & Profit
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Extending Hadoop for Fun & Profit


Published on

Apache Hadoop project, and the Hadoop ecosystem has been designed be extremely flexible, and extensible. HDFS, Yarn, and MapReduce combined have more that 1000 configuration parameters that allow …

Apache Hadoop project, and the Hadoop ecosystem has been designed be extremely flexible, and extensible. HDFS, Yarn, and MapReduce combined have more that 1000 configuration parameters that allow users to tune performance of Hadoop applications, and more importantly, extend Hadoop with application-specific functionality, without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will provide with some extensions that boost application performance, such as optimized compression codecs, and pluggable shuffle implementations. With refactoring of MapReduce framework, and emergence of YARN, as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, that allows Message Passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work, that extends HDFS, by removing namespace limitations of the current Namenode implementation.

Published in: Technology

1 Comment
  • Hi Milind, slides are very much informative. thanks for sharing.

    - Peeyush
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Extending Hadoop for Fun & Profit Milind Bhandarkar Chief Scientist, Pivotal Software, (Twitter: @techmilind)
  • 2. About Me • • Founding member of Hadoop team atYahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid SolutionsTeam atYahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic),Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  • 3. Agenda • Extending MapReduce • Functionality • Performance • Beyond MapReduce withYARN • Hamster & GraphLab • Extending HDFS • Q & A
  • 4. Extending MapReduce
  • 5. MapReduce Overview •Record = (Key,Value) •Key : Comparable, Serializable •Value: Serializable •Logical Phases: Input, Map, Shuffle, Reduce, Output
  • 6. Map •Input: (Key1,Value1) •Output: List(Key2,Value2) •Projections, Filtering,Transformation
  • 7. Shuffle •Input: List(Key2,Value2) •Output •Sort(Partition(List(Key2, List(Value2)))) •Provided by Hadoop : Several Customizations Possible
  • 8. Reduce •Input: List(Key2, List(Value2)) •Output: List(Key3,Value3) •Aggregations
  • 9. MapReduce DataFlow
  • 10. Configuration •Unified Mechanism for •Configuring Daemons •Runtime environment for Jobs/Tasks •Defaults: *-default.xml •Site-Specific: *-site.xml •final parameters
  • 11. <configuration> <property> <name>mapred.job.tracker</name> <value></value> </property> <property> <name></name> <value>hdfs://</value> </property> <property> <name></name> <value>-Xmx512m</value> <final>true</final> </property> .... </configuration> Example
  • 12. Extending Input Phase • Convert ByteStream to List(Key,Value) • Several Formats pre-packaged • TextInputFormat<long, Text>! • SequenceFileInputFormat<K,V>! • KeyValueTextInputFormat<Text,Text>! • Specify InputFormat for each job • JobConf.setInputFormat()
  • 13. InputFormat •getSplits() : From Input descriptors, get Input Splits, such that each Split can be processed independently •<FileName, startOffset, length>! •getRecordReader() : From an InputSplit, get list of Records
  • 14. Industry Use Case ! SurveillanceVideo Anomaly Detection
  • 15. Acknowledgements • Victor Fang • Regu Radhakrishnan • Derek Lin • SameerTiwari
  • 16. Anomaly Detection in SurveillanceVideo • Detect anomalous objects in a restricted perimeter • Typical large enterprise collectsTB’s video per day • Hadoop MapReduce runs computer vision algorithms in parallel and captures violation events • Post-Incident monitoring enabled by Interactive Query
  • 17. Video DataFlow •TimestampedVideo Files as input •DistributedVideoTranscoding : ETL in Hadoop •DistributedVideo Analytics in Hadoop/ HAWQ •Insights in relational DB
  • 18. Real WorldVideo Data • Benchmark Surveillance videos from UK Home Office (iLids) • CCTVVideo footage depicting scenarios central to Govt requirements
  • 19. CommonVideo Standards • MPEG & ITU responsible for most video standards • MPEG-2 (1995) Widely adopted in DVDs, TV, SetTop boxes
  • 20. MPEG Standard Format •Sequence of encoded video frames •Compression by eliminating: •Redundancy inTime: Inter-Frame Encoding •Redundancy in Space: Intra-Frame Encoding
  • 21. Motion Compensation • I-Frame: Intra-Frame encoding • P-Frame: Predicated frame from previous frame • B-Frame: Predicted frame from both previous & next frame
  • 22. Distributed MPEG Decoding •HDFS splits large files in 64 MB/128 MB blocks •Each HDFS block can be processed independently by a Map task •Can we decode individual video frames from an arbitrary HDFS block in an MPEG File ?
  • 23. Splitting MPEG-2 • Header Information available only once per file • Group of Pictures (GOP) header repeats • Each GOP starts with an I-Frame and ends with an I-Frame • Each GOP can be decoded independently • First and last GOP may straddle HDFS blocks
  • 24. MPEG2InputFormat •Derived from FileInputFormat •getSplits() : Identical to FileInputFormat •InputSplit = HDFS Block •getRecordReader()! •MPEG2RecordReader
  • 25. MPEG2RecordReader •Start from beginning of block •Search for the first GOP Header •Locate an I-Frame, decode, keep in memory •If P-Frame, decode using last frame •If B-Frame, keep current frame in memory, read next frame, decode current frame
  • 26. Considerations for Input Format •Use as little metadata as possible •Number of Splits = Number of MapTasks •Combine small files •Split determination happens in a single process, so should be metadata-based •Affects scalability of MapReduce
  • 27. Scalability •If one node processes k MB/s, then N nodes should process (k*N) MB/s •If some fixed amount of data is processed in T minutes on one node, the N nodes should process same data in (T/N) minutes •Linear Scalability
  • 28. Reduce Latency Minimize Job Execution time
  • 29. Increase Throughput Maximize amount of data processed per unit time
  • 30. Amdahl’s Law S = N 1+!(N !1)
  • 31. Multi-Phase Computations •If computation C is split into N different parts, C1..CN •If partial computation Ci can be speeded up by a factor of Si
  • 32. Amdahl’s Law, Restated S = Ci i=1 N ∑ Ci Sii=1 N ∑
  • 33. Amdahl’s Law • Suppose Job has 5 phases: P0 is 10 seconds, P1, P2, P3 are 200 seconds each, and P4 is 10 seconds • Sequential runtime = 620 seconds • P1, P2, P3 parallelized on 100 machines with speedup of 80 (Each executes in 2.5 seconds) • After parallelization, runtime = 27.5 seconds • Effective Speedup: (620s/27.5s) = 22.5
  • 34. MapReduce Workflow
  • 35. Extending Shuffle
  • 36. Why Shuffle ? •Often, the most expensive phase in MapReduce, involves slow disks and network •Map tasks partition, sort and serialize outputs, and write to local disk •Reduce tasks pull individual Map outputs over network, merge, and may spill to disk
  • 37. Message Cost Model T = α + Nβ
  • 38. Message Granularity •For Gigabit Ethernet •α = 300 μS •β = 100 MB/s •100 Messages of 10KB each = 40 ms •10 Messages of 100 KB each = 13 ms
  • 39. Alpha-Beta • Common Mistake:Assuming that α is constant • Scheduling latency for responder • MR daemons time slice inversely proportional to number of concurrent tasks • Common Mistake:Assuming that β is constant • Network congestion • TCP incast
  • 40. Efficient Hardware Platforms •Mellanox - Hadoop Acceleration through Network-assisted Merge •RoCE - Brocade, Cisco, Extreme,Arista... •SSD -Velobit,Violin, FusionIO, Samsung.. •Niche - Compression, Encryption...
  • 41. Pluggable Shuffle & Sort •Replace HTTP-based pull with RDMA •Avoid spilling altogether •Replace default Sort implementation with Job-optimized sorting algorithm •Experimental APIs •google PluggableShuffleAndPluggableSort.html
  • 42. Mellanox UDA • Developed jointly with Auburn University • 2x Performance on TeraSort • Reduces disk writes by 45%, disk reads by 15%
  • 43. Syncsort DMX-h
  • 44. Beyond MapReduce withYARN
  • 45. Single'App' BATCH HDFS Single'App' INTERACTIVE Single'App' BATCH HDFS Single'App' BATCH HDFS Single'App' ONLINE Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)
  • 46. MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
  • 47. Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks) HADOOP 1.0 HDFS% (redundant,*reliable*storage)* MapReduce% (cluster*resource*management* *&*data*processing)* HDFS2% (redundant,*reliable*storage)* YARN% (cluster*resource*management)* Tez% (execu7on*engine)* HADOOP 2.0 Pig% (data*flow)* Hive% (sql)* % Others% (cascading)* * Pig% (data*flow)* Hive% (sql)* % Others% (cascading)* % MR% (batch)* RT%% Stream,% Graph% Storm,'' Giraph' * Services% HBase' *
  • 48. Applica'ons+Run+Na'vely+IN+Hadoop+ HDFS2+(Redundant,*Reliable*Storage)* YARN+(Cluster*Resource*Management)*** BATCH+ (MapReduce)+ INTERACTIVE+ (Tez)+ STREAMING+ (Storm,+S4,…)+ GRAPH+ (Giraph)+ INLMEMORY+ (Spark)+ HPC+MPI+ (OpenMPI)+ ONLINE+ (HBase)+ OTHER+ (Search)+ (Weave…)+ YARN Platform (Image Courtesy Arun Murthy, Hortonworks)
  • 49. NodeManager* NodeManager* NodeManager* NodeManager* Container*1.1* Container*2.4* NodeManager* NodeManager* NodeManager* NodeManager* NodeManager* NodeManager* NodeManager* NodeManager* Container*1.2* Container*1.3* AM*1* Container*2.2* Container*2.1* Container*2.3* AM2* Client2* ResourceManager* Scheduler* YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)
  • 50. YARN •Yet Another Resource Negotiator •Resource Manager •Node Managers •Application Masters •Specific to paradigm, e.g. MR Application master (aka JobTracker)
  • 51. Beyond MapReduce •Apache Giraph - BSP & Graph Processing •Storm onYarn - Streaming Computation •HOYA - HBase onYarn •Hamster - MPI on Hadoop •More to come ...
  • 52. Hamster • Hadoop and MPI on the same cluster • OpenMPI Runtime on HadoopYARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding
  • 53. Hamster Components •Hamster Application Master •Gang Scheduler,YARN Application Preemption •Resource Isolation (lxc Containers) •ORTE: Hamster Runtime •Process launching,Wireup, Interconnect
  • 54. Resource Manager Scheduler AMService Node Manager Node Manager Node Manager … Proc/ Container Framework Daemon NS MPI Scheduler HNP MPI AM Proc/ Container …RM-AM AM-NM RM-NodeManagerClient Client-RM Aux Srvcs Proc/ Container Framework Daemon NS Proc/ Container … Aux Srvcs RM- NodeManager Hamster Architecture
  • 55. Hamster Scalability •Sufficient for small to medium HPC workloads •Job launch time gated byYARN resource scheduler Launch WireUp Collective s Monitor OpenMPI O(logN) O(logN) O(logN) O(logN) Hamster O(N) O(logN) O(logN) O(logN)
  • 56. GraphLab + Hamster on Hadoop !
  • 57. About GraphLab •Graph-based, High-Performance distributed computation framework •Started by Prof. Carlos Guestrin in CMU in 2009 •Recently founded Graphlab Inc to commercialize
  • 58. GraphLab Features •Topic Modeling (e.g. LDA) •Graph Analytics (Pagerank,Triangle counting) •Clustering (K-Means) •Collaborative Filtering •Linear Solvers •etc...
  • 59. Only Graphs are not Enough •Full Data processing workflow required ETL/ Postprocessing,Visualization, Data Wrangling, Serving •MapReduce excels at data wrangling •OLTP/NoSQL Row-Based stores excel at Serving •GraphLab should co-exist with other Hadoop frameworks
  • 60. Coming Soon…
  • 61. Extending HDFS
  • 62. HCFS •Hadoop Compatible File Systems •FileSystem, FileContext •S3, Local FS, webhdfs •Azure Blob Storage, CassandraFS, Ceph, CleverSafe, Google Cloud Storage, Gluster, Lustre, QFS, EMCViPR (more to come)
  • 63. New Dataset •Reuse Namenode and Datanode implementations •Substitute a different DataSet implementation: FsDatasetSpi, FsVolumeSpi •Jira: HDFS-5194
  • 64. Extending Namenode •Pluggable Namespace: HDFS-5324, HDFS-5389 •Pluggable Block Management: HDFS-5477 •Requires fine-grained locking in Namenode: HDFS-5453
  • 65. Questions ?