Jan 2013 HUG: DistCp v2 for HUG 20130116
Speaker Notes

  • Pick out the lie.
  • DistCp used to be cleverly hidden in the swirling depths of the MapReduce source tree, under an “examples” directory, alongside another program, “DistCh” (distributed chmod). One look at its location (or at the code, if you have the stomach for it) makes it evident that the program was initially written to illustrate how MR can solve non-MR problems. Unfortunately, the example program solved a real-world problem, so patch upon patch was applied on top to make it “production ready”. In the end, the front-end, the InputFormat, and the Mapper code all fused into one 1,500-line monolith.
  • At its core, if you discount workflow-management, metrics and monitoring, user dashboards, etc., GDM Replication could simply use DistCp and not write its own MR jobs.
  • Setup-times: in an extreme experiment, copying a dataset of 1 million small files (half of which had different checksums) required a setup time of 8.5 hours. This marriage would not work as advertised. GDM Replication shipped to production without DistCp, and DistCp was rewritten from scratch promptly after. Why did we rewrite? Let me give you an analogy.
  • Anyone heard the story behind U2’s “Where the Streets Have No Name”, on “The Joshua Tree” album? The band spent more time working on that song than on the rest of the album combined. Specifically, they were trying to fit the initial guitar riff (written by Edge) to the rest of the song, which was in a different time signature. Days, weeks, months, trying to get the two parts to meet. And they were working off a single master tape. The only way they could get forward was because of Brian Eno.
  • Brian Eno, a producer on the album, comes in early one day, picks up the master tape, and stages an “accident” where he wipes the tape. And they have to start recording from scratch.
  • These were the changes. And no discussion of “Changes” would be complete without discussing “David Bowie”.
  • This is David Bowie playing Nikola Tesla in a film titled “The Prestige”. In the film, a millionaire magician asks Tesla to build him a teleportation device, to transport a grown, live adult over a distance of 100 feet in under a second. Tesla asks the magician, “Have you considered the cost of such a machine?” The millionaire magician says, “Price is no object.” Tesla replies, “Perhaps not, but have you considered the cost?”
  • How do you predict which files take longer to copy?
  • Reads over hdfs: The DFS Client takes care of reconnecting to each data-node that serves the blocks.
  • Reads over hftp: the FS client makes a connection to a single data-node. All blocks are funnelled through that same data-node, regardless of origin. If that data-node has a hardware fault, the whole copy is ruined.
  • Consider input-split

Presentation Transcript

  • DistCp (v.2) and the Dynamic InputFormat
    Mithun RK (mithunr@yahoo-inc.com), 2013-01-16
  • Me Yahoo: HCatalog, Hive, GDM Firmware Engineer at Hewlett Packard Fluent Hindi Gold medal at the nationals, last year.Yahoo! Presentation, Confidential 2 1/18/2013
  • Prelude
  • “Legacy” DistCp
    › Inter-cluster file copy using Map/Reduce
    › Command-line:
      hadoop distcp -m 20 hftp://source_nn:50070/datasets/search/20120523/US hftp://source_nn:50070/datasets/search/20120523/UK hdfs://target_nn:8020/home/mithunr/target
    › Algorithm:
      1. for (FileStatus f : FileSystem.globStatus(sourcePath)) { recurse(f); }
      2. Write the file listing to ~/_distCp_WIP_201301161600/file.list
      3. InputSplit calculation (sketched below):
         1. Divide the paths into ‘m’ groups (“splits”), one per mapper
         2. Total file size in each split roughly equal to the others
      4. Launch the MR job:
         1. Each map task copies the files specified in its InputSplit
    › Source: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/tools/org/apache/hadoop/tools/DistCp.java
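The split calculation in step 3 amounts to a greedy grouping by cumulative file size. Here is a minimal sketch of that idea; the names (FileEntry, group, UniformSplitSketch) are hypothetical, not the actual internals of DistCp.java linked above:

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative sketch only; see DistCp.java (link above) for the real code. */
    class UniformSplitSketch {

        static class FileEntry {
            final String path;
            final long length;
            FileEntry(String path, long length) { this.path = path; this.length = length; }
        }

        /** Cut the pre-listed files into numMaps groups of roughly equal total size. */
        static List<List<FileEntry>> group(List<FileEntry> files, int numMaps) {
            long totalBytes = 0;
            for (FileEntry f : files) totalBytes += f.length;
            long targetPerSplit = Math.max(1, totalBytes / numMaps);

            List<List<FileEntry>> splits = new ArrayList<>();
            List<FileEntry> current = new ArrayList<>();
            long currentBytes = 0;

            for (FileEntry f : files) {
                current.add(f);
                currentBytes += f.length;
                // Close this split once it reaches its byte quota,
                // keeping the final split open to absorb any remainder.
                if (currentBytes >= targetPerSplit && splits.size() < numMaps - 1) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
            }
            if (!current.isEmpty()) splits.add(current);
            return splits;
        }
    }

Because the grouping is fixed before the job launches, a split that happens to draw slow files or a slow data-node becomes the job's long tail; that static binding is what the rewrite attacks later.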
  • The (Unfortunate) Beginning
  • Data Management on Y! Clusters
    › Grid Data Management (GDM): life-cycle management for data on Hadoop clusters; 1+ petabytes per day.
    › Facets:
      1. Acquisition: Warehouse -> Cluster
      2. Replication: Cluster -> Cluster
      3. “Retention”: eviction of old data
      4. Workflow tracking, metrics, monitoring, user-dashboards, configuration management, etc.
    › GDM Replication:
      1. Use DistCp!
      2. Don’t re-implement an MR job for file-copy.
  • A marriage doomed to fail
    › Poor programmatic use (see the snippet below):
      • DistCp.main("-m", "20", "hftp://source_nn1:50070/source", "hdfs://target_nn1:8020/target");
      • Blocking call
      • Equal-size copy-distribution: can’t be overridden
    › Long setup-times:
      • Optimization: file.list contains only files that are changed or absent on the target
      • Compare checksums
      • E.g. experiment with a 200 GB dataset: 14 minutes of setup time
    › Atomic commit:
      • Example: Oozie workflow-launch on data availability
      • Premature consumption
      • Workarounds: _DONE_ markers (name-node pressure; hacks to ignore them at the source)
    › Others
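Written out as compilable Java, the legacy "API" was just the tool's main() entry point, which takes a String[] (the slide elides this) and blocks until the whole MR job completes. A sketch, assuming the branch-1 org.apache.hadoop.tools.DistCp is on the classpath:

    import org.apache.hadoop.tools.DistCp;

    public class LegacyDistCpCall {
        public static void main(String[] args) throws Exception {
            // Blocks until the copy job finishes; the caller gets no handle
            // on the running job and no way to override the equal-size
            // copy-distribution.
            DistCp.main(new String[] {
                "-m", "20",
                "hftp://source_nn1:50070/source",
                "hdfs://target_nn1:8020/target"
            });
        }
    }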
  • DistCp Redux (available in Hadoop 0.23/2.0)
  • Changes
    › (Almost) identical command-line:
      hadoop distcp -m 20 hftp://source_nn:50070/datasets/search/20120523/US/ hftp://source_nn:50070/datasets/search/20120523/UK hdfs://target_nn:8020/home/mithunr/target/
    › Reduced setup-times:
      • Postpone everything to the MR job
      • E.g. experiment with a 200 GB dataset: old: 14 minutes of setup time; new: 7 seconds
    › Improved copy-times:
      • Large-dataset copy test: time cut down from 17 hours to 7 hours
    › Atomic commit:
      hadoop distcp -atomic -tmp /home/mithunr/tmp /source /target
    › Improved programmatic use (expanded in the sketch below):
      • options = new DistCpOptions(srcPaths, destPath).preserve(BLOCKSIZE).setBlocking(false);
      • Job job = new DistCp(hadoopConf, options).execute();
    › Others: bandwidth throttling, asynchronous mode, configurable copy-strategies
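Expanding the two programmatic-use lines into a minimal driver. This follows the Hadoop 0.23/2.x org.apache.hadoop.tools API as shown on the slide, but exact signatures (e.g., whether preserve() chains fluently) vary across releases, so treat it as a sketch:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpV2Driver {
        public static void main(String[] args) throws Exception {
            List<Path> srcPaths = Arrays.asList(
                new Path("hftp://source_nn:50070/datasets/search/20120523/US"),
                new Path("hftp://source_nn:50070/datasets/search/20120523/UK"));
            Path destPath = new Path("hdfs://target_nn:8020/home/mithunr/target");

            DistCpOptions options = new DistCpOptions(srcPaths, destPath);
            options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE); // keep source block-sizes
            options.setBlocking(false);                              // return once submitted

            // execute() submits the MR job; with blocking=false the caller
            // gets the live Job handle back instead of waiting for completion.
            Job job = new DistCp(new Configuration(), options).execute();
            System.out.println("Submitted DistCp job: " + job.getJobID());
        }
    }

The non-blocking Job handle is what fixes the "blocking call" complaint from the GDM integration: the caller can poll, monitor, or kill the copy instead of being stuck inside main().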
  • Cost of copy
    › Copy-time is directly proportional to file-size (all else being equal)
    › Long-tailed MR jobs:
      • Copy twenty 2 GB files between clusters. Why does one take longer than the rest?
      • Hint: sometimes a file is slow initially, and then speeds up after a “block boundary”.
    › Are data-nodes equivalent?
      • Slower hard-drives
      • Failing NICs
      • Misconfiguration
    › Take a closer look at the command-line:
      hadoop distcp -m 20 hftp://source_nn:50070/datasets/search/20120523/US/ hftp://source_nn:50070/datasets/search/20120523/UK hdfs://target_nn:8020/home/mithunr/target/
  • Reads over hdfs:// (diagram)
  • Reads over hftp:// (diagram)
  • Long-tails (diagram: STUCK! / SPLIT!)
  • Mitigation
    › Break the static binding between InputSplits and Mappers. E.g. consider a DistCp of N files with 10 mappers:
      1. Don’t create 10 InputSplits. Create 20 instead.
      2. Store each InputSplit as a separate file, under hdfs://home/mithunr/_distcp_20130116/work-pool/
      3. A Mapper consumes one InputSplit, then checks for more.
      4. Mappers quit when no InputSplits are left. (The claiming step is sketched below.)
    › A single file per InputSplit? NameNode pressure.
    › DynamicInputFormat: ships as a separate library.
    › Performance:
      • Worst case is no worse than UniformSizeInputFormat
      • Best case: 17 hours -> 7 hours
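The work-pool mechanics reduce to "claim a chunk, copy its files, repeat". Below is a simplified sketch of the claiming step; the names and paths (acquireChunk, WorkPoolSketch, the claimed directory) are hypothetical, and the real logic lives in DynamicInputFormat. The key property it relies on is that HDFS rename is atomic, so exactly one task wins each chunk:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WorkPoolSketch {

        /**
         * Claim the next unowned chunk file, or return null when the pool
         * is empty. For any given chunk, exactly one task's rename succeeds;
         * the losers simply move on to the next chunk.
         */
        public static Path acquireChunk(FileSystem fs, Path workPool,
                                        Path claimedDir, String taskId) throws IOException {
            for (FileStatus chunk : fs.listStatus(workPool)) {
                Path claimed = new Path(claimedDir, chunk.getPath().getName() + "." + taskId);
                if (fs.rename(chunk.getPath(), claimed)) {
                    return claimed;  // this task now owns the chunk's file list
                }
            }
            return null;  // pool exhausted: the mapper exits
        }

        // Mapper loop (in outline):
        //   Path chunk;
        //   while ((chunk = acquireChunk(fs, workPool, claimedDir, taskId)) != null) {
        //       // copy every file listed in the chunk
        //   }
    }

Fast mappers therefore claim more chunks and slow ones fewer, which is why the long tail shrinks without making the worst case any worse than the uniform-size split.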
  • Future
    › Block-level parallelism:
      • Stream blocks individually
      • Stitch at the end: metadata
    › YARN:
      • Master-worker paradigm
  • _DONE_