Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Hadoop 0.22 and Other Versions


Published on

Overview of the features of Hadoop 0.22 release and its comparison to other versions.

Published in: Technology
  • Be the first to comment

Apache Hadoop 0.22 and Other Versions

  1. 1. Apache Hadoop 0.22and Other VersionsKonstantin V ShvachkoPrincipal Hadoop Architect, eBayIBM Karmasphere TwitterFebruary – March, 2012
  2. 2. eBay Inc. confidentialApache Hadoop Ecosystem• Hadoop Core– Common – communication and user facing APIs– HDFS – distributed file system– MapReduce – distributed computation framework• Pig – dataflow language• Hive – data warehouse, SQL• Zookeeper – distributed coordination service• HBase – columnar store• Oozie – complex job workflow• eBay Specific– Cascading– Lzo compression2
  3. 3. eBay Inc. confidentialHadoop Versioning• Straight line from 0.1 to 0.20• Fanned out starting from 0.20.2• Multiple distributions in 2010 based on 0.20– Apache, Y, CDH, FB– More today• Focus on Apache Releases– Release 0.20.2 2010-02-16– Release 0.21.0 2010-08-13– Release 2011-05-11 Security Stable– Release 2011-09-05 Improvements– Release 2011-10-17 HBase support• Genealogy of elephants3
  4. 4. eBay Inc. confidential4
  5. 5. eBay Inc. confidentialMajor Branches• Hadoop 1.0.0 (security branch) 2011-12-27– Rename of 0.20.205– Beta• Hadoop 0.22.0 2011-12-10– Continuation of 0.21.0– Beta• Hadoop 0.23.0 2011-11-11– Fedaration – static partitioning of HDFS namespace– Yarn – new implementation of MapReduce– Scalability– Alpha• 2011 – record number of major releases!• No unifying release, containing all the good features5
  6. 6. eBay Inc. confidentialHadoop 0.22 Branch• Branched 2010-11-17• Released 2011-12-10• Many events in-between• RM role – started in August 2011• Stabilization–Hadoop Platform team, eBay–Many contributors from the community6
  7. 7. eBay Inc. confidentialFeatures HDFS - 0.22• New implementation of file append• HBase support with hflush and hsync• Symbolic links• BackupNode and CheckpointNode• DataNodes tolerate single disk failure. Disk-fail-in-place• File concatenation• SLive test• Sticky bit• Offline Image Viewer7
  8. 8. eBay Inc. confidentialFeatures MapReduce - 0.22• Hierarchical job queues• Job limits per queue / pool• Dynamically stop / start job queues• Andvances in new MapReduce API– Input/Output formats, ChainMapper / ChainReducer• TaskTracker blacklisting• DistributedCache sharing8
  9. 9. eBay Inc. confidentialFeatures not Supported in Hadoop 0.22.0Compared to Hadoop 1.0• Security– LinuxTaskController removed MAPREDUCE-2767• Optimizations (operability) of the MapReduce frameworkintroduced in the Hadoop line of releases– Limits on per-job JobConf, Counters, StatusReport, Split-Sizes– User / queue limits on tasks / jobs in the CapacityScheduler• Disk-fail-in-place – MapReduce part• JMX-based metrics v2• Jetty workaround• CapacityScheduler should assign multiple tasks per heartbeat• Users task logs filling up local disks on the TaskTrackers• FairScheduler back-port from trunk9
  10. 10. eBay Inc. confidentialNot in Hadoop 0.22.0 HDFS Part• Shortcut a local client reads to a Datanodes files directly– Important HBase optimization– Porting is in progress• WebHDFS: accessing HDFS over HTTP– New experimental feature, back-ported from trunk• NameNode startup time– Handling block reports and missed heartbeats from DataNodes– The rest is forward ported from 1.0– More startup improvements in 0.2210
  11. 11. eBay Inc. confidentialHadoop 0.23 Features• HDFS Federation– Independent NameNodes sharing a common pool of DataNodes– Cluster is a family of volumes with shared block storage layer– User sees volumes as isolated file systems– ViewFS: the client-side mount table– Federated approach provides a static partitioning of the federated namespace• Yarn: Scalability for MapReduce framework– Separation of JobTracker functions1. Job scheduling and resource allocation:• Fundamentally centralized2. Job monitoring and job life-cycle coordination• Delegate coordination of different jobs to other nodes– Dynamic partitioning of cluster resources: no fixed slots• “Apache Hadoop: The scalability update” USENIX ;login: June, 201111
  12. 12. eBay Inc. confidentialAppend and HBase• Append means– Reopening of existing files for appending new data– Replica synchronization after failure– Consistent view of file data during writing by different clients– hflush, hsync – guarantee data delivered to DNs and persisted on NN• First implementation of append in 0.19 HADOOP-1700– 0.20-append branch• Redesign of append in 0.21 HDFS-265• HBase needs hflush and hsync only• Hadoop 1.0 - HBase support via hflush, hsync• Hadoop 0.22 – fully functional append, including HBase support12
  13. 13. eBay Inc. confidentialBackupNode• BackupNode a read-only NameNode– Contains all file system metadata: files and directoriesexcluding block locations– Can perform NameNode operations that don’t modify namespace• BN maintains up-to-date in-memory image of file system namespacealways synchronized with the NameNode state– NameNode streams journal to BackupNode• BackupNode can create a checkpoint without downloadingcheckpoint and journal files from active NameNode• Intended to evolve into hot HA HDFS-206413
  14. 14. eBay Inc. confidentialHadoop at eBay• 2011 started with 532-node 5 PB cluster running CDH2• EBay 0.20.203-based build (Wilma)– Hadoop 0.20.203 – latest stable Apache release• HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo– 500+ users; 2000 jobs per day• Runs on 1000-node cluster– 24 PB – capacity, 72 GB RAM / node• Many smaller clusters• Stabilization of Hadoop platform based on 0.2214
  15. 15. eBay Inc. confidentialTesting• One year of testing by different groups in Hadoop ecosystem• Extensive testing of append by HBase community• Fully automated build and certification with BigTop• Hadoop platform team at eBay– Extensive stabilization effort starting September– Most bugs found in 0.22 are also in trunk and 0.23– All new features tested– Stress testing– Reliability testing• Works with: Pig 0.8, Hive 0.7, custom changesHBase 0.92, Oozie, open sourcedZookeeper, Cascading no changes needed15
  16. 16. eBay Inc. confidentialTesting Tools, Examples• TeraSort, TestDFSIO, DistCp• GridMix, Rumen – production job traces• SLive – adjustable mix of HDFS operations, permanent load• Upgrade / rollback from 0.20.? and 0.20.203 to 0.22• Oversubscribed cluster running out of memory• Loosing racks with running jobs and HBase– Cluster survived consecutive loss of 4 racks, shrinking to single rackwith HBase still alive and MR jobs completing• Disk-fail-in-place helps identify bad drives during hardware burn-in16
  17. 17. eBay Inc. confidentialBenchmarking• TestDFSIO: 10 GB files (same as 100 GB)• TeraSort: -5% (scheduler to blame)• YCSB - same• Internal eBay applications – same or better• Lots of tuning: Hadoop, Java, OS, HW– Gradual improvement of results17ThroughputMB/secRead Write AppendHadoop-0.22 100 84 830.20 breed 96 66 n/a
  18. 18. eBay Inc. confidentialGood to have for 0.22.1• Restore Security• Disk Fail in place for MapReduce• Optimizations– Multiple tasks per heartbeat for CapacityScheduler– CapacityScheduler preemption• MR job and task limits• Cluster startup time• Add HA?• Merge MR-1.0 into Hadoop 0.22?18
  19. 19. eBay Inc. confidentialImportant• Works but not 0.20– Good new features– Reliability is the first concern– Performance and missing functionality can be reconstructed• Community release– Not distributed / advertized by commercial distributors– Community involvement important• Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0It’s the other way around– Go to Hadoop 0.22 instead• Forward-going release progress– Stop porting new features, start releasing them19
  20. 20. eBay Inc. confidentialThank you20Hadoop 0.22 Contributions Accepted