08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Apache Hadoop 0.22 and Other Versions
1. Apache Hadoop 0.22
and Other Versions
Konstantin V Shvachko
Principal Hadoop Architect, eBay
IBM Karmasphere Twitter
February – March, 2012
2. eBay Inc. confidential
Apache Hadoop Ecosystem
• Hadoop Core
– Common – communication and user facing APIs
– HDFS – distributed file system
– MapReduce – distributed computation framework
• Pig – dataflow language
• Hive – data warehouse, SQL
• Zookeeper – distributed coordination service
• HBase – columnar store
• Oozie – complex job workflow
• eBay Specific
– Cascading
– Lzo compression
2
3. eBay Inc. confidential
Hadoop Versioning
• Straight line from 0.1 to 0.20
• Fanned out starting from 0.20.2
• Multiple distributions in 2010 based on 0.20
– Apache, Y, CDH, FB
– More today
• Focus on Apache Releases
– Release 0.20.2 2010-02-16
– Release 0.21.0 2010-08-13
– Release 0.20.203.0 2011-05-11 Security Stable
– Release 0.20.204.0 2011-09-05 Improvements
– Release 0.20.205.0 2011-10-17 HBase support
• Genealogy of elephants
3
5. eBay Inc. confidential
Major Branches
• Hadoop 1.0.0 (security branch) 2011-12-27
– Rename of 0.20.205
– Beta
• Hadoop 0.22.0 2011-12-10
– Continuation of 0.21.0
– Beta
• Hadoop 0.23.0 2011-11-11
– Fedaration – static partitioning of HDFS namespace
– Yarn – new implementation of MapReduce
– Scalability
– Alpha
• 2011 – record number of major releases!
• No unifying release, containing all the good features
5
6. eBay Inc. confidential
Hadoop 0.22 Branch
• Branched 2010-11-17
• Released 2011-12-10
• Many events in-between
• RM role – started in August 2011
• Stabilization
–Hadoop Platform team, eBay
–Many contributors from the community
6
7. eBay Inc. confidential
Features HDFS - 0.22
• New implementation of file append
• HBase support with hflush and hsync
• Symbolic links
• BackupNode and CheckpointNode
• DataNodes tolerate single disk failure. Disk-fail-in-place
• File concatenation
• SLive test
• Sticky bit
• Offline Image Viewer
7
8. eBay Inc. confidential
Features MapReduce - 0.22
• Hierarchical job queues
• Job limits per queue / pool
• Dynamically stop / start job queues
• Andvances in new MapReduce API
– Input/Output formats, ChainMapper / ChainReducer
• TaskTracker blacklisting
• DistributedCache sharing
8
9. eBay Inc. confidential
Features not Supported in Hadoop 0.22.0
Compared to Hadoop 1.0
• Security
– LinuxTaskController removed MAPREDUCE-2767
• Optimizations (operability) of the MapReduce framework
introduced in the Hadoop 0.20.security line of releases
– Limits on per-job JobConf, Counters, StatusReport, Split-Sizes
– User / queue limits on tasks / jobs in the CapacityScheduler
• Disk-fail-in-place – MapReduce part
• JMX-based metrics v2
• Jetty workaround
• CapacityScheduler should assign multiple tasks per heartbeat
• User's task logs filling up local disks on the TaskTrackers
• FairScheduler back-port from trunk
9
10. eBay Inc. confidential
Not in Hadoop 0.22.0 HDFS Part
• Shortcut a local client reads to a Datanodes files directly
– Important HBase optimization
– Porting is in progress
• WebHDFS: accessing HDFS over HTTP
– New experimental feature, back-ported from trunk
• NameNode startup time
– Handling block reports and missed heartbeats from DataNodes
– The rest is forward ported from 1.0
– More startup improvements in 0.22
10
11. eBay Inc. confidential
Hadoop 0.23 Features
• HDFS Federation
– Independent NameNodes sharing a common pool of DataNodes
– Cluster is a family of volumes with shared block storage layer
– User sees volumes as isolated file systems
– ViewFS: the client-side mount table
– Federated approach provides a static partitioning of the federated namespace
• Yarn: Scalability for MapReduce framework
– Separation of JobTracker functions
1. Job scheduling and resource allocation:
• Fundamentally centralized
2. Job monitoring and job life-cycle coordination
• Delegate coordination of different jobs to other nodes
– Dynamic partitioning of cluster resources: no fixed slots
• “Apache Hadoop: The scalability update” USENIX ;login: June, 2011
11
12. eBay Inc. confidential
Append and HBase
• Append means
– Reopening of existing files for appending new data
– Replica synchronization after failure
– Consistent view of file data during writing by different clients
– hflush, hsync – guarantee data delivered to DNs and persisted on NN
• First implementation of append in 0.19 HADOOP-1700
– 0.20-append branch
• Redesign of append in 0.21 HDFS-265
• HBase needs hflush and hsync only
• Hadoop 1.0 - HBase support via hflush, hsync
• Hadoop 0.22 – fully functional append, including HBase support
12
13. eBay Inc. confidential
BackupNode
• BackupNode a read-only NameNode
– Contains all file system metadata: files and directories
excluding block locations
– Can perform NameNode operations that don’t modify namespace
• BN maintains up-to-date in-memory image of file system namespace
always synchronized with the NameNode state
– NameNode streams journal to BackupNode
• BackupNode can create a checkpoint without downloading
checkpoint and journal files from active NameNode
• Intended to evolve into hot HA HDFS-2064
13
14. eBay Inc. confidential
Hadoop at eBay
• 2011 started with 532-node 5 PB cluster running CDH2
• EBay 0.20.203-based build (Wilma)
– Hadoop 0.20.203 – latest stable Apache release
• HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo
– 500+ users; 2000 jobs per day
• Runs on 1000-node cluster
– 24 PB – capacity, 72 GB RAM / node
• Many smaller clusters
• Stabilization of Hadoop platform based on 0.22
14
15. eBay Inc. confidential
Testing
• One year of testing by different groups in Hadoop ecosystem
• Extensive testing of append by HBase community
• Fully automated build and certification with BigTop
• Hadoop platform team at eBay
– Extensive stabilization effort starting September
– Most bugs found in 0.22 are also in trunk and 0.23
– All new features tested
– Stress testing
– Reliability testing
• Works with: Pig 0.8, Hive 0.7, custom changes
HBase 0.92, Oozie, open sourced
Zookeeper, Cascading no changes needed
15
16. eBay Inc. confidential
Testing Tools, Examples
• TeraSort, TestDFSIO, DistCp
• GridMix, Rumen – production job traces
• SLive – adjustable mix of HDFS operations, permanent load
• Upgrade / rollback from 0.20.? and 0.20.203 to 0.22
• Oversubscribed cluster running out of memory
• Loosing racks with running jobs and HBase
– Cluster survived consecutive loss of 4 racks, shrinking to single rack
with HBase still alive and MR jobs completing
• Disk-fail-in-place helps identify bad drives during hardware burn-in
16
17. eBay Inc. confidential
Benchmarking
• TestDFSIO: 10 GB files (same as 100 GB)
• TeraSort: -5% (scheduler to blame)
• YCSB - same
• Internal eBay applications – same or better
• Lots of tuning: Hadoop, Java, OS, HW
– Gradual improvement of results
17
Throughput
MB/sec
Read Write Append
Hadoop-0.22 100 84 83
0.20 breed 96 66 n/a
18. eBay Inc. confidential
Good to have for 0.22.1
• Restore Security
• Disk Fail in place for MapReduce
• Optimizations
– Multiple tasks per heartbeat for CapacityScheduler
– CapacityScheduler preemption
• MR job and task limits
• Cluster startup time
• Add HA?
• Merge MR-1.0 into Hadoop 0.22?
18
19. eBay Inc. confidential
Important
• Works but not 0.20
– Good new features
– Reliability is the first concern
– Performance and missing functionality can be reconstructed
• Community release
– Not distributed / advertized by commercial distributors
– Community involvement important
• Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0
It’s the other way around
– Go to Hadoop 0.22 instead
• Forward-going release progress
– Stop porting new features, start releasing them
19