20080528dublinpt3

One in a series of presentations given at the IBM Cloud Computing Center in Dublin.

Transcript

  • 1. Managing a Large Hadoop Cluster. Jeff Hammerbacher, Manager, Data. May 28-29, 2008
  • 2. Anatomy of the Facebook Cluster: Hardware ▪ Individual nodes ▪ CPU: Intel Xeon dual socket quad cores (8 cores per box) ▪ Memory: 16 GB ECC DRAM ▪ Disk: 4 x 1 TB 7200 RPM SATA ▪ Network: 1 GbE ▪ Topology ▪ 320 nodes arranged into 8 racks of 40 nodes each ▪ 8 x 1 Gbps links out to the core switch
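
A quick back-of-the-envelope from these specs: 320 nodes x 4 x 1 TB is roughly 1.28 PB of raw disk, or about 425 TB of usable capacity at the 3x replication factor quoted on slide 10.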
  • 3. Anatomy of the Facebook Cluster: Functional Separation ▪ Need to have test, staging, and production clusters ▪ Break nodes into groups of 10 ▪ First 30 machines on each rack run DFS ▪ Last 10 machines used for DFS and upgrade testing or left idle ▪ Run main MapReduce cluster on 20 machines in each rack ▪ Run test MapReduce cluster on 10 machines in each of four racks ▪ A few other MapReduce clusters for isolated applications
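
Spelled out, that allocation is roughly 240 DFS nodes (30 x 8 racks), 160 nodes for the main MapReduce cluster (20 x 8), and 40 for the test MapReduce cluster (10 x 4), with the remaining machines used for upgrade testing, isolated applications, or left idle.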
  • 4. Anatomy of the Facebook Cluster: Software for Administration ▪ Most utilities are included in hadoop/bin ▪ Format DFS, start/stop daemons, fsck, rebalance blocks, etc. ▪ Hypershell (internal): provides distributed shell functionality ▪ See also: dsh, GXP, Capistrano, ClusterIt ▪ Cfengine: ensure uniform system images, configuration, and libraries ▪ ODS (internal): monitoring and alerting ▪ See also: Ganglia for monitoring, Nagios for alerting ▪ Cacti: network monitoring
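
For reference, the hadoop/bin utilities named above correspond roughly to the following commands in the 0.17-era distribution (a sketch; check bin/ in your release for the exact set):

    bin/hadoop namenode -format    # format a new DFS
    bin/start-all.sh               # start NN, DNs, JT, and TTs
    bin/stop-all.sh                # stop them again
    bin/hadoop fsck /              # file system consistency check
    bin/start-balancer.sh          # rebalance blocks across DataNodes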
  • 5. Anatomy of the Facebook Cluster: Excerpts from Facebook’s conf/hadoop-site.xml

       Property                              Value                                        Note
       dfs.block.size                        134217728                                    Larger block size for less NN metadata
       dfs.datanode.du.reserved              1024000000                                   Don’t fill up the local disk
       dfs.namenode.handler.count            40                                           More NN server threads for DN RPCs
       dfs.network.script                    /mnt/vol/hive/stable/bin/rackid.pl           Print machine network name
       fs.trash.interval                     1440
       fs.trash.root                         /Trash
       io.file.buffer.size                   32768                                        Size of r/w buffer used by SequenceFile
       io.sort.factor                        100                                          More streams merged while sorting
       io.sort.mb                            200                                          Higher memory limit while sorting data
       mapred.child.java.opts                -Xmx1024m -Djava.net.preferIPv4Stack=true    Large heap size; avoid RPC timeout
       mapred.linerecordreader.maxlength     1000000                                      Skip malformed lines
       mapred.min.split.size                 65536
       mapred.reduce.copy.backoff            5
       mapred.reduce.parallel.copies         20                                           More threads to fetch map output data
       mapred.tasktracker.tasks.maximum      5
       mapred.speculative.map.enabled        true
       mapred.speculative.reduce.enabled     false
       mapred.speculative.map.gap            1
       webinterface.private.actions          true
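
In the file itself each of these settings is a <property> element; dfs.block.size from the table above, for example, would appear as:

    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
      <description>Larger block size for less NN metadata</description>
    </property>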
  • 6. Anatomy of the Facebook Cluster: HDFS Tips from Dhruba Borthakur ▪ Be careful when using profilers to examine NN state ▪ Never load many small files ▪ Always use Java 1.6, otherwise the NN will consume about 50% more CPU ▪ When decommissioning DNs, do a max of 10 machines or so at a time, otherwise the NN gets overloaded ▪ Run fsck every night and monitor the number of missing/under-replicated blocks ▪ If a block stays under-replicated, force its replication factor up, then down ▪ When adding new DNs to the cluster, run the rebalancing script
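
The fsck and forced re-replication tips map onto the stock command line roughly like this (a sketch; the path is a placeholder and the default replication factor of 3 is assumed):

    bin/hadoop fsck / | grep -i replica        # nightly check for missing/under-replicated blocks
    bin/hadoop dfs -setrep 4 /path/to/file     # force replication above the default...
    bin/hadoop dfs -setrep 3 /path/to/file     # ...then back down, nudging the NN to re-replicate
    bin/start-balancer.sh                      # after adding new DataNodes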
  • 7. Anatomy of the Facebook Cluster: Common Issues ▪ Client libraries out of sync ▪ Non-uniform availability of software or libraries on TT nodes ▪ Bad disk, manifesting as a read-only file system (ROFS) ▪ NIC silently renegotiating down to 100 Mbps Ethernet ▪ DN reserved amount (dfs.datanode.du.reserved) not honored, resulting in a disk filled to capacity ▪ Resource contention
  • 8. Anatomy of the Facebook Cluster: More About Monitoring ▪ Hadoop has an abstract interface for metrics reporting ▪ org.apache.hadoop.metrics.spi ▪ Currently has “file” and “ganglia” implementations ▪ Every Metric belongs to a Context and a Record ▪ Metrics can also have Tags for disambiguation ▪ See conf/hadoop-metrics.properties for configuration ▪ Web interfaces to NN and JT also have detailed information ▪ A variety of cron’d scripts also take care of system-level monitoring
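
As an illustration of that configuration file (host, port, and file name here are placeholders), pointing the dfs context at Ganglia and the mapred context at a local file looks roughly like this:

    # conf/hadoop-metrics.properties
    # send dfs metrics to a Ganglia gmond every 10 seconds
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    dfs.period=10
    dfs.servers=gmond.example.com:8649

    # write mapred metrics to a local file instead
    mapred.class=org.apache.hadoop.metrics.file.FileContext
    mapred.fileName=/tmp/mapredmetrics.log
    mapred.period=10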
  • 9. Anatomy of the Facebook Cluster: More About Performance ▪ In addition to the metrics package, logs are a rich source of information ▪ Starting to regularly parse logs and store the information in a MySQL db ▪ Multiple research labs working in this area ▪ Berkeley RAD Lab ▪ Carnegie Mellon PDL ▪ Watch OSDI this year for papers
  • 10. Anatomy of the Facebook Cluster: Recent DFS Performance Numbers ▪ All DNs are on the same rack to isolate switch performance from the test ▪ 8 DNs, each with 2 map slots: hence performance levels off at 16 files ▪ Each mapper writes 1 GB/file; block size is 128 MB; replication factor is 3 ▪ Uses Java 1.6

       Number of Files    0.15.4 (MB/s)    0.17.0 (MB/s)
        1                 30               60
        2                 25               53
        3                 20               43
        5                 18               33
        8                  9               27
       13                  8               18
       20                  9               17
       24                  8               18
       28                  8               16
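
The per-mapper write load is easy to reproduce against the FileSystem API. The following is a minimal standalone sketch, not the original benchmark code (the class name is made up and the output path is supplied as an argument), timing a single 1 GB sequential write:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOneGB {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up conf/hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[65536];               // 64 KB writes
        long total = 1L << 30;                      // 1 GB per file, as in the test
        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(new Path(args[0]));
        try {
          for (long written = 0; written < total; written += buf.length) {
            out.write(buf);
          }
        } finally {
          out.close();                              // close() waits for the write pipeline to finish
        }
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println("wrote 1 GB at " + (1024.0 / secs) + " MB/s");
      }
    }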
  • 11. X-Trace + Hadoop: HDFS performance analysis
  • 12. Anatomy of the Facebook Cluster: Resource Management and Job Scheduling ▪ By far the most intensive cluster management responsibility ▪ At Facebook: manually set job priorities and kill jobs ▪ HOD ▪ Integrates with the Torque resource manager ▪ Torque frequently paired with the Maui cluster scheduler ▪ Other options ▪ Sun Grid Engine ▪ Condor ▪ Platform LSF (commercial)
  • 13. Manual Job Scheduling: Job Priorities and “Kill this Job” from the JT Web Interface
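
The same action is available from the command line; as a sketch (the job ID is a placeholder), killing a job looks like this, and a job's priority can likewise be set at submission time via the mapred.job.priority property:

    bin/hadoop job -kill job_200805281234_0042     # equivalent of the web UI's “Kill this Job”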
  • 14. Anatomy of the Facebook Cluster: Recent Cluster Statistics ▪ From May 2nd to May 21st: ▪ Total jobs: 8,794 ▪ Total map tasks: 1,362,429 ▪ Total reduce tasks: 86,806 ▪ Average duration of a successful job: 296 s ▪ Average duration of a successful map: 81 s ▪ Average duration of a successful reduce: 678 s
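
Put differently, that is roughly 440 jobs per day over the 20-day window, averaging about 155 map tasks and 10 reduce tasks per job.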
  • 15. (c) 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0
