Hadoop - Lessons Learned

Hadoop has proven to be an invaluable tool for many companies over the past few years. Yet it has its ways, and knowing them up front can save valuable time. This session is a rundown of the ever-recurring lessons learned from running various Hadoop clusters in production since version 0.15.
What to expect from Hadoop - and what not? How to integrate Hadoop into existing infrastructure? Which data formats to use? What compression? Small files vs. big files? Append or not? Essential configuration and operations tips. What about querying all the data? The project, the community, and pointers to interesting projects that complement the Hadoop experience.

Hadoop - Lessons Learned

  1. Hadoop - lessons learned
  2. @tcurdt · github.com/tcurdt · yourdailygeekery.com
  3. Data
  4. hiring
  5. Agenda: hadoop? really? cloud? · integration · mapreduce · operations · community and outlook
  6. Why Hadoop?
  7. "It is a new and improved version of enterprise tape drive"
  8. Map Reduce: 20 machines, 20 files of 1.5 GB each; hadoop job grep.jar grep "needle" file [bar chart: job runtimes, 0 to 70]
  9. Run your own? http://bit.ly/elastic-mr-pig
  10. Integration
  11. black box
  12. Engineers: hadoop-cat · hadoop-grep · hadoop-range --prefix /logs --from 2012-05-15 --until 2012-05-22 --postfix /*play*.seq | xargs hadoop jar · streaming jobs
  13. Non-Engineering Folks: mount hdfs · pig / hive · data dumps
  14. Map Reduce [pipeline diagram]: HDFS files -> InputFormat -> Split -> Map -> Combiner -> Sort -> Partitioner -> copy and merge -> Combiner -> Reducer -> OutputFormat
  15. Job Counters

      12/05/25 01:27:38 INFO mapred.JobClient: Reduce input records=106
      ..
      12/05/25 01:27:38 INFO mapred.JobClient: Combine output records=409
      12/05/25 01:27:38 INFO mapred.JobClient: Map input records=112705844
      12/05/25 01:27:38 INFO mapred.JobClient: Reduce output records=4
      12/05/25 01:27:38 INFO mapred.JobClient: Combine input records=64842079
      ..
      12/05/25 01:27:38 INFO mapred.JobClient: Map output records=64841776

      map in      : 112705844 *********************************
      map out     :  64841776 *****************
      combine in  :  64842079 *****************
      combine out :       409 |
      reduce in   :       106 |
      reduce out  :         4 |

      MAPREDUCE-346 (since 2009)
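      Counters like these are also the cheapest way to instrument your own jobs. A minimal sketch with the org.apache.hadoop.mapreduce API; the PlayMapper class and the "plays" counter group are made up for illustration:

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class PlayMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          if (value.getLength() == 0) {
            // custom counters show up next to the built-in ones in the job client output
            ctx.getCounter("plays", "BROKEN").increment(1);
            return;
          }
          ctx.getCounter("plays", "VALID").increment(1);
          ctx.write(value, ONE);
        }
      }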
  16. Job Counters

      map in      : 20000 **************
      map out     : 40000 ******************************
      combine in  : 40000 ******************************
      combine out : 10001 ********
      reduce in   : 10001 ********
      reduce out  : 10001 ********
  17. Map-only: mapred.reduce.tasks = 0
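      A minimal sketch of the same setting from job code, assuming the org.apache.hadoop.mapreduce API (older releases construct new Job(conf, ...) instead of Job.getInstance); the identity mapper and the paths are placeholders:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "map-only");
          job.setJarByClass(MapOnlyJob.class);
          job.setMapperClass(Mapper.class);   // identity mapper as a placeholder
          job.setNumReduceTasks(0);           // same effect as mapred.reduce.tasks = 0
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }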
  18. EOF on append

      public class EofSafeSequenceFileInputFormat<K,V>
          extends SequenceFileInputFormat<K,V> {
        ...
      }

      public class EofSafeRecordReader<K,V> extends RecordReader<K,V> {
        ...
        public boolean nextKeyValue() throws IOException, InterruptedException {
          try {
            return this.delegate.nextKeyValue();
          } catch (EOFException e) {
            return false;
          }
        }
        ...
      }
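      Wiring the tolerant format into a job is then a single call; a minimal sketch, assuming the classes above are packaged with the job:

      // replaces the stock SequenceFileInputFormat for this job
      job.setInputFormatClass(EofSafeSequenceFileInputFormat.class);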
  19. Serialization: before: ASN.1, custom Java serialization, Thrift; now: protobuf
  20. Custom Writables

      public static class Play extends CustomWritable {
        public final LongWritable time     = new LongWritable();
        public final LongWritable owner_id = new LongWritable();
        public final LongWritable track_id = new LongWritable();

        public Play() {
          fields = new WritableComparable[] { owner_id, track_id, time };
        }
      }
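      The CustomWritable base class itself is not on the slide; a hypothetical sketch of what it could look like, assuming it simply serializes the fields array in a fixed order:

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.WritableComparable;

      // hypothetical base class; the real one from the talk is not shown
      public abstract class CustomWritable implements WritableComparable<CustomWritable> {
        protected WritableComparable[] fields;

        public void write(DataOutput out) throws IOException {
          for (WritableComparable f : fields) {
            f.write(out);            // fixed field order, no per-record schema overhead
          }
        }

        public void readFields(DataInput in) throws IOException {
          for (WritableComparable f : fields) {
            f.readFields(in);        // reuses the pre-allocated field instances
          }
        }

        @SuppressWarnings("unchecked")
        public int compareTo(CustomWritable other) {
          for (int i = 0; i < fields.length; i++) {
            int c = fields[i].compareTo(other.fields[i]);
            if (c != 0) return c;
          }
          return 0;
        }
      }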
  21. Fear the State

      BytesWritable bytes = new BytesWritable();
      ...
      byte[] buffer = bytes.getBytes();
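      The trap: getBytes() returns the reused, padded backing array, so only the first getLength() bytes belong to the current record, and they are overwritten on the next one. A minimal sketch of the defensive copy (java.util.Arrays):

      // only getLength() bytes are valid; copy them if the data must outlive this call
      byte[] copy = Arrays.copyOf(bytes.getBytes(), bytes.getLength());

      Later Hadoop versions added BytesWritable.copyBytes() for exactly this.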
  22. Re-Iterate

      public void reduce(LongTriple key, Iterable<LongWritable> values, Context ctx) {
        for (LongWritable v : values) { }
        for (LongWritable v : values) { }    // second pass over values yields nothing
      }

      public void reduce(LongTriple key, Iterable<LongWritable> values, Context ctx) {
        buffer.clear();
        for (LongWritable v : values) {
          buffer.add(v);                     // materialize the values first
        }
        for (LongWritable v : buffer.values()) { }
      }

      HADOOP-5266 (applied to 0.21.0)
  23. BitSets

      long min = 1;
      long max = 10000000;

      FastBitSet set = new FastBitSet(min, max);
      for (long i = min; i < max; i++) {
        set.set(i);
      }

      org.apache.lucene.util.*BitSet
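      FastBitSet is not a stock class; a hypothetical sketch of the idea, a bit set shifted by the minimum id so the range does not need to start at zero, backed by java.util.BitSet (the org.apache.lucene.util.*BitSet classes on the slide serve the same purpose with long indexing and less overhead):

      import java.util.BitSet;

      // hypothetical stand-in for the FastBitSet used above
      class OffsetBitSet {
        private final long min;
        private final BitSet bits;

        OffsetBitSet(long min, long max) {
          this.min = min;
          this.bits = new BitSet((int) (max - min));   // the id window must fit an int
        }

        void set(long id)    { bits.set((int) (id - min)); }
        boolean get(long id) { return bits.get((int) (id - min)); }
        long cardinality()   { return bits.cardinality(); }
      }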
  24. Data Structures: http://bit.ly/data-structures · http://bit.ly/bloom-filters · http://bit.ly/stream-lib
  25. General Tips: test on small datasets, test on your machine · many reducers · always consider a combiner and partitioner · pig / streaming for one-time jobs, java / scala for recurring · http://bit.ly/map-reduce-book
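      Combiner and partitioner are one call each on the job; a minimal sketch against the org.apache.hadoop.mapreduce API, where TokenMapper and SumReducer are placeholder class names and the reducer doubles as the combiner:

      Job job = Job.getInstance(new Configuration(), "plays-per-track");
      job.setMapperClass(TokenMapper.class);          // placeholder
      job.setCombinerClass(SumReducer.class);         // shrinks map output before the shuffle
      job.setReducerClass(SumReducer.class);          // placeholder
      job.setPartitionerClass(HashPartitioner.class); // org.apache.hadoop.mapreduce.lib.partition
      job.setNumReduceTasks(32);                      // "many reducers"

      Reusing the reducer as the combiner only works when the reduce function is associative and commutative, as with plain sums.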
  26. Operations: use chef / puppet, runit / init.d, pdsh / dsh

      pdsh -w "hdd[001-019]" "sudo sv restart /etc/sv/hadoop-tasktracker"

  27. Hardware: 2x name nodes (raid 1, 12 cores, 48GB RAM, xfs, 2x1TB) · n x data nodes (no raid, 12 cores, 16GB RAM, xfs, 4x2TB)
  28. Monitoring

      dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
      dfs.period=10
      dfs.servers=...
      mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
      mapred.period=10
      mapred.servers=...
      jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
      jvm.period=10
      jvm.servers=...
      rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
      rpc.period=10
      rpc.servers=...
      # ignore
      ugi.class=org.apache.hadoop.metrics.spi.NullContext
  29. Monitoring: [graphs: total capacity, capacity used]
  30. Compression: # of 64MB blocks, # of bytes needed, # of bytes used, # of bytes reclaimed; bzip2 / gzip / lzo / snappy

      io.seqfile.compression.type = BLOCK
      io.seqfile.compression.blocksize = 512000
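      In code, block compression is chosen when the sequence file writer is created; a minimal sketch using the classic SequenceFile API (path and codec are illustrative):

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, new Path("/logs/plays.seq"),
          LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK,                  // io.seqfile.compression.type = BLOCK
          ReflectionUtils.newInstance(GzipCodec.class, conf)); // or an lzo / snappy codec where available

      writer.append(new LongWritable(42L), new Text("needle"));
      writer.close();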
  31. Janitor

      hadoop-expire -url namenode.here -path /tmp -mtime 7d -delete
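      hadoop-expire appears to be one of the speaker's own helper tools (like hadoop-cat and hadoop-range above); a hypothetical sketch of the same idea against the plain FileSystem API, deleting everything directly under a path that has not been modified for seven days:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class Janitor {
        public static void main(String[] args) throws Exception {
          Path root = new Path(args.length > 0 ? args[0] : "/tmp");
          long maxAgeMs = 7L * 24 * 60 * 60 * 1000;            // 7d, as in -mtime 7d
          long cutoff = System.currentTimeMillis() - maxAgeMs;

          FileSystem fs = FileSystem.get(new Configuration()); // uses the configured namenode
          for (FileStatus status : fs.listStatus(root)) {
            if (status.getModificationTime() < cutoff) {
              fs.delete(status.getPath(), true);               // recursive, like -delete
            }
          }
        }
      }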
  32. "The last block of an HDFS file only occupies the required space. So a 4k file only consumes 4k on disk." -- Owen
  33. Logfiles

      find -wholename "/var/log/hadoop/hadoop-*" \
           -wholename "/var/log/hadoop/job_*.xml" \
           -wholename "/var/log/hadoop/history/*" \
           -wholename "/var/log/hadoop/history/.*.crc" \
           -wholename "/var/log/hadoop/history/done/*" \
           -wholename "/var/log/hadoop/history/done/.*.crc" \
           -wholename "/var/log/hadoop/userlogs/attempt_*" \
           -mtime +7 -daystart -delete
  34. Limits

      sysctl.conf
        fs.file-max = 128000

      limits.conf
        hdfs   hard nofile 128000
        hdfs   soft nofile 64000
        mapred hard nofile 128000
        mapred soft nofile 64000

  35. Localhost

      before
        127.0.0.1 localhost localhost.localdomain
        127.0.1.1 hdd01.some.net hdd01

      hadoop
        127.0.0.1 localhost localhost.localdomain
        127.0.1.1 hdd01
  36. Rackaware

      site config

        <property>
          <name>topology.script.file.name</name>
          <value>/path/to/script/location-from-ip</value>
          <final>true</final>
        </property>

      topology script

        #!/usr/bin/ruby
        location = {
          'hdd001.some.net' => '/ams/1',
          '10.20.2.1'       => '/ams/1',
          'hdd002.some.net' => '/ams/2',
          '10.20.2.2'       => '/ams/2',
        }
        puts ARGV.map { |ip| location[ip] || '/default-rack' }.join(' ')
  37. Fix the Policy

      for f in `hadoop fsck / | grep "Replica placement policy is violated" \
                | awk -F: '{print $1}' | sort | uniq | head -n1000`; do
        hadoop fs -setrep -w 4 $f
        hadoop fs -setrep 3 $f
      done
  38. Fsck

      hadoop fsck / -openforwrite -files \
        | grep -i "OPENFORWRITE: MISSING 1 blocks of total size" \
        | awk '{print $1}' \
        | xargs -L 1 -i hadoop dfs -mv {} /lost+notfound
  39. Community: [chart: hadoop mailing list traffic, * from markmail.org]
  40. Community: The Enterprise Effect / "The Community Effect" (in 2011)
  41. Community: [chart: core and mapreduce mailing list traffic, * from markmail.org]
  42. The Future: incremental · real time · refined API · flexible pipelines · refined implementation
  43. Real Time: Datamining and Aggregation at Scale (Ted Dunning) · Eventually Consistent Data Structures (Sean Cribbs) · Real-time Analytics with HBase (Alex Baranau) · Profiling and performance-tuning your Hadoop pipelines (Aaron Beppu) · From Batch to Realtime with Hadoop (Lars George) · Event-Stream Processing with Kafka (Tim Lossen) · Real-/Neartime analysis with Hadoop & VoltDB (Ralf Neeb)
  44. Take Aways: use hadoop only if you must · really understand the pipeline · unbox the black box
  45. That's it folks! @tcurdt · github.com/tcurdt · yourdailygeekery.com
