This document provides an overview of Hadoop and lessons learned from running it in production. It discusses why Hadoop is used, how MapReduce and HDFS work, tips for integration and operations, and the outlook for the Hadoop community as it moves toward real-time capabilities and refined APIs. Key takeaways: use Hadoop only if you must, really understand your data pipeline, and "unbox the black box" of Hadoop.
23. Custom Writables
public static class Play extends CustomWritable {
  public final LongWritable time = new LongWritable();
  public final LongWritable owner_id = new LongWritable();
  public final LongWritable track_id = new LongWritable();

  public Play() {
    fields = new WritableComparable[] { owner_id, track_id, time };
  }
}
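`CustomWritable` is not part of Hadoop's API; the slide implies a base class that serializes the `fields` array in declaration order, so subclasses only list their fields once instead of hand-writing `write`/`readFields`. A minimal stand-in sketch of that idea, using plain `java.io` to stay self-contained (`LongField` is a hypothetical substitute for Hadoop's `LongWritable`):

```java
import java.io.*;

// Hypothetical stand-in for Hadoop's LongWritable: a mutable boxed long.
class LongField {
    private long value;
    public void set(long v) { value = v; }
    public long get() { return value; }
    public void write(DataOutput out) throws IOException { out.writeLong(value); }
    public void readFields(DataInput in) throws IOException { value = in.readLong(); }
}

// Sketch of the CustomWritable idea: write and read the fields array in
// the same order, so subclasses only declare their fields once.
abstract class CustomWritable {
    protected LongField[] fields;
    public void write(DataOutput out) throws IOException {
        for (LongField f : fields) f.write(out);
    }
    public void readFields(DataInput in) throws IOException {
        for (LongField f : fields) f.readFields(in);
    }
}

class Play extends CustomWritable {
    public final LongField time = new LongField();
    public final LongField owner_id = new LongField();
    public final LongField track_id = new LongField();
    public Play() { fields = new LongField[] { owner_id, track_id, time }; }
}
```

A round trip through a byte stream recovers the field values, because serialization and deserialization walk the same array.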
25. Re-Iterate
public void reduce(
    LongTriple key,
    Iterable<LongWritable> values,
    Context ctx) {
  // Broken: the values iterator can only be consumed once,
  // so the second loop sees nothing.
  for (LongWritable v : values) { }
  for (LongWritable v : values) { }
}

public void reduce(
    LongTriple key,
    Iterable<LongWritable> values,
    Context ctx) {
  // Works: buffer on the first pass, then re-iterate the buffer.
  // Copy each value -- Hadoop reuses the same Writable instance
  // across the iterator.
  buffer.clear();
  for (LongWritable v : values) { buffer.add(new LongWritable(v.get())); }
  for (LongWritable v : buffer) { }
}
HADOOP-5266 (applied to 0.21.0)
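The reason buffering must copy values: Hadoop hands the reducer the same mutable `Writable` instance on every step of the iterator, so a buffer of references ends up holding N pointers to the last value. A self-contained sketch of that reuse behavior in plain Java (`Holder` and `ReusingIterable` are hypothetical stand-ins for the reused Writable and the values iterable):

```java
import java.util.*;

// Hypothetical stand-in for a reused Writable: the iterable below hands
// out the SAME mutable instance on every call to next().
class Holder {
    long value;
}

class ReusingIterable implements Iterable<Holder> {
    private final long[] data;
    private final Holder shared = new Holder(); // one instance, reused
    ReusingIterable(long... data) { this.data = data; }
    public Iterator<Holder> iterator() {
        return new Iterator<Holder>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public Holder next() { shared.value = data[i++]; return shared; }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}

class Demo {
    // Buggy: buffers references to the shared instance, so every entry
    // ends up reading the last value.
    static List<Long> bufferRefs(ReusingIterable values) {
        List<Holder> buffer = new ArrayList<Holder>();
        for (Holder v : values) buffer.add(v);
        List<Long> out = new ArrayList<Long>();
        for (Holder v : buffer) out.add(v.value);
        return out;
    }
    // Correct: copies the value out of the shared instance.
    static List<Long> bufferCopies(ReusingIterable values) {
        List<Long> buffer = new ArrayList<Long>();
        for (Holder v : values) buffer.add(v.value);
        return buffer;
    }
}
```

Buffering references yields `[3, 3, 3]` for the input `1, 2, 3`, while buffering copies preserves `[1, 2, 3]` — the same trap the `new LongWritable(v.get())` copy above avoids.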
26. BitSets
long min = 1;
long max = 10000000;
FastBitSet set = new FastBitSet(min, max);
for (long i = min; i < max; i++) {
  set.set(i);
}
org.apache.lucene.util.*BitSet
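`FastBitSet` is not a standard class; from the constructor arguments it is presumably a bit set over the id range `[min, max)`, stored offset by `min` so the underlying array starts at zero. A self-contained sketch of that idea (here named `RangeBitSet` and backed by `java.util.BitSet`; the slide's pointer to `org.apache.lucene.util` is for Lucene's long-indexed bit set implementations, which suit much larger ranges):

```java
import java.util.BitSet;

// Sketch of a bit set over the id range [min, max), stored offset by
// min so ids far from zero don't waste memory on unused low bits.
// Backed by java.util.BitSet (int-indexed) to stay self-contained.
class RangeBitSet {
    private final long min;
    private final BitSet bits;

    RangeBitSet(long min, long max) {
        this.min = min;
        this.bits = new BitSet((int) (max - min));
    }
    void set(long i) { bits.set((int) (i - min)); }
    boolean get(long i) { return bits.get((int) (i - min)); }
    long cardinality() { return bits.cardinality(); }
}
```

Setting every id in `[1, 10000000)` as in the slide's loop costs roughly `(max - min) / 8` bytes, versus a hash set of boxed longs at dozens of bytes per entry.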
28. General Tips
· test on small datasets, and test on your own machine first
· use many reducers
· always consider a combiner and a partitioner
· pig / streaming for one-time jobs, java / scala for recurring ones
http://bit.ly/map-reduce-book
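Why "always consider a combiner and a partitioner": a combiner pre-aggregates map output locally, so the shuffle moves one partial record per distinct key per mapper instead of one record per input occurrence, and the partitioner decides which reducer each key lands on. A self-contained sketch of both effects (plain Java collections standing in for Hadoop's machinery; the `partition` formula matches Hadoop's default `HashPartitioner`):

```java
import java.util.*;

class CombinerDemo {
    // Combiner effect: local pre-aggregation collapses repeated keys
    // into one (key, partialCount) record before the shuffle.
    static Map<String, Long> combine(List<String> mapOutputKeys) {
        Map<String, Long> partial = new HashMap<String, Long>();
        for (String k : mapOutputKeys) {
            Long c = partial.get(k);
            partial.put(k, c == null ? 1L : c + 1L);
        }
        return partial;
    }

    // Hadoop's default HashPartitioner routing: the same key always
    // lands on the same reducer, and results stay in [0, numReducers).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Four map records with two distinct keys shrink to two shuffled records; the reducer then sums the partial counts, which works because addition is associative and commutative.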
29. Operations
· chef / puppet for configuration management
· runit / init.d for service supervision
· pdsh / dsh for running commands across the cluster, e.g.
pdsh -w "hdd[001-019]" "sudo sv restart /etc/sv/hadoop-tasktracker"
30. Hardware
· 2x name nodes, RAID 1: 12 cores, 48GB RAM, xfs, 2x1TB
· n x data nodes, no RAID: 12 cores, 16GB RAM, xfs, 4x2TB
48. Real Time Datamining and Aggregation at Scale (Ted Dunning)
Eventually Consistent Data Structures (Sean Cribbs)
Real-time Analytics with HBase (Alex Baranau)
Profiling and performance-tuning your Hadoop pipelines (Aaron Beppu)
From Batch to Realtime with Hadoop (Lars George)
Event-Stream Processing with Kafka (Tim Lossen)
Real-/Neartime analysis with Hadoop & VoltDB (Ralf Neeb)
49. Takeaways
· use hadoop only if you must
· really understand the pipeline
· unbox the black box
50. That’s it
folks!
@tcurdt
github.com/tcurdt
yourdailygeekery.com