15. The Numbers
Machines
– 60 machines
● Datanode
● Tasktracker
● RegionServer
– 2 machines
● Namenode
– 2 machines
● HBase Master
– In the process of doubling our capacity
16. The Numbers
Machines
– 2 * Intel Xeon E5620 @ 2.40GHz (16 logical cores total)
– 24GB RAM
– 6 * 2 TB disks in JBOD (small partition on first
disk for OS, rest is storage)
– 1 Gigabit network links
17. The Numbers
Data
– Avg load of 3500 interactions/second
– Peak load of 6000 interactions/second
– Highest during the Super Bowl – 12000
interactions/second
– Avg size of interaction 2 KB – that's about 2 TB a day
with replication (RF = 3)
– And that's not it!
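The daily volume follows from the figures above. A minimal back-of-the-envelope sketch, assuming decimal units (1 GB = 1,000,000 KB) and that the average load holds for the whole day:

```python
# Back-of-the-envelope check of the daily volume quoted on the slide.
AVG_INTERACTIONS_PER_SEC = 3500   # from the slide
AVG_INTERACTION_KB = 2            # from the slide
REPLICATION_FACTOR = 3            # from the slide (RF = 3)
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal units (1 GB = 1,000,000 KB, 1 TB = 1,000 GB).
raw_kb_per_day = AVG_INTERACTIONS_PER_SEC * AVG_INTERACTION_KB * SECONDS_PER_DAY
raw_gb_per_day = raw_kb_per_day / 1_000_000
replicated_tb_per_day = raw_gb_per_day * REPLICATION_FACTOR / 1000

print(f"raw: {raw_gb_per_day:.1f} GB/day")                 # raw: 604.8 GB/day
print(f"replicated: {replicated_tb_per_day:.2f} TB/day")   # replicated: 1.81 TB/day
```

Roughly 600 GB of raw data a day becomes about 1.8 TB once stored three times, which rounds to the 2 TB on the slide.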
18. The Use Cases
HBase
– Recordings
– Archive/Ultrahose
Map/Reduce
– Exports
– Historics
19. The Use Cases
Recordings
– User-defined streams
– Stored in HBase for later retrieval
– Export to multiple output formats and stores
– <recording-id><interaction-uuid>
● Recording-id is a SHA-1 hash
● Allows recordings to be distributed by their key
without generating hot-spots.
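A minimal sketch of how a row key of this shape could be assembled. The slide only says the recording-id is a SHA-1 hash; what gets hashed (here a stream definition string) and the binary layout are assumptions for illustration, and `recording_row_key` is a hypothetical helper name.

```python
import hashlib
import uuid


def recording_row_key(stream_definition: str, interaction_uuid: uuid.UUID) -> bytes:
    """Build <recording-id><interaction-uuid> as raw bytes.

    Assumption: the recording-id is the SHA-1 of the stream definition text.
    """
    recording_id = hashlib.sha1(stream_definition.encode("utf-8")).digest()  # 20 bytes
    return recording_id + interaction_uuid.bytes                              # + 16 bytes


# Two interactions of the same recording share the 20-byte prefix,
# so they sort next to each other and can be read with one prefix scan.
key_a = recording_row_key("hypothetical stream definition", uuid.uuid4())
key_b = recording_row_key("hypothetical stream definition", uuid.uuid4())
print(len(key_a), key_a[:20] == key_b[:20])
```

Because SHA-1 output is effectively uniform, different recordings land in different parts of the key space, while the rows of one recording stay contiguous.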
21. The Use Cases
Exporter
– Export data from HBase for customer
– Export files of 5 – 10 GB, or 3 – 6 million records
– MR over HBase using TableInputFormat
– But the data needs to be sorted
● TotalOrderPartitioner
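The idea behind total-order partitioning can be illustrated in a few lines: sample the keys, pick split points, and route each key to the partition whose range contains it. This is a plain Python sketch of the technique, not Hadoop's API; the function names and the sample size are assumptions.

```python
import bisect
import random


def pick_split_points(sample_keys, num_partitions):
    """Choose num_partitions - 1 boundaries from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Partition i receives keys in [split_points[i-1], split_points[i])."""
    return bisect.bisect_right(split_points, key)


random.seed(0)
keys = [random.randrange(1_000_000) for _ in range(10_000)]
splits = pick_split_points(random.sample(keys, 500), num_partitions=4)

partitions = [[] for _ in range(4)]
for k in keys:
    partitions[partition_for(k, splits)].append(k)

# Each "reducer" sorts only its own partition; concatenating the
# partitions in order yields a globally sorted result.
merged = [k for part in partitions for k in sorted(part)]
print(merged == sorted(keys))
```

Sampling keeps the partitions roughly equal in size, so no single reducer ends up sorting most of the export.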
25. The Use Cases
Archive/Ultrahose
– Not just the Firehose but the Ultrahose
– Stored in HBase as well
– HBase architecture (BigTable) stores rows sorted by key, which
creates hotspots with time-series data
● Leading randomizing bit (see HBaseWD)
● Pre-split regions
● Concurrent writes
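A minimal sketch of the first two points, in the spirit of HBaseWD but not using its API: a one-byte bucket prefix derived from a hash of the key, plus one split point per bucket so the regions exist before writes start. The bucket count, key layout and function names are assumptions for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; typically sized relative to the number of RegionServers


def salted_key(timestamp_ms: int, interaction_id: str) -> bytes:
    """Prefix a time-ordered key with a deterministic one-byte bucket."""
    original = timestamp_ms.to_bytes(8, "big") + interaction_id.encode("utf-8")
    bucket = hashlib.md5(original).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + original


def presplit_points(num_buckets: int = NUM_BUCKETS):
    """One region boundary per bucket prefix: 0x01, 0x02, ..., num_buckets - 1."""
    return [bytes([b]) for b in range(1, num_buckets)]


# Consecutive timestamps no longer hit a single region.
buckets = {salted_key(1_700_000_000_000 + i, f"id-{i}")[0] for i in range(1000)}
print(len(buckets), len(presplit_points()))
```

The trade-off is on the read side: a time-range scan now has to query every bucket and merge the results.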
26. The Use Cases
Archive continued …
2 years of Tweets
– 11 TB compressed
– <Number of tweets we got>
27. The Use Cases
Historics
– Export archive data
– Slightly different from Exporter
● Much larger time lines (1 – 3 months)
● Unfiltered input data
● Therefore longer processing time
● Hence more optimizations required
29. The Lessons - HBase
Tune Tune Tune (Default == BAD)
Tune based on the use case:
– Heap
– Block Size
– Memstore size
Keep number of column families low
Be aware of the hot-spotting issue when writing
time-series data
Use compression (e.g. Snappy)
30. The Lessons - HBase
Ops need an intimate understanding of
the system
Monitor metrics (GC, CPU, Compaction, I/O)
Don't be afraid to fiddle with HBase code
Using a distribution is advisable