THE NUMBERS
• Machines
  • HBase
    • 60 machines as RegionServers
    • 1 HMaster
    • 3 ZooKeeper nodes (see the connection sketch below)
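As a rough illustration (my sketch, not from the deck, and using the HBase 1.x-era client API): with this layout a client only needs the 3-node ZooKeeper quorum, and discovers the HMaster and the 60 RegionServers from there. Host names and the table name are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class ClusterClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client only needs the ZooKeeper ensemble; the HMaster and the
        // RegionServers are discovered from there. Host names are made up.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("recordings"))) {
            // reads and writes against the 60 RegionServers go here
        }
    }
}
```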
THE NUMBERS
• Machines
  • Hadoop
    • 135 machines divided into 2 clusters
    • DataNodes/TaskTrackers
    • NameNodes with high-availability failover
    • 1 JobTracker each
THE NUMBERS
• Machines
  • DL380 Gen8
    • 2 × Intel Xeon E5646 @ 2.40GHz (24 cores total)
    • 48 GB RAM
    • 6 × 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)
    • 1 Gigabit network links
THE NUMBERS
• Data
  • Average load of 7,500 interactions per second
  • Peak loads of 15,000 interactions per second sustained over a minute
  • Peak of 21,000 interactions per second during the Super Bowl
  • Total current capacity ~1.6 PB; total current usage ~800 TB
  • Average size of an interaction is 2 KB – that's ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
  • And that's not it!
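A quick back-of-envelope check of the "~1 GB a minute" figure from the averages above (my arithmetic, not the talk's):

```java
// Rough throughput arithmetic from the average numbers quoted on the slide.
public class IngestRate {
    public static void main(String[] args) {
        double interactionsPerSecond = 7500;   // average load
        double interactionSizeKb = 2;          // ~2 KB per interaction
        double mbPerSecond = interactionsPerSecond * interactionSizeKb / 1024;  // ~14.6 MB/s
        double gbPerMinute = mbPerSecond * 60 / 1024;                           // ~0.86 GB/min, i.e. ~1 GB a minute
        System.out.printf("%.1f MB/s  ~  %.2f GB/min%n", mbPerSecond, gbPerMinute);
    }
}
```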
THE USE CASES
• HBase
  • Recordings
  • Archive
• Map/Reduce
  • Exports
  • Historics
  • Migration
THE USE CASES
• Recordings
  • User-defined streams
  • Stored in HBase for later retrieval
  • Export to multiple output formats and stores
  • Row key: <recording-id><interaction-uuid>
  • Recording-id is a SHA-1 hash
  • Allows recordings to be distributed by their key without generating hot-spots (see the key sketch below)
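A minimal sketch of that key layout (my code, not DataSift's): the 20-byte SHA-1 recording id leads the key, so recordings spread evenly across regions, and the interaction UUID keeps each row unique.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class RecordingRowKey {
    // Builds a 36-byte row key: 20 bytes of SHA-1(recording name) + 16 bytes of UUID.
    public static byte[] rowKey(String recordingName, UUID interactionUuid) throws Exception {
        byte[] recordingId = MessageDigest.getInstance("SHA-1")
                .digest(recordingName.getBytes(StandardCharsets.UTF_8));
        ByteBuffer key = ByteBuffer.allocate(recordingId.length + 16);
        key.put(recordingId);                                  // hashed prefix: no hot-spots
        key.putLong(interactionUuid.getMostSignificantBits()); // interaction UUID keeps rows unique
        key.putLong(interactionUuid.getLeastSignificantBits());
        return key.array();
    }

    public static void main(String[] args) throws Exception {
        byte[] key = rowKey("my-recording", UUID.randomUUID());
        System.out.println("row key length: " + key.length);   // 36 bytes
    }
}
```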
THE USE CASES
• Exporter
  • Export data from HBase for customers
  • Export files ~5–10 GB or ~3–6 million records
  • MR over HBase using TableInputFormat
  • But the data needs to be sorted
  • TotalOrderPartitioner (see the job sketch below)
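A hedged sketch of such an export job (assumed structure, not the talk's code; the table name, output path, sampling rate, and reducer count are illustrative). TableInputFormat scans the table, and an InputSampler-built partition file lets TotalOrderPartitioner produce globally sorted reducer output:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SortedExportJob {

    // Emits (row key, serialised record) so the shuffle sorts records by row key.
    static class ExportMapper extends TableMapper<ImmutableBytesWritable, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context context)
                throws IOException, InterruptedException {
            context.write(row, new Text(Bytes.toStringBinary(result.value())));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "sorted-export");
        job.setJarByClass(SortedExportJob.class);

        // Scan the source table via TableInputFormat (one map task per region).
        TableMapReduceUtil.initTableMapperJob(
                "recordings", new Scan(), ExportMapper.class,
                ImmutableBytesWritable.class, Text.class, job);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Text.class);

        // Sample the row-key space, then partition so reducer outputs are
        // globally ordered: each reducer covers a non-overlapping key range.
        job.setNumReduceTasks(16);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/export-partitions"));
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<ImmutableBytesWritable, Result>(0.01, 10000));

        FileOutputFormat.setOutputPath(job, new Path("/exports/recordings"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```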
THE USE CASES
• Twitter Import
  • 2 years of tweets
  • About 95,000,000,000 tweets
  • Over 300 TB with added augmentation
  • Import was not as simple as you would imagine
THE USE CASES
• Archive
  • Not just the Firehose but the Ultrahose
  • Stored in HBase as well
  • The HBase architecture (BigTable) creates hot-spots with time-series data
  • Leading randomizing bit (see HBaseWD, and the salting sketch below)
  • Pre-split regions
  • Concurrent writes
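A sketch of the leading-salt idea in the spirit of HBaseWD (not its actual API; the bucket count and key layout are assumptions): a small bucket prefix derived from the interaction id spreads monotonically increasing timestamps across pre-split regions, at the cost of reads needing one scan per bucket, merged afterwards.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SaltedTimeSeriesKey {
    private static final int BUCKETS = 8;   // e.g. one per pre-split region (hypothetical count)

    // Key layout: [1-byte bucket][8-byte timestamp][interaction id].
    // The bucket is derived from the interaction id, so a reader can recompute it;
    // range scans then issue one scan per bucket and merge the results.
    public static byte[] saltedKey(long timestampMillis, byte[] interactionId) {
        byte bucket = (byte) ((Arrays.hashCode(interactionId) & 0x7fffffff) % BUCKETS);
        ByteBuffer key = ByteBuffer.allocate(1 + 8 + interactionId.length);
        key.put(bucket);            // randomizing prefix: writes fan out across regions
        key.putLong(timestampMillis);
        key.put(interactionId);
        return key.array();
    }
}
```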
THE USE CASES
• Historics
  • Export archive data
  • Slightly different from Exporter
  • Much larger time lines (1–3 months)
  • Controlled access to the Hadoop cluster with efficient job scheduling
  • Unfiltered input data
  • Therefore longer processing time
  • Hence more optimizations required (see the scan sketch below)
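A hedged sketch of a long, unfiltered Historics scan (assuming cell timestamps track interaction time; with salted keys you would additionally bound the start/stop row per bucket; class name and values are mine, not the talk's):

```java
import java.text.SimpleDateFormat;
import org.apache.hadoop.hbase.client.Scan;

public class HistoricsScan {
    // Builds an unfiltered scan over a multi-month window, with the usual
    // long-scan optimizations applied.
    public static Scan window(String fromDate, String toDate) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        Scan scan = new Scan();
        // No column or value filters: all data in the window is processed.
        scan.setTimeRange(fmt.parse(fromDate).getTime(), fmt.parse(toDate).getTime());
        scan.setCaching(1000);        // bigger scanner batches => fewer RPCs on long scans
        scan.setCacheBlocks(false);   // don't evict interactive data from the block cache
        return scan;
    }

    public static void main(String[] args) throws Exception {
        Scan q1 = window("2012-01-01", "2012-04-01");   // an example 3-month window
        System.out.println(q1);
    }
}
```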
THE LESSONS
• Tune, tune, tune (defaults == BAD)
• Based on the use case, tune:
  • Heap
  • Block size
  • Memstore size
• Keep the number of column families low
• Be aware of the hot-spotting issue when writing time-series data
THE LESSONS
• Use compression (e.g. Snappy) – see the schema sketch below
• Ops need an intimate understanding of the system
• Monitor system metrics (GC, CPU, compaction, I/O) and application metrics (writes/sec etc.)
• Don't be afraid to fiddle with the HBase code
• Using a distribution is advisable
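A hedged sketch pulling several of these lessons together, using the HBase 1.x-era admin API (table name, family name, and all values are illustrative, not DataSift's settings; RegionServer heap itself is set in hbase-env.sh rather than in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class TunedTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // A single column family: every extra family adds flushes and compactions.
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setCompressionType(Compression.Algorithm.SNAPPY); // cheap CPU for large I/O savings
        cf.setBlocksize(64 * 1024);                          // tune block size to the read pattern

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("archive"));
        table.addFamily(cf);
        // Per-table memstore flush size (larger flushes => fewer, bigger HFiles).
        table.setMemStoreFlushSize(256L * 1024 * 1024);

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            admin.createTable(table);
        }
    }
}
```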
QUESTIONS?
• We are hiring: http://datasift.com/about-us/careers