Hadoop at DataSift

Presentation given at Edinburgh University Student Tech-Meetup on 6th Feb, 2013.


  1. HADOOP AT DATASIFT
  2. ABOUT ME
     • Jairam Chandar
     • Big Data Engineer, DataSift
     • @jairamc
     • http://about.me/jairam
     • http://jairam.me
     • And I'm a Formula 1 fan!
  3. OUTLINE
     • What is DataSift?
     • Where do we use Hadoop?
       • The Numbers
       • The Use-cases
       • The Lessons
  4. !! SALES PITCH ALERT !!
  5.–13. WHAT IS DATASIFT? (image-only slides, no text content)
 14. THE NUMBERS
     • Machines: HBase
       • 60 machines as RegionServers
       • 1 HMaster
       • 3 ZooKeeper nodes
 15. THE NUMBERS
     • Machines: Hadoop
       • 135 machines divided into 2 clusters
       • DataNodes/TaskTrackers
       • NameNodes with high-availability failover
       • 1 JobTracker each
 16. THE NUMBERS
     • Machines: DL380 Gen8
       • 2 * Intel Xeon E5646 @ 2.40 GHz (24 cores total)
       • 48 GB RAM
       • 6 * 2 TB disks in JBOD (small partition on the first disk for the OS, the rest is storage)
       • 1 Gigabit network links
 17. THE NUMBERS
     • Data
       • Average load of 7,500 interactions per second
       • Peak loads of 15,000 interactions per second sustained over a minute
       • Peak of 21,000 interactions per second during the Super Bowl
       • Total current capacity ~1.6 PB; total current usage ~800 TB
       • Average interaction size of 2 KB: that's ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
       • And that's not it!
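A quick back-of-the-envelope check of slide 17's throughput figures (the slide's numbers are approximate, and the per-day figure presumably also folds in replication and compression assumptions not spelled out here):

```python
# Sanity-check the throughput arithmetic from the slide.
AVG_RATE = 7_500     # interactions per second (average load)
AVG_SIZE_KB = 2      # KB per interaction

kb_per_sec = AVG_RATE * AVG_SIZE_KB               # 15,000 KB/s
gb_per_min = kb_per_sec * 60 / 1024**2            # ~0.86 GB/min, i.e. roughly "1 GB a minute"
tb_per_day_raw = kb_per_sec * 86_400 / 1024**3    # ~1.2 TB/day before replication

print(f"{gb_per_min:.2f} GB/min, {tb_per_day_raw:.2f} TB/day raw")
```

At RF = 3 the raw daily volume triples on disk, which is why capacity planning here is driven by the replicated figure rather than the ingest rate alone.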
 18. THE USE CASES
     • HBase
       • Recordings
       • Archive
     • Map/Reduce
       • Exports
       • Historics
       • Migration
 19. THE USE CASES
     • Recordings
       • User-defined streams
       • Stored in HBase for later retrieval
       • Export to multiple output formats and stores
       • Row key: <recording-id><interaction-uuid>
       • Recording-id is a SHA-1 hash
       • Allows recordings to be distributed by their key without generating hot-spots
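Slide 19's row-key scheme can be sketched in a few lines. This is an illustration of the idea, not DataSift's actual code; the function name and the choice of what gets hashed are assumptions:

```python
import hashlib
import uuid

def recording_row_key(recording_name: str, interaction_uuid: uuid.UUID) -> bytes:
    """Compose an HBase row key as <recording-id><interaction-uuid>.

    Because the recording id is a SHA-1 hash (20 bytes), keys for
    different recordings spread uniformly across the key space instead
    of clustering by name, which avoids region hot-spots, while all
    interactions of one recording stay contiguous for range scans.
    """
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()
    return recording_id + interaction_uuid.bytes   # 20 + 16 = 36 bytes

key = recording_row_key("my-recording", uuid.uuid4())
```

A scan over the 20-byte prefix then retrieves one recording's interactions in order.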
 20. THE RECORDER
 21. THE USE CASES
     • Exporter
       • Export data from HBase for customers
       • Export files of ~5–10 GB, or ~3–6 million records
       • MR over HBase using TableInputFormat
       • But the data needs to be sorted
       • TotalOrderPartitioner
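The reason TotalOrderPartitioner gives a globally sorted export is worth spelling out: keys are routed to reducers by sampled split points, so every key in partition i sorts before every key in partition i+1, and concatenating the reducer outputs yields one sorted file. A minimal Python sketch of that routing (split points hard-coded here; Hadoop derives them by sampling the input):

```python
import bisect

# Split points would normally come from sampling the key space;
# these are hard-coded purely for illustration.
split_points = [b"g", b"p"]   # 3 partitions: [..g], (g..p], (p..]

def partition(key: bytes) -> int:
    """Route a key to a partition so that partition boundaries respect
    the global sort order, as TotalOrderPartitioner does."""
    return bisect.bisect_right(split_points, key)

keys = [b"zebra", b"apple", b"mango", b"kiwi"]
parts = [partition(k) for k in keys]
```

Within each partition the MR framework sorts by key, so the per-reducer outputs are already in their final global positions.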
 22. EXPORTER
 23. HISTORICS
 24. THE USE CASES
     • Twitter Import
       • 2 years of Tweets
       • About 95,000,000,000 tweets
       • Over 300 TB with added augmentation
       • Import was not as simple as you would imagine
 25. THE USE CASES
     • Archive
       • Not just the Firehose but the Ultrahose
       • Stored in HBase as well
       • The HBase architecture (BigTable) creates hot-spots with time-series data
       • Leading randomizing bit (see HBaseWD)
       • Pre-split regions
       • Concurrent writes
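The "leading randomizing bit" trick from slide 25 (the approach HBaseWD implements) can be sketched as follows; the bucket count, function name, and key layout are illustrative assumptions, not the production scheme:

```python
import hashlib
import struct

NUM_BUCKETS = 16   # illustrative; would match the number of pre-split regions

def salted_key(timestamp_ms: int, event_id: bytes) -> bytes:
    """Prefix a monotonically increasing timestamp key with a
    deterministic bucket byte derived from the event id.

    Sequential timestamps then fan out across NUM_BUCKETS pre-split
    regions instead of hammering the single region that owns the
    current end of the key space.
    """
    bucket = hashlib.sha1(event_id).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + struct.pack(">Q", timestamp_ms) + event_id
```

The cost of the salt is on the read side: a time-range query becomes NUM_BUCKETS parallel scans whose results are merged, which is a deliberate trade of read fan-out for write throughput.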
 26. THE USE CASES
     • Historics
       • Export archive data
       • Slightly different from the Exporter
       • Much larger timelines (1–3 months)
       • Controlled access to the Hadoop cluster with efficient job scheduling
       • Unfiltered input data
         • Therefore longer processing time
         • Hence more optimizations required
 27. HISTORICS
 28. THE LESSONS
     • Tune, tune, tune (default == BAD)
     • Based on your use case, tune:
       • Heap
       • Block size
       • Memstore size
     • Keep the number of column families low
     • Be aware of the hot-spotting issue when writing time-series data
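For orientation, the three knobs named on slide 28 map to concrete HBase settings. This is a hedged sketch: the property names are from stock HBase, but all values are purely illustrative, not the ones used at DataSift:

```shell
# hbase-env.sh: RegionServer heap (illustrative value)
export HBASE_HEAPSIZE=8G

# hbase-site.xml: memstore flush threshold (illustrative value)
#   <property>
#     <name>hbase.hregion.memstore.flush.size</name>
#     <value>134217728</value>   <!-- 128 MB -->
#   </property>

# HBase shell: HFile block size is set per column family
#   alter 'archive', {NAME => 'd', BLOCKSIZE => '65536'}
```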
 29. THE LESSONS
     • Use compression (e.g. Snappy)
     • Ops need an intimate understanding of the system
     • Monitor system metrics (GC, CPU, compactions, I/O) and application metrics (writes/sec etc.)
     • Don't be afraid to fiddle with the HBase code
     • Using a distribution is advisable
 30. QUESTIONS?
     We are hiring: http://datasift.com/about-us/careers
