Hadoop at DataSift
Presentation given at Edinburgh University Student Tech-Meetup on 6th Feb, 2013.

Transcript

  • 1. HADOOP AT DATASIFT
  • 2. ABOUT ME: Jairam Chandar, Big Data Engineer at DataSift (@jairamc, http://about.me/jairam, http://jairam.me). And I'm a Formula 1 fan!
  • 3. OUTLINE
      • What is DataSift?
      • Where do we use Hadoop?
          • The numbers
          • The use cases
          • The lessons
  • 4. !! SALES PITCH ALERT !!
  • 5.–13. WHAT IS DATASIFT? (title repeated across image-only slides)
  • 14. THE NUMBERS: machines (HBase)
      • 60 machines as RegionServers
      • 1 HMaster
      • 3 ZooKeeper nodes
  • 15. THE NUMBERS: machines (Hadoop)
      • 135 machines divided into 2 clusters
      • DataNodes/TaskTrackers
      • NameNodes with high-availability failover
      • 1 JobTracker each
  • 16. THE NUMBERS: machines (hardware)
      • DL380 Gen8
      • 2 × Intel Xeon E5646 @ 2.40 GHz (24 cores total)
      • 48 GB RAM
      • 6 × 2 TB disks in JBOD (small partition on the first disk for the OS, rest is storage)
      • 1 Gigabit network links
  • 17. THE NUMBERS: data
      • Average load of 7,500 interactions per second
      • Peak loads of 15,000 interactions per second sustained over a minute
      • Peak of 21,000 interactions per second during the Super Bowl
      • Total current capacity ~1.6 PB; total current usage ~800 TB
      • Average interaction size 2 KB; that's ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
      • And that's not it!
  • 18. THE USE CASES
      • HBase: Recordings, Archive
      • Map/Reduce: Exports, Historics, Migration
  • 19. THE USE CASES: Recordings
      • User-defined streams
      • Stored in HBase for later retrieval
      • Exported to multiple output formats and stores
      • Row key: <recording-id><interaction-uuid>
      • Recording-id is a SHA-1 hash
      • Allows recordings to be distributed by their key without generating hot spots
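The composite key scheme on this slide can be sketched as below. The function name is illustrative, and the assumption that the interaction UUID is appended as 16 raw bytes after the 20-byte SHA-1 digest is mine, not DataSift's:

```python
import hashlib
import uuid

def recording_row_key(recording_name: str, interaction_id: str) -> bytes:
    """Build an HBase row key of the form <recording-id><interaction-uuid>.

    Hashing the recording name with SHA-1 spreads recordings evenly
    across the key space, so heavy writes for one recording do not all
    land on the same region (no hot-spotting), while all interactions
    of one recording stay contiguous for efficient scans.
    """
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + uuid.UUID(interaction_id).bytes                 # + 16 bytes

key = recording_row_key("my-stream", "12345678-1234-5678-1234-567812345678")
print(len(key))  # 36-byte composite key
```

Because the prefix is a hash, a scan over one recording is a single contiguous range, while different recordings scatter uniformly across RegionServers.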
  • 20. THE RECORDER
  • 21. THE USE CASES: Exporter
      • Exports data from HBase for customers
      • Export files ~5–10 GB, or ~3–6 million records
      • MapReduce over HBase using TableInputFormat
      • But the data needs to be sorted: TotalOrderPartitioner
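TotalOrderPartitioner gets a globally sorted export by routing keys to reducers via sampled split points: every key in reducer p sorts before every key in reducer p+1, so the per-reducer sorted outputs concatenate into one sorted file. A minimal Python sketch of that idea (function names are illustrative, not Hadoop's API):

```python
import bisect
import random

def sample_split_points(keys, num_partitions, sample_size=1000):
    """Sample the key space and pick num_partitions - 1 split points,
    roughly what Hadoop's InputSampler does for TotalOrderPartitioner."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition(key, split_points):
    """Route a key to the partition whose key range contains it."""
    return bisect.bisect_right(split_points, key)

keys = [f"user{i:05d}" for i in range(10000)]
splits = sample_split_points(keys, num_partitions=4)
# Keys in partition 0 all sort before keys in partition 1, and so on,
# so sorting within each partition yields a globally sorted result.
```

The default hash partitioner balances load but interleaves key ranges across reducers; sampling trades a pre-pass over the input for split points that keep both balance and global order.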
  • 22. EXPORTER
  • 23. HISTORICS
  • 24. THE USE CASES: Twitter import
      • 2 years of tweets
      • About 95,000,000,000 tweets
      • Over 300 TB with added augmentation
      • The import was not as simple as you would imagine
  • 25. THE USE CASES: Archive
      • Not just the Firehose but the Ultrahose
      • Stored in HBase as well
      • HBase's architecture (BigTable) creates hot spots with time-series data
      • Leading randomizing bit (see HBaseWD)
      • Pre-split regions
      • Concurrent writes
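The "leading randomizing bit" is key salting, the approach taken by the HBaseWD library: a deterministic bucket prefix is prepended to each monotonically increasing time-series key, so concurrent writers spread across pre-split regions instead of all hammering the newest one. A hedged sketch (the bucket count and hash choice here are illustrative):

```python
import hashlib

BUCKETS = 16  # one pre-split region per bucket prefix (illustrative count)

def salted_key(row_key: str) -> bytes:
    """Prefix a time-ordered key with a deterministic bucket byte.

    The prefix is derived from the key itself, so the same key always
    maps to the same bucket and point reads still work; sequential
    timestamps, however, fan out across all BUCKETS regions.
    """
    bucket = int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16) % BUCKETS
    return bytes([bucket]) + row_key.encode("utf-8")

# Trade-off: a time-range scan must now fan out over all BUCKETS
# prefixes and merge the results client-side.
key = salted_key("2013-02-06T12:00:00Z")
```

Pre-splitting the table at each bucket boundary ensures the write load is spread from the first insert, rather than waiting for organic region splits.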
  • 26. THE USE CASES: Historics
      • Exports archive data
      • Slightly different from the Exporter
      • Much larger timelines (1–3 months)
      • Controlled access to the Hadoop cluster with efficient job scheduling
      • Unfiltered input data, therefore longer processing time, hence more optimizations required
  • 27. HISTORICS
  • 28. THE LESSONS
      • Tune, tune, tune (defaults == BAD)
      • Based on your use case, tune: heap, block size, memstore size
      • Keep the number of column families low
      • Be aware of the hot-spotting issue when writing time-series data
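As one illustrative example of that tuning: memstore and block-cache settings live in hbase-site.xml, while the RegionServer heap is set via HBASE_HEAPSIZE in hbase-env.sh. The values below are placeholders to be tuned per workload, not recommendations from the talk:

```xml
<!-- hbase-site.xml fragment; values are illustrative only -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- flush a region's memstore at 128 MB -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value> <!-- fraction of heap given to the read block cache -->
</property>
```

Write-heavy clusters typically shift heap toward memstores, read-heavy ones toward the block cache; HBase refuses to start if the two fractions sum too close to the full heap.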
  • 29. THE LESSONS
      • Use compression (e.g. Snappy)
      • Ops need an intimate understanding of the system
      • Monitor system metrics (GC, CPU, compaction, I/O) and application metrics (writes/sec etc.)
      • Don't be afraid to fiddle with HBase code
      • Using a distribution is advisable
  • 30. QUESTIONS? We are hiring: http://datasift.com/about-us/careers
