Hadoop at Datasift

Slides from the presentation at Hadoop UK User group meetup in London as part of BigDataWeek.



    Hadoop at Datasift: Presentation Transcript

    • Hadoop at Datasift
    • About me – Jairam Chandar, Big Data Engineer, Datasift – @jairamc – http://about.me/jairam – http://blog.jairam.me
    • Outline What is Datasift? Where do we use Hadoop? – The Numbers – The Use-cases – The Lessons
    • !! Sales Pitch Alert !!
    • What is Datasift? (series of image-only slides; no transcribed text)
    • The Numbers Machines – 60 machines ● Datanode ● Tasktracker ● RegionServer – 2 machines ● Namenode – 2 machines ● HBase Master
    • The Numbers Machines – 2 * Intel Xeon E5620 @ 2.40GHz (16 core total) – 24GB RAM – 6 * 2 TB disks in JBOD (small partition on frst disk for OS, rest is storage) – 1 Gigabit network links
    • The Numbers Data – Avg load of 3500 interactions/second – Peak load of 8000+ interactions/second – Highest during the Superbowl – 12000 interactions/second – Avg size of interaction 2 KB – thats ~ 1GB a min or ~ 2 TB a day with replication (RF = 3) – And thats not it!
    • The Use Cases HBase – Recordings – Archive/Ultrahose Map/Reduce – Exports – Historics
    • The Use Cases Recordings – User defned streams – Stored in HBase for later retrieval – Export to multiple output formats and stores – <recording-id><interaction-uuid> ● Recording-id is a SHA-1 hash ● Allows recordings to be distributed by their key without generating hot-spots.
    • The Use Cases Recordings continued ...
    • The Use Cases Exporter – Export data from HBase for customer – Export fles ~ 5 – 10 GB or ~ 3-6 million records – MR over HBase using TableInputFormat – But the data needs to be sorted ● TotalOrderPartioner
    • The Use Cases Exporter Continued
    • !! Sales Pitch Alert !!
    • Historics
    • The Use Cases Twitter import 2 years of Tweets – About 95,000,000,000 tweets – Over 300 TB with added augmentation – Import was not as simple as you would imagine
    • The Use Cases Archive/Ultrahose – Not just the Firehose but the Ultrahose – Stored in HBase as well – HBase architecture (BigTable) creates Hotspots with Time Series data ● Leading randomizing bit (see HBaseWD) ● Pre-split regions ● Concurrent writes
    • The Use Cases Historics – Export archive data – Slightly different from Exporter ● Much larger time lines (1 – 3 months) ● Unfltered Input Data ● Therefore longer processing time ● Hence more optimizations required
    • The Use Cases Historics continued ...
    • The Lessons Tune Tune Tune (Default == BAD) Based on use case tune - – Heap – Block Size – Memstore size Keep number of column families low Be aware of hot-spotting issue when writing time-series data
    • The Lessons Use compression (eg. Snappy) Ops need intimate understanding of system Monitor system metrics (GC, CPU, Compaction, I/O) and application metrics (writes/sec etc) Dont be afraid to fddle with HBase code Using a distribution is advisable
    • Questions?