Hadoop at Datasift
About me
Jairam Chandar
Big Data Engineer
Datasift
@jairamc
http://about.me/jairam
http://blog.jairam.me
Outline
 What is Datasift?
 Where do we use Hadoop?
– The Numbers
– The Use-cases
– The Lessons
!! Sales Pitch Alert !!
What is Datasift?
The Numbers
 Machines
– 60 machines
●
Datanode
●
Tasktracker
●
RegionServer
– 2 machines
●
Namenode
– 2 machines
●
HBase Master
– In the process of doubling our capacity
The Numbers
 Machines
– 2 * Intel Xeon E5620 @ 2.40GHz (16 logical cores total)
– 24GB RAM
– 6 * 2 TB disks in JBOD (small partition on first
disk for OS, rest is storage)
– 1 Gigabit network links
The Numbers
 Data
– Avg load of 3500 interactions/second
– Peak load of 6000 interactions/second
– Highest during the Superbowl – 12000
interactions/second
– Avg size of interaction 2 KB – that's roughly 2 TB a day
with replication (RF = 3)
– And that's not it!
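The daily volume can be sanity-checked with a quick back-of-envelope calculation. This is only a sketch: it assumes the average load holds for the whole day and that "2 KB" means 2000 bytes; the function name is illustrative.

```python
# Back-of-envelope check of the daily volume quoted on the slide.
AVG_INTERACTIONS_PER_SEC = 3500    # average load from the slide
AVG_INTERACTION_BYTES = 2 * 1000   # "2 KB", assuming decimal kilobytes
REPLICATION_FACTOR = 3             # RF from the slide
SECONDS_PER_DAY = 24 * 60 * 60


def daily_bytes(rate_per_sec, size_bytes, replication=1):
    """Bytes written per day at a constant rate, including replicas."""
    return rate_per_sec * size_bytes * SECONDS_PER_DAY * replication


raw_tb = daily_bytes(AVG_INTERACTIONS_PER_SEC, AVG_INTERACTION_BYTES) / 1e12
replicated_tb = daily_bytes(
    AVG_INTERACTIONS_PER_SEC, AVG_INTERACTION_BYTES, REPLICATION_FACTOR
) / 1e12

print(f"raw: {raw_tb:.2f} TB/day, replicated: {replicated_tb:.2f} TB/day")
# prints "raw: 0.60 TB/day, replicated: 1.81 TB/day"
```

At the average load this comes to about 1.8 TB a day after replication, which is consistent with the rounded 2 TB figure once peaks are taken into account.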
The Use Cases
 HBase
– Recordings
– Archive/Ultrahose
 Map/Reduce
– Exports
– Historics
The Use Cases
 Recordings
– User-defined streams
– Stored in HBase for later retrieval
– Export to multiple output formats and stores
– <recording-id><interaction-uuid>
●
Recording-id is a SHA-1 hash
●
Allows recordings to be distributed by their key
without generating hot-spots.
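A minimal sketch of how such a row key could be built. The slide only gives the layout <recording-id><interaction-uuid>; what is fed into SHA-1 and the binary encoding are assumptions here, and the function name is hypothetical.

```python
import hashlib
import uuid


def recording_row_key(recording_name: str, interaction_id: uuid.UUID) -> bytes:
    """Build <recording-id><interaction-uuid> as raw bytes.

    Assumption: the recording-id is the SHA-1 digest of some stable
    recording identifier (here a name string); the slide does not say
    what is actually hashed.
    """
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + interaction_id.bytes                            # + 16 bytes


key_a = recording_row_key("my-stream", uuid.uuid4())
key_b = recording_row_key("my-stream", uuid.uuid4())
key_c = recording_row_key("other-stream", uuid.uuid4())

# Rows of one recording share a 20-byte prefix, so a prefix scan retrieves
# the whole recording, while different recordings land in unrelated parts
# of the key space because SHA-1 output is effectively uniform.
print(key_a[:20] == key_b[:20], key_a[:20] == key_c[:20])
```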
The Use Cases
 Recordings continued ...
The Use Cases
 Exporter
– Export data from HBase for customer
– Export files of 5–10 GB, or 3–6 million records
– MR over HBase using TableInputFormat
– But the data needs to be sorted
●
TotalOrderPartitioner
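The idea behind total-order partitioning can be sketched in a few lines. This is a plain-Python illustration of the technique, not the Hadoop API: keys are sampled, split points are chosen from the sample, and each key is routed to the partition whose range contains it, so sorting each partition independently and concatenating the results gives a globally sorted output. All names and sizes below are illustrative.

```python
import bisect
import random


def choose_split_points(sample_keys, num_partitions):
    """Pick num_partitions - 1 boundaries from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Index of the key range (i.e. reducer) that this key belongs to."""
    return bisect.bisect_right(split_points, key)


rng = random.Random(0)
NUM_PARTITIONS = 4  # stands in for the number of reducers
keys = [rng.getrandbits(32) for _ in range(10_000)]
splits = choose_split_points(rng.sample(keys, 500), NUM_PARTITIONS)

partitions = [[] for _ in range(NUM_PARTITIONS)]
for k in keys:
    partitions[partition_for(k, splits)].append(k)

# Each "reducer" sorts only its own partition; concatenating the partitions
# in order is already globally sorted because the key ranges do not overlap.
merged = [k for part in partitions for k in sorted(part)]
print(merged == sorted(keys))
```

Sampling matters because evenly spaced split points only balance the reducers if they reflect the real key distribution.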
The Use Cases
 Exporter Continued
!! Sales Pitch Alert !!
Historics
The Use Cases
 Archive/Ultrahose
– Not just the Firehose but the Ultrahose
– Stored in HBase as well
– HBase architecture (BigTable) creates hot-spots with
time-series data
●
Leading randomizing bit (see HBaseWD)
●
Pre-split regions
●
Concurrent writes
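A minimal sketch of the salting idea, in the spirit of HBaseWD but not using its API. The bucket count, key layout, and function names are assumptions for illustration only.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; would be chosen to match the pre-split regions


def salted_row_key(timestamp_ms: int, interaction_id: str) -> bytes:
    """Prefix a time-ordered key with a one-byte bucket derived from the id.

    Without the prefix, consecutive timestamps sort next to each other and
    every write lands on the region holding the newest keys.
    """
    id_bytes = interaction_id.encode("utf-8")
    bucket = hashlib.sha1(id_bytes).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + timestamp_ms.to_bytes(8, "big") + id_bytes


def presplit_points():
    """Region boundaries so that each bucket starts in its own region."""
    return [bytes([b]) for b in range(1, NUM_BUCKETS)]


# Writes for nearly identical timestamps now spread over many buckets.
buckets_hit = {salted_row_key(1_000 + i, f"id-{i}")[0] for i in range(200)}
print(len(buckets_hit), "buckets used;", len(presplit_points()), "split points")
```

The trade-off is on the read side: scanning a time range means issuing one scan per bucket and merging the results.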
The Use Cases
 Archive continued …
 2 years of Tweets
– 11 TB compressed
– <Number of tweets we got>
The Use Cases
 Historics
– Export archive data
– Slightly different from Exporter
●
Much larger time lines (1 – 3 months)
●
Unfiltered input data
●
Therefore longer processing time
●
Hence more optimizations required
The Use Cases
 Historics continued ...
The Lessons - HBase
 Tune Tune Tune (Default == BAD)
 Tune based on your use case:
– Heap
– Block Size
– Memstore size
 Keep number of column families low
 Be aware of hot-spotting issue when writing time-
series data
 Use compression (e.g. Snappy)
The Lessons - HBase
 Ops need intimate understanding of
system
 Monitor metrics (GC, CPU, Compaction,
I/O)
 Don't be afraid to fiddle with HBase code
 Using a distribution is advisable
Questions?
