Hadoop at Datasift
About me
Jairam Chandar
Big Data Engineer
Datasift
@jairamc
http://about.me/jairam
http://blog.jairam.me
Outline
➢ What is Datasift?
➢ Where do we use Hadoop?
  – The Numbers
  – The Use Cases
  – The Lessons
!! Sales Pitch Alert !!
What is Datasift?
The Numbers
➢ Machines
  – 60 machines
    ● Datanode
    ● Tasktracker
    ● RegionServer
  – 2 machines
    ● Namenode
  – 2 machines
    ● HBase Master
  – In the process of doubling our capacity
The Numbers
➢ Machines
  – 2 * Intel Xeon E5620 @ 2.40GHz (16 logical cores total)
  – 24GB RAM
  – 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)
  – 1 Gigabit network links
The Numbers
➢ Data
  – Avg load of 3500 interactions/second
  – Peak load of 6000 interactions/second
  – Highest during the Super Bowl: 12000 interactions/second
  – Avg size of interaction is 2 KB; that's about 2 TB a day with replication (RF = 3)
  – And that's not all!
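The daily-volume figure above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes) and that the average load holds around the clock. The function and constant names are illustrative, not from the talk.

```python
# Rough daily storage estimate from the average load on the slide.
AVG_INTERACTIONS_PER_SEC = 3500
AVG_INTERACTION_BYTES = 2 * 1000   # 2 KB, assuming decimal units
REPLICATION_FACTOR = 3             # HDFS replication (RF = 3)
SECONDS_PER_DAY = 24 * 60 * 60


def daily_bytes(rate_per_sec, size_bytes, replication):
    """Bytes written per day at a constant rate, including replicas."""
    return rate_per_sec * size_bytes * SECONDS_PER_DAY * replication


raw = daily_bytes(AVG_INTERACTIONS_PER_SEC, AVG_INTERACTION_BYTES, 1)
replicated = daily_bytes(
    AVG_INTERACTIONS_PER_SEC, AVG_INTERACTION_BYTES, REPLICATION_FACTOR
)

print(f"raw:        {raw / 1e12:.2f} TB/day")         # about 0.60 TB/day
print(f"replicated: {replicated / 1e12:.2f} TB/day")  # about 1.81 TB/day
```

Under these assumptions the replicated volume comes to roughly 1.8 TB a day, which is consistent with the "2 TB a day" quoted on the slide.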
The Use Cases
➢ HBase
  – Recordings
  – Archive/Ultrahose
➢ Map/Reduce
  – Exports
  – Historics
The Use Cases
➢ Recordings
  – User-defined streams
  – Stored in HBase for later retrieval
  – Export to multiple output formats and stores
  – Row key: <recording-id><interaction-uuid>
    ● Recording-id is a SHA-1 hash
    ● Allows recordings to be distributed by their key without generating hot-spots
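A minimal sketch of how such a composite key might be built. The slide does not say what is hashed to produce the recording-id, so hashing a recording name is an assumption here, as are the function name and the raw-byte layout.

```python
import hashlib
import uuid


def recording_row_key(recording_name: str, interaction_id: uuid.UUID) -> bytes:
    """Build <recording-id><interaction-uuid> as raw bytes.

    Assumption: the recording-id is the SHA-1 digest of a recording name.
    """
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + interaction_id.bytes                            # + 16 bytes


key_a = recording_row_key("recording-a", uuid.uuid4())
key_b = recording_row_key("recording-a", uuid.uuid4())
key_c = recording_row_key("recording-b", uuid.uuid4())

# Rows of one recording share a 20-byte prefix, so they sort together and
# can be fetched with a prefix scan; different recordings get unrelated
# prefixes, which spreads them across the key space.
print(key_a[:20] == key_b[:20])  # True
print(key_a[:20] == key_c[:20])  # False
```

Because SHA-1 output is effectively uniform, recordings created one after another do not land next to each other in the table, which is what avoids the hot-spot.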
The Use Cases
➢ Recordings continued ...
The Use Cases
➢ Exporter
  – Export data from HBase for customer
  – Export files of 5–10 GB, or 3–6 million records
  – MR over HBase using TableInputFormat
  – But the data needs to be sorted
    ● TotalOrderPartitioner
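Hadoop's TotalOrderPartitioner assigns keys to reducers by range, using split points taken from a sample of the input, so that the reducer outputs concatenated in order form one globally sorted result. The sketch below illustrates that idea in plain Python; it is not the Hadoop API, and all function names and the sampling strategy are illustrative assumptions.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Keys below split_points[0] go to partition 0, the next range to 1, ..."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_partitions, sample_size=100, seed=0):
    """Range-partition keys, then sort each partition independently."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    splits = choose_split_points(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, splits)].append(key)
    # Each "reducer" sorts only its own range of keys.
    return [sorted(part) for part in partitions]


data_rng = random.Random(42)
keys = [data_rng.randrange(1_000_000) for _ in range(10_000)]
parts = total_order_sort(keys, num_partitions=4)

# Concatenating the partitions in order yields a fully sorted sequence.
merged = [k for part in parts for k in part]
print(merged == sorted(keys))  # True
```

With a default hash partitioner, each reducer's output would be sorted internally but the files could not simply be concatenated. Range partitioning is what makes the combined export sorted.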
The Use Cases
➢ Exporter continued
!! Sales Pitch Alert !!
Historics
The Use Cases
➢ Archive/Ultrahose
  – Not just the Firehose but the Ultrahose
  – Stored in HBase as well
  – HBase architecture (BigTable) creates hotspots with time-series data
    ● Leading randomizing bit (see HBaseWD)
    ● Pre-split regions
    ● Concurrent writes
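The hot-spotting fix listed above can be sketched as follows: prepend a small bucket prefix to an otherwise time-ordered key, and pre-split the table so that each bucket starts in its own region. This is a conceptual sketch only. It does not reproduce the HBaseWD API, and the bucket count, key layout and function names are assumptions.

```python
import struct
import zlib

NUM_BUCKETS = 16  # hypothetical bucket count


def salted_key(timestamp_ms: int, interaction_id: bytes) -> bytes:
    """<bucket byte><big-endian timestamp><interaction id>."""
    bucket = zlib.crc32(interaction_id) % NUM_BUCKETS
    return bytes([bucket]) + struct.pack(">Q", timestamp_ms) + interaction_id


def presplit_points(num_buckets: int = NUM_BUCKETS):
    """Region boundaries so that every bucket starts in its own region."""
    return [bytes([b]) for b in range(1, num_buckets)]


def scan_ranges(start_ms: int, end_ms: int, num_buckets: int = NUM_BUCKETS):
    """A time-range read fans out into one (start, stop) scan per bucket."""
    return [
        (bytes([b]) + struct.pack(">Q", start_ms),
         bytes([b]) + struct.pack(">Q", end_ms))
        for b in range(num_buckets)
    ]


# Consecutive timestamps no longer pile onto a single region.
buckets_hit = {salted_key(t, str(t).encode())[0] for t in range(1000)}
print(len(buckets_hit))
print(len(presplit_points()), len(scan_ranges(0, 1000)))
```

The trade-off is on the read side: a scan over a time window becomes one scan per bucket whose results have to be merged, which is the price paid for spreading the writes.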
The Use Cases
➢ Archive continued …
➢ 2 years of Tweets
  – 11 TB compressed
  – <Number of tweets we got>
The Use Cases
➢ Historics
  – Export archive data
  – Slightly different from Exporter
    ● Much larger timelines (1–3 months)
    ● Unfiltered input data
    ● Therefore longer processing time
    ● Hence more optimizations required
The Use Cases
➢ Historics continued ...
The Lessons - HBase
➢ Tune, tune, tune (default == BAD)
➢ Tune based on use case:
  – Heap
  – Block size
  – Memstore size
➢ Keep the number of column families low
➢ Be aware of the hot-spotting issue when writing time-series data
➢ Use compression (e.g. Snappy)
The Lessons - HBase
➢ Ops need an intimate understanding of the system
➢ Monitor metrics (GC, CPU, compaction, I/O)
➢ Don't be afraid to fiddle with HBase code
➢ Using a distribution is advisable
Questions?
