SlideShare a Scribd company logo
1 of 17
Couchbase to Hadoop at Linkedin
Kafka is Enabling the Big Data Pipeline
• Define Problem Domain
Justin Michaels | Solution Architect, Couchbase
• Use case at LinkedIn
Michael Kehoe | Site Reliability Engineer, Linkedin
• Supporting Technology Overview and Demo
Matt Ingenthron | Senior Director, Couchbase
• Q&A
Agenda
2
Lambda Architecture
4
1
2
3
4
5
DATA
BATCH
SPEED
SERVE
QUER
Y
Lambda Architecture
5
Interactive and
Real Time Applications
1
2
3
4
5
DATA
BATCH
SPEED
SERVE
QUER
YHADOOP
COUCHBASE
STORM
COUCHBASEBroker
Cluster
Spout for
Topic
Kafka
Producers
Ordered
Subscriptions
• Hadoop … an open-source framework written for distributed storage
and distributed processing of very large data sets on commodity
hardware
• Kafka … append only write-ahead log that records messages to a
persistent store and allows subscribers to read and apply these
changes to their own stores in an appropriate time-frame
• Storm … distributed framework that uses custom created "spouts"
and "bolts" to define information sources and manipulations for
processing of streaming data
• Couchbase … an open source, distributed NoSQL document-
oriented database that is optimized for interactive applications with
an integrated data cache and incremental map reduce facility
6
COMPLEX
EVENT PROCESSING
Real Time
REPOSITORY
PERPETUAL
STORE
ANALYTICAL
DB
BUSINESS
INTELLIGENCE
MONITORING
CHAT/VOICE
SYSTEM
BATCH
TRACK
REAL-TIME
TRACK
DASHBOARD
TRACKING and
COLLECTION
ANALYSIS AND
VISUALIZATION
REST FILTER METRICS
Use Case at Linkedin
10
• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from
the University of Queensland,
Australia
Michael Kehoe
• Kafka was created by LinkedIn
• Kafka is a publish-subcribe system built as a distributed commit log
• Processes 500+ TB/ day (~500 billion messages) @ LinkedIn
Kafka @ LinkedIn
• Monitoring
• InGraphs
• Traditional Messaging (Pub-Sub)
• Analytics
• Who Viewed my Profile
• Experiment reports
• Executive reports
• Building block for (log) distributibuted applications
• Pinot
• Espresso
LinkedIn’s uses of Kafka
Use Case: Kafka to Hadoop (Analytics)
• LinkedIn tracks data to better understand how members use our
products
• Information such as which page got viewed and which content got
clicked on are sent into a Kafka cluster in each data center
• Some of these events are all centrally collected and pushed onto our
Hadoop grid for analysis and daily report generation
Couchbase @ LinkedIn
• About 25 separate services with one or more clusters in multiple data
centers
• Up to 100 servers in a cluster
• Single and Multi-tenant clusters
Use Case: Jobs Cluster
• Read scaling, Couchbase ~80k QPS, 24 server cluster(s)
• Hadoop to pre-build data by partition
• Couchbase 99 percentile latencies
Hadoop to Couchbase
• Our primary use-case for Hadoop  Couchbase is for building
(warming) / recovering Couchbase buckets
• LinkedIn built it’s own in-house solution to work with our ETL
processes, cache invalidation procedures etc

More Related Content

What's hot

Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...HostedbyConfluent
 
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...HostedbyConfluent
 
Apache Kafka: Past, Present and Future
Apache Kafka: Past, Present and FutureApache Kafka: Past, Present and Future
Apache Kafka: Past, Present and Futureconfluent
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...HostedbyConfluent
 
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...HostedbyConfluent
 
Bootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source ToolsBootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source Toolsbotsplash.com
 
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...HostedbyConfluent
 
Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...
Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...
Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...confluent
 
0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche
0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche
0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, PorscheHostedbyConfluent
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLSingleStore
 
Instaclustr: When and how to migrate from a relational database to Cassandra
Instaclustr: When and how to migrate from a relational database to CassandraInstaclustr: When and how to migrate from a relational database to Cassandra
Instaclustr: When and how to migrate from a relational database to CassandraDataStax Academy
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...HostedbyConfluent
 
Death of the dumb pipes: Using Apache Kafka® for Integration projects
Death of the dumb pipes: Using Apache Kafka® for Integration projectsDeath of the dumb pipes: Using Apache Kafka® for Integration projects
Death of the dumb pipes: Using Apache Kafka® for Integration projectsHostedbyConfluent
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Data Con LA
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...HostedbyConfluent
 
Building Software to Scale
Building Software to Scale Building Software to Scale
Building Software to Scale SingleStore
 
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...HostedbyConfluent
 
Column and hadoop
Column and hadoopColumn and hadoop
Column and hadoopAlex Jiang
 
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...HostedbyConfluent
 

What's hot (20)

Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...
 
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ...
 
Apache Kafka: Past, Present and Future
Apache Kafka: Past, Present and FutureApache Kafka: Past, Present and Future
Apache Kafka: Past, Present and Future
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,...
 
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
Streaming Data in the Cloud with Confluent and MongoDB Atlas | Robert Walters...
 
Bootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source ToolsBootstrap SaaS startup using Open Source Tools
Bootstrap SaaS startup using Open Source Tools
 
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala...
 
Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...
Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...
Kafka Summit NYC 2017 - Achieving Predictability and Compliance with BNY Mell...
 
0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche
0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche
0-330km/h: Porsche's Data Streaming Journey | Sridhar Mamella, Porsche
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
 
Instaclustr: When and how to migrate from a relational database to Cassandra
Instaclustr: When and how to migrate from a relational database to CassandraInstaclustr: When and how to migrate from a relational database to Cassandra
Instaclustr: When and how to migrate from a relational database to Cassandra
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
 
Death of the dumb pipes: Using Apache Kafka® for Integration projects
Death of the dumb pipes: Using Apache Kafka® for Integration projectsDeath of the dumb pipes: Using Apache Kafka® for Integration projects
Death of the dumb pipes: Using Apache Kafka® for Integration projects
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
 
Building Software to Scale
Building Software to Scale Building Software to Scale
Building Software to Scale
 
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
 
Column and hadoop
Column and hadoopColumn and hadoop
Column and hadoop
 
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
 

Viewers also liked

SRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentMichael Kehoe
 
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedInCouchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedInMichael Kehoe
 
SouthBay SRE Meetup Jan 2016
SouthBay SRE Meetup Jan 2016SouthBay SRE Meetup Jan 2016
SouthBay SRE Meetup Jan 2016Michael Kehoe
 
Couchbase Connect 2016
Couchbase Connect 2016Couchbase Connect 2016
Couchbase Connect 2016Michael Kehoe
 
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...Michael Kehoe
 
Using SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production SystemsUsing SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production SystemsMichael Kehoe
 
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Reducing MTTR and False Escalations: Event Correlation at LinkedInReducing MTTR and False Escalations: Event Correlation at LinkedIn
Reducing MTTR and False Escalations: Event Correlation at LinkedInMichael Kehoe
 
Software reliability tools and common software errors
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errorsHimanshu
 
How TPM saves the day
How TPM saves the dayHow TPM saves the day
How TPM saves the dayPooja Tangi
 
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Diego Pacheco
 
AppSensor CodeMash 2017
AppSensor CodeMash 2017AppSensor CodeMash 2017
AppSensor CodeMash 2017jtmelton
 
Software Reliability Engineering
Software Reliability EngineeringSoftware Reliability Engineering
Software Reliability Engineeringguest90cec6
 
Software reliability growth model
Software reliability growth modelSoftware reliability growth model
Software reliability growth modelHimanshu
 
CQRS & event sourcing in the wild
CQRS & event sourcing in the wildCQRS & event sourcing in the wild
CQRS & event sourcing in the wildMichiel Rook
 
Feedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialFeedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialPooja Tangi
 
Going Serverless with CQRS on AWS
Going Serverless with CQRS on AWSGoing Serverless with CQRS on AWS
Going Serverless with CQRS on AWSAnton Udovychenko
 
Load balancing in the SRE way
Load balancing in the SRE wayLoad balancing in the SRE way
Load balancing in the SRE wayShawn Zhu
 
Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)Chris Richardson
 
A year with event sourcing and CQRS
A year with event sourcing and CQRSA year with event sourcing and CQRS
A year with event sourcing and CQRSSteve Pember
 
CQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDDCQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDDDennis Doomen
 

Viewers also liked (20)

SRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level Talent
 
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedInCouchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
 
SouthBay SRE Meetup Jan 2016
SouthBay SRE Meetup Jan 2016SouthBay SRE Meetup Jan 2016
SouthBay SRE Meetup Jan 2016
 
Couchbase Connect 2016
Couchbase Connect 2016Couchbase Connect 2016
Couchbase Connect 2016
 
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
 
Using SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production SystemsUsing SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production Systems
 
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Reducing MTTR and False Escalations: Event Correlation at LinkedInReducing MTTR and False Escalations: Event Correlation at LinkedIn
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
 
Software reliability tools and common software errors
Software reliability tools and common software errorsSoftware reliability tools and common software errors
Software reliability tools and common software errors
 
How TPM saves the day
How TPM saves the dayHow TPM saves the day
How TPM saves the day
 
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
Cloud Native, Microservices and SRE/Chaos Engineering: The new Rules of The G...
 
AppSensor CodeMash 2017
AppSensor CodeMash 2017AppSensor CodeMash 2017
AppSensor CodeMash 2017
 
Software Reliability Engineering
Software Reliability EngineeringSoftware Reliability Engineering
Software Reliability Engineering
 
Software reliability growth model
Software reliability growth modelSoftware reliability growth model
Software reliability growth model
 
CQRS & event sourcing in the wild
CQRS & event sourcing in the wildCQRS & event sourcing in the wild
CQRS & event sourcing in the wild
 
Feedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potentialFeedback loops: How SREs benefit and what is needed to realize their potential
Feedback loops: How SREs benefit and what is needed to realize their potential
 
Going Serverless with CQRS on AWS
Going Serverless with CQRS on AWSGoing Serverless with CQRS on AWS
Going Serverless with CQRS on AWS
 
Load balancing in the SRE way
Load balancing in the SRE wayLoad balancing in the SRE way
Load balancing in the SRE way
 
Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)
 
A year with event sourcing and CQRS
A year with event sourcing and CQRSA year with event sourcing and CQRS
A year with event sourcing and CQRS
 
CQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDDCQRS and Event Sourcing, An Alternative Architecture for DDD
CQRS and Event Sourcing, An Alternative Architecture for DDD
 

Similar to CouchbasetoHadoop_Matt_Michael_Justin v4

Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureOliver Buckley-Salmon
 
Distributed messaging through Kafka
Distributed messaging through KafkaDistributed messaging through Kafka
Distributed messaging through KafkaDileep Kalidindi
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud PlatformSujai Prakasam
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafkaconfluent
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectSoftServe
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 

Similar to CouchbasetoHadoop_Matt_Michael_Justin v4 (20)

Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architecture
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Apache drill
Apache drillApache drill
Apache drill
 
Distributed messaging through Kafka
Distributed messaging through KafkaDistributed messaging through Kafka
Distributed messaging through Kafka
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experience
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 

More from Michael Kehoe

Code Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayMichael Kehoe
 
QConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsMichael Kehoe
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayMichael Kehoe
 
AllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsMichael Kehoe
 
Linux Container Basics
Linux Container BasicsLinux Container Basics
Linux Container BasicsMichael Kehoe
 
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsMichael Kehoe
 
What the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortemsWhat the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortemsMichael Kehoe
 
PyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python ApplicationsPyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python ApplicationsMichael Kehoe
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayMichael Kehoe
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringMichael Kehoe
 
Building Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSFBuilding Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSFMichael Kehoe
 
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...Michael Kehoe
 
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...Michael Kehoe
 
SRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREsSRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREsMichael Kehoe
 
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleVelocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleMichael Kehoe
 

More from Michael Kehoe (17)

eBPF Workshop
eBPF WorkshopeBPF Workshop
eBPF Workshop
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
 
Code Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart way
 
QConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready Applications
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
 
AllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortems
 
Linux Container Basics
Linux Container BasicsLinux Container Basics
Linux Container Basics
 
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
 
What the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortemsWhat the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortems
 
PyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python ApplicationsPyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python Applications
 
Helping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
 
Building Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSFBuilding Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSF
 
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
 
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
 
SRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREsSRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREs
 
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleVelocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
 

CouchbasetoHadoop_Matt_Michael_Justin v4

  • 1. Couchbase to Hadoop at Linkedin Kafka is Enabling the Big Data Pipeline
  • 2. • Define Problem Domain Justin Michaels | Solution Architect, Couchbase • Use case at LinkedIn Michael Kehoe | Site Reliability Engineer, Linkedin • Supporting Technology Overview and Demo Matt Ingenthron | Senior Director, Couchbase • Q&A Agenda 2
  • 3.
  • 5. Lambda Architecture 5 Interactive and Real Time Applications 1 2 3 4 5 DATA BATCH SPEED SERVE QUER YHADOOP COUCHBASE STORM COUCHBASEBroker Cluster Spout for Topic Kafka Producers Ordered Subscriptions
  • 6. • Hadoop … an open-source framework written for distributed storage and distributed processing of very large data sets on commodity hardware • Kafka … append only write-ahead log that records messages to a persistent store and allows subscribers to read and apply these changes to their own stores in an appropriate time-frame • Storm … distributed framework that uses custom created "spouts" and "bolts" to define information sources and manipulations for processing of streaming data • Couchbase … an open source, distributed NoSQL document- oriented database that is optimized for interactive applications with an integrated data cache and incremental map reduce facility 6
  • 7.
  • 10. Use Case at Linkedin 10
  • 11. • Site Reliability Engineer (SRE) at LinkedIn • SRE for Profile & Higher-Education • Member of LinkedIn’s CBVT • B.E. (Electrical Engineering) from the University of Queensland, Australia Michael Kehoe
  • 12. • Kafka was created by LinkedIn • Kafka is a publish-subcribe system built as a distributed commit log • Processes 500+ TB/ day (~500 billion messages) @ LinkedIn Kafka @ LinkedIn
  • 13. • Monitoring • InGraphs • Traditional Messaging (Pub-Sub) • Analytics • Who Viewed my Profile • Experiment reports • Executive reports • Building block for (log) distributibuted applications • Pinot • Espresso LinkedIn’s uses of Kafka
  • 14. Use Case: Kafka to Hadoop (Analytics) • LinkedIn tracks data to better understand how members use our products • Information such as which page got viewed and which content got clicked on are sent into a Kafka cluster in each data center • Some of these events are all centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
  • 15. Couchbase @ LinkedIn • About 25 separate services with one or more clusters in multiple data centers • Up to 100 servers in a cluster • Single and Multi-tenant clusters
  • 16. Use Case: Jobs Cluster • Read scaling, Couchbase ~80k QPS, 24 server cluster(s) • Hadoop to pre-build data by partition • Couchbase 99 percentile latencies
  • 17. Hadoop to Couchbase • Our primary use-case for Hadoop  Couchbase is for building (warming) / recovering Couchbase buckets • LinkedIn built it’s own in-house solution to work with our ETL processes, cache invalidation procedures etc

Editor's Notes

  1. Note: Remove the logos from the animation and speed up build. Distributed users communities relying on interactive applications require systems to be distributed. As a result data is created in a variety of forms and places … the Polyglot Persistence … as the complexity of problems to be solved in creases applications demand a variety of development environments for tackling different problems. These complex, real-time applications combine different problems. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
  2. Users and consumers of information increasingly demand an always on, low latency access to their data. As well as providing a framework for businesses to understand what’s happening in real time while addressing Polyglot Persistence in managing data. The conceptual framework Lambda Architecture evolved out of Twitter and coined by Nathan Marz for a generic data processing architecture. In a way the architecture is an extended event sourced system but aims to accommodate streaming data at large scale. 1. All data entering the system is dispatched to both the batch layer and the speed layer for processing. 2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views. 3. The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way. 4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. 5. Any incoming query can be answered by merging results from batch views and real-time views.
  3. Hadoop is engineered for storage and analysis. It can store petabytes of data, and if can be deployed to thousands of servers. It started with map / reduce. It added Hive. Today, we see efforts like Impala and Drill along with Hortonworks Stinger Initiative and Tez. Some of the Hadoop distributions are bundling Storm and / or Spark. The analytical capabilities of Hadoop are continuing to evolve and improve. However, it’s not well suited to operational workloads. It’s not intended to serve as a backend for enterprise application, mobile or web. It’s not intended to provide interactive data access.
  4. Note: Remove the logos from the animation and speed up build. Distributed users communities relying on interactive applications require systems to be distributed. As a result data is created in a variety of forms and places … the Polyglot Persistence … as the complexity of problems to be solved in creases applications demand a variety of development environments for tackling different problems. These complex, real-time applications combine different problems. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
  5. The data generated by users is published to Apache Kafka. Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop. Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  6. The data is first collected by tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is access by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop. It’s combine with additional data and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data as well as meet both operation and analytical requirements. They relied on data capture, stream processing, NoSQL and Hadoop to do so.