CouchbasetoHadoop_Matt_Michael_Justin v4

  1. Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline
  2. Agenda
     • Define Problem Domain (Justin Michaels | Solution Architect, Couchbase)
     • Use Case at LinkedIn (Michael Kehoe | Site Reliability Engineer, LinkedIn)
     • Supporting Technology Overview and Demo (Matt Ingenthron | Senior Director, Couchbase)
     • Q&A
  3. Lambda Architecture (diagram: stages 1–5 across the DATA, BATCH, SPEED, SERVE, and QUERY layers)
  4. Lambda Architecture for Interactive and Real-Time Applications (diagram: the same five stages, with HADOOP as the batch layer, STORM as the speed layer, and COUCHBASE serving; Kafka producers feed a broker cluster with ordered subscriptions and a spout per topic)
  5. Definitions
     • Hadoop … an open-source framework for distributed storage and distributed processing of very large data sets on commodity hardware
     • Kafka … an append-only write-ahead log that records messages to a persistent store and allows subscribers to read and apply those changes to their own stores in an appropriate time frame
     • Storm … a distributed framework that uses custom-created "spouts" and "bolts" to define information sources and manipulations for processing streaming data
     • Couchbase … an open-source, distributed NoSQL document-oriented database optimized for interactive applications, with an integrated data cache and incremental map-reduce facility
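Kafka's append-only log model described above can be sketched in a few lines of Python. This is an illustrative stand-in, not LinkedIn's or Kafka's actual code: the topic is an append-only list of records, and each subscriber tracks its own offset so it can apply changes to its own store at its own pace.

```python
class CommitLog:
    """Minimal in-memory stand-in for one Kafka topic partition."""

    def __init__(self):
        self._records = []  # append-only: records are never mutated or removed

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Return all records at or after `offset`."""
        return self._records[offset:]


class Subscriber:
    """Each consumer keeps its own offset, so slow readers never block fast ones."""

    def __init__(self, log):
        self._log = log
        self._offset = 0
        self.store = []  # the subscriber's own materialized state

    def poll(self):
        batch = self._log.read_from(self._offset)
        self.store.extend(batch)
        self._offset += len(batch)
        return batch


log = CommitLog()
log.append({"event": "page_view", "page": "profile"})
log.append({"event": "click", "item": "job_posting"})

hadoop_loader = Subscriber(log)  # e.g. the batch pipeline
storm_spout = Subscriber(log)    # e.g. the real-time pipeline
hadoop_loader.poll()

log.append({"event": "page_view", "page": "jobs"})
hadoop_loader.poll()
storm_spout.poll()  # catches up independently from offset 0
```

Because every subscriber replays from its own offset, both readers converge on the same state regardless of when they poll.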
  6. (diagram: a batch track and a real-time track; complex event processing feeds a real-time repository, a perpetual store, and an analytical DB, which drive business intelligence, monitoring, dashboards, and chat/voice systems)
  7. (diagram: a tracking and collection tier feeding filter, metrics, and REST components for analysis and visualization)
  8. Use Case at LinkedIn
  9. Michael Kehoe
     • Site Reliability Engineer (SRE) at LinkedIn
     • SRE for Profile & Higher-Education
     • Member of LinkedIn’s CBVT
     • B.E. (Electrical Engineering) from the University of Queensland, Australia
  10. Kafka @ LinkedIn
     • Kafka was created by LinkedIn
     • Kafka is a publish-subscribe system built as a distributed commit log
     • Processes 500+ TB/day (~500 billion messages) at LinkedIn
  11. LinkedIn’s uses of Kafka
     • Monitoring: InGraphs
     • Traditional messaging (pub-sub)
     • Analytics: Who Viewed My Profile, experiment reports, executive reports
     • Building block for distributed (log-based) applications: Pinot, Espresso
  12. Use Case: Kafka to Hadoop (Analytics)
     • LinkedIn tracks data to better understand how members use our products
     • Events such as which page was viewed and which content was clicked on are sent to a Kafka cluster in each data center
     • Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
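The daily-report step above amounts to a batch aggregation over tracked events. A toy sketch (the event shapes here are hypothetical; the real pipeline runs as Hadoop jobs over centrally collected Kafka topics):

```python
from collections import Counter

# Hypothetical tracking events as they might arrive from a Kafka topic.
events = [
    {"type": "page_view", "page": "/profile"},
    {"type": "page_view", "page": "/jobs"},
    {"type": "click", "item": "job-123"},
    {"type": "page_view", "page": "/profile"},
]


def daily_report(events):
    """Count page views per page, as a daily batch report job might."""
    return Counter(e["page"] for e in events if e["type"] == "page_view")


report = daily_report(events)
```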
  13. Couchbase @ LinkedIn
     • About 25 separate services, with one or more clusters in multiple data centers
     • Up to 100 servers in a cluster
     • Single- and multi-tenant clusters
  14. Use Case: Jobs Cluster
     • Read scaling: Couchbase serves ~80k QPS from 24-server cluster(s)
     • Hadoop pre-builds the data by partition
     • Couchbase meets 99th-percentile latency targets
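Pre-building data by partition depends on knowing which partition (vBucket) each key maps to. A sketch of the CRC32-based mapping commonly documented for Couchbase SDKs, assuming the default 1,024 vBuckets (treat the exact bit arithmetic as an assumption to verify against your SDK):

```python
import zlib
from collections import defaultdict

NUM_VBUCKETS = 1024  # Couchbase's default vBucket count


def vbucket_id(key: bytes, num_vbuckets: int = NUM_VBUCKETS) -> int:
    # CRC32 of the key; SDKs commonly use the upper bits of the checksum.
    return ((zlib.crc32(key) >> 16) & 0x7FFF) % num_vbuckets


def group_by_vbucket(keys):
    """Bucket keys by partition, so an offline job can emit one file per vBucket."""
    groups = defaultdict(list)
    for k in keys:
        groups[vbucket_id(k)].append(k)
    return groups


groups = group_by_vbucket([b"job::1", b"job::2", b"member::42"])
```

Grouping offline output by vBucket lets a loader stream each partition's data to the node that owns it, rather than scattering writes across the cluster.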
  15. Hadoop to Couchbase
     • Our primary use case for Hadoop → Couchbase is building (warming) / recovering Couchbase buckets
     • LinkedIn built its own in-house solution to work with our ETL processes, cache-invalidation procedures, etc.
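A bucket-warming loader along these lines might look like the following sketch. A plain dict stands in for the Couchbase bucket, and `warm_bucket` and the record shapes are hypothetical; a real loader would call the SDK's upsert and handle retries:

```python
# Records pre-built offline (e.g. by a Hadoop job) to warm or recover a bucket.
prebuilt_records = {
    "member::1": {"name": "A", "views": 10},
    "member::2": {"name": "B", "views": 7},
    "member::3": {"name": "C", "views": 3},
}


def warm_bucket(bucket, records, batch_size=2):
    """Upsert pre-built records in fixed-size batches; return the count written."""
    written = 0
    items = list(records.items())
    for i in range(0, len(items), batch_size):
        for key, doc in items[i:i + batch_size]:
            bucket[key] = doc  # stand-in for bucket.upsert(key, doc)
            written += 1
    return written


bucket = {}
count = warm_bucket(bucket, prebuilt_records)
```

Batching matters in practice because a warming load can be millions of documents; the batch size is where a real loader would apply throttling so rebuild traffic does not starve live reads.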

Editor's Notes

  • Note: Remove the logos from the animation and speed up the build.
    Distributed user communities relying on interactive applications require distributed systems. As a result, data is created in a variety of forms and places, and as the complexity of the problems to be solved increases, applications demand a variety of development environments for tackling different problems. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
  • Users and consumers of information increasingly demand always-on, low-latency access to their data, while businesses need a framework for understanding what is happening in real time and for managing Polyglot Persistence. The Lambda Architecture, a conceptual framework for generic data processing coined by Nathan Marz, evolved out of work at Twitter. In a way, the architecture is an extended event-sourced system, but it aims to accommodate streaming data at large scale.

    1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
    2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
    3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
    4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
    5. Any incoming query can be answered by merging results from batch views and real-time views.
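Steps 1–5 above can be condensed into a toy query-time merge (illustrative numbers only, not real metrics):

```python
# Toy lambda-architecture merge: the batch view is pre-computed over the
# master dataset, the real-time view covers only events since the last
# batch run, and a query merges the two (step 5).
batch_view = {"profile": 100, "jobs": 40}  # batch layer: stale but complete
realtime_view = {"profile": 3, "feed": 2}  # speed layer: fresh but partial


def query(page):
    """Answer a page-view-count query by merging batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)
```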
  • Hadoop is engineered for storage and analysis.
    It can store petabytes of data, and it can be deployed to thousands of servers. It started with map/reduce, then added Hive. Today, we see efforts like Impala and Drill along with Hortonworks’ Stinger Initiative and Tez. Some Hadoop distributions bundle Storm and/or Spark. The analytical capabilities of Hadoop continue to evolve and improve. However, it is not well suited to operational workloads: it is not intended to serve as a backend for enterprise, mobile, or web applications, or to provide interactive data access.

  • The data generated by users is published to Apache Kafka.
    Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop.
    Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  • The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop, where it is combined with additional data, and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data and to meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL, and Hadoop to do so.