Introduction to Apache BookKeeper
April 2018

Introducing Apache BookKeeper
• Distributed storage service
• Developed at Yahoo
• Designed for low latency and scalability
• Architected for resiliency and data durability
• Users include: [company logos on slide, not captured in the text]

BookKeeper design goals
• Write and read streams of entries with very low latency (< 5 ms)
• Ensure that stored data is durable, consistent, and resilient
• Immediate access to data: stream or tail data as it is written
• Efficiently store and access both historic and real-time data

BookKeeper key capabilities
• Data consistency: simple, repeatable read consistency model
• Data durability: built-in replication and resiliency
• Performance: efficient distribution of load across cluster
• Flexibility: tunable write model to optimize balance
• Scalability: isolation of writes and reads for consistent performance

BookKeeper core concepts
• Entry (aka record): sequence of bytes that is the smallest unit of data storage and access
• Log
  • Ledger: append-only sequence of records
  • Stream: unbounded, infinite sequence of data records
[Diagram: numbered entries (1 2 3 ...) grouped into ledgers, with consecutive ledgers making up a stream]
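
The relationship between entries, ledgers, and streams can be sketched with a small in-memory model. This is only an illustration of the concepts on the slide, not the BookKeeper client API: the class names `Ledger` and `Stream`, the `roll` method, and the rule that a ledger is sealed when the stream rolls to a new one are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Ledger:
    """Hypothetical model of a ledger: an append-only sequence of entries."""
    ledger_id: int
    entries: List[bytes] = field(default_factory=list)
    closed: bool = False

    def add_entry(self, data: bytes) -> int:
        # Entries can only be appended, and only while the ledger is open.
        if self.closed:
            raise ValueError("cannot append to a closed ledger")
        self.entries.append(data)
        return len(self.entries) - 1  # position of the entry within this ledger

    def close(self) -> None:
        self.closed = True


@dataclass
class Stream:
    """Hypothetical model of a stream: an ordered list of ledgers."""
    ledgers: List[Ledger] = field(default_factory=list)

    def roll(self) -> Ledger:
        # Seal the current ledger (if any) and start a new one.
        if self.ledgers:
            self.ledgers[-1].close()
        ledger = Ledger(ledger_id=len(self.ledgers))
        self.ledgers.append(ledger)
        return ledger

    def append(self, data: bytes) -> None:
        # Always write to the newest ledger, creating the first one on demand.
        if not self.ledgers:
            self.roll()
        self.ledgers[-1].add_entry(data)

    def read_all(self) -> Iterator[bytes]:
        # Reading the stream means reading its ledgers in order.
        for ledger in self.ledgers:
            yield from ledger.entries
```

A stream can keep growing because new ledgers are added at its tail, while each individual ledger stays a bounded, append-only unit.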

BookKeeper architecture
• Bookies
  • Individual BookKeeper storage node
  • Bookies manage access to ledgers
  • Ledgers striped across bookies
• Interfaces
  • Ledger API: low-level API for direct interaction with ledgers
  • Distributed Log API: higher-level abstraction for storing and reading data
• Metadata
  • Stored in ZooKeeper cluster
  • Ledger and ensemble information
[Diagram: client interfaces (Ledger API, Log API), bookies holding ledgers, and metadata]

Data storage in BookKeeper
• Data stored in segments
• Segments striped across bookies
Logical view: Segment 1, Segment 2, Segment 3, Segment 4, ..., Segment n
Physical storage:
  Bookie 1: Segment 1, Segment 2, Segment n
  Bookie 2: Segment 1, Segment n, Segment 3
  Bookie 3: Segment 4, Segment 2, Segment 3
  Bookie 4: Segment 3, Segment 4, Segment n
  Bookie 5: Segment 1, Segment 2, Segment 4

Data storage in BookKeeper (continued)
• Data stored in segments
• Storage striped across bookies
• Segments replicated across cluster
[Diagram: same logical view and segment placement across Bookie 1 to Bookie 5 as on the previous slide]
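
Striping with replication can be sketched as a placement function. The round-robin rule below is one simple policy chosen for illustration; it does not reproduce the exact layout in the slide's diagram, and the function names `place_segments` and `segments_per_bookie` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def place_segments(num_segments: int, bookies: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign each segment to `replicas` distinct bookies, round-robin.

    Assumption for this sketch: segment i goes to the bookies at positions
    i, i+1, ..., i+replicas-1 (wrapping around the bookie list).
    """
    if not 1 <= replicas <= len(bookies):
        raise ValueError("replicas must be between 1 and the number of bookies")
    placement: Dict[int, List[str]] = {}
    for seg in range(num_segments):
        placement[seg] = [bookies[(seg + k) % len(bookies)] for k in range(replicas)]
    return placement


def segments_per_bookie(placement: Dict[int, List[str]]) -> Counter:
    """Count how many segment replicas each bookie stores."""
    load: Counter = Counter()
    for owners in placement.values():
        load.update(owners)
    return load


if __name__ == "__main__":
    bookies = [f"bookie-{i}" for i in range(1, 6)]
    layout = place_segments(5, bookies, replicas=3)
    for seg, owners in layout.items():
        print(f"segment {seg}: {owners}")
    print(segments_per_bookie(layout))
```

With five bookies and three replicas, every segment survives the loss of any two bookies, and the load is spread evenly, which is the property the slide's diagram also shows (each segment appears on three of the five bookies).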

Consistent, low latency performance
• Single bookie can serve and store thousands of ledgers
• Separation of write and read paths
  • Bookies use separate I/O paths for writes, tailing reads, and catch-up reads
  • Avoid read activity impact on write latency
• Entries sorted to allow for mostly sequential reads
[Diagram: a writer and a reader connected to Bookie 1 through Bookie 4, with the write quorum and ACK quorum marked]
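
The write quorum and ACK quorum labels in the diagram can be illustrated with a toy simulation: an entry is sent to every bookie in the write quorum, and the write counts as successful once at least the ACK quorum of them has stored it. The `Bookie` class, its `healthy` flag, and `quorum_write` are hypothetical names for this sketch, and the requests are issued sequentially here for simplicity rather than in parallel.

```python
from typing import Dict, List


class Bookie:
    """Hypothetical in-memory bookie that can be marked unhealthy."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.entries: Dict[int, bytes] = {}

    def store(self, entry_id: int, data: bytes) -> bool:
        # An unhealthy bookie never acknowledges the write.
        if not self.healthy:
            return False
        self.entries[entry_id] = data
        return True


def quorum_write(entry_id: int, data: bytes, write_set: List[Bookie], ack_quorum: int) -> bool:
    """Send the entry to every bookie in the write set.

    Returns True when at least `ack_quorum` bookies acknowledged it.
    """
    if not 1 <= ack_quorum <= len(write_set):
        raise ValueError("ack_quorum must be between 1 and the write quorum size")
    acks = sum(1 for bookie in write_set if bookie.store(entry_id, data))
    return acks >= ack_quorum


if __name__ == "__main__":
    write_set = [Bookie("bookie-1"), Bookie("bookie-2"), Bookie("bookie-3", healthy=False)]
    # Write quorum of 3, ACK quorum of 2: one slow or failed bookie is tolerated.
    print(quorum_write(0, b"hello", write_set, ack_quorum=2))
```

Choosing an ACK quorum smaller than the write quorum is one way to read the "tunable write model" from the capabilities slide: it trades the number of confirmed copies against sensitivity to a single slow bookie.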

Part of the Streamlio platform for fast data
[Diagram: platform stack]
• Interfaces: APIs, Libraries & Connectivity
• Layers: Real-time processing, Messaging & queuing, Stream storage (each marked "Powered by" with a logo not captured in the text)
• Other labels: Connectors, Client, Data Source, Storm, Kafka, Functional
• Management: Resource Management, Metadata, Security, Monitoring, Orchestration