SlideShare a Scribd company logo
1 of 33
Download to read offline
Data Product at Airbnb
LIYIN TANG & JINGWEI LU
Data Infrastructure at Airbnb
Event
Logs
MySQL
Dumps
Gold Cluster
HDFS
Hive
Kafk
Sqoo
Silver Cluster Spark Cluster
Spark
ReAi
Airflow Scheduling
S3
Presto Cluster
AirPal
SuperSet
Tableau
Batch Infrastructure
Yarn HDFS
Hive
Yarn
Liyin Tang and Jingwei Lu
3
Streaming at Airbnb
Liyin Tang and Jingwei Lu
4
Cluster
Spark Streaming
Airflow Scheduling
HBase
Sources
Kafka
S3
HDFS
…
Sinks
Datadog
Kafka
Dynamo
DB
Elastic
Search
…
Lambda Architecture
Batch
AirStream
Hive
Spark SQL
Lambda Architecture
Liyin Tang and Jingwei Lu
6
Streaming
Kafka
Spark Streaming
HBase
Liyin Tang and Jingwei Lu
Our Foundations
• Combine stream with batch
• Shared global state store
7
Liyin Tang and Jingwei Lu
Unified API through AirStream
• Declarative job configuration
• Computation operator or sink can be shared by stream and
batch job.
• Stream source vs static source
• Single driver execute stream/batch mode job
8
AirStream
Liyin Tang and Jingwei Lu
9
HBase Tables
Spark StreamingSpark StreamingSpark StreamingSpark Streaming
Spark BatchSpark BatchSpark BatchSpark Batch
Shared Global State Store
Shared Global State Store
Liyin Tang and Jingwei Lu
10
DataFrame
HBase
Region 1
Region 2
Region N
Re-partition
<Region 1, [RowKey,
Value]>
<Region 2, [RowKey,
Value]>
<Region N, [RowKey,
Value]>
… …
Puts
HFile
BulkLoad
Shared Global State Store
Liyin Tang and Jingwei Lu
11
HBase Tables
Spark Streaming/Batch Jobs
Multi-Gets Prefix Scan Time Range Scan
Why HBase ?
• Rich API for point-lookups and sequential scan
(TimeRange, TTL, Prefix Scan …)
• Merged view based on version
• Unified API for streaming writes and bulk uploads
• Unified API for reading from live table and snapshot table
13
Why HBase
Streaming Computation
Merged Storage
Liyin Tang and Jingwei Lu
15
Row Key
R1 V100 V100
R1 V99 V99
R1 V01 V01
… … … …
Time
Streaming Writes
Streaming Writes
Streaming Writes
Merged Storage
Liyin Tang and Jingwei Lu
16
Row Key
R1 V100 V100
R1 V99 V99
R1 V01 V01
Time
Streaming Writes
Streaming Writes
Streaming Writes
R1 V100 100
Batch Bulk Upload
Distinct Count
Liyin Tang and Jingwei Lu
17
Row Key
Prefix _ R1 V102 V102
Prefix _ R2 V101 V101
Prefix _ R4 V01 V01
Prefix _ R3 V100 100
Prefix Scan with
TimeRange
Prefix Scan with
TimeRange
Time
Moving Average
Liyin Tang and Jingwei Lu
18
Row Key
R1 V102 102
R1 V101 101
R1 V01 V01
R1 V100 100
Count Difference/
Time Elapsed
Count Difference/
Time Elapsed
Time
… … …
… … …
Window 1
Window 2
Long Window Computation
Liyin Tang and Jingwei Lu
19
Spark HBase Connector
Liyin Tang and Jingwei Lu
20
Zeppelin
Spark
HBase
Connector
HBase
Presto HBase Connector
Schema
Mapping
Split -> RS
Mapping
Presto - HBase Connector
Liyin Tang and Jingwei Lu
21
Presto HBase
Hive - HBase Connector
Liyin Tang and Jingwei Lu
22
Hive HBase
Hive HBase Connector
Table
InputFormat
Snapshot
InputFormat
Use Cases
Mysql DB Snapshot Using
Binlog Replay
• Large amount of data: Multiple large mysql DBs
• Realtime-ness: minutes delay/ hours delay
• Transaction : Need to keep transaction cross different tables
• Schema change: Table schema evolves
Database Snapshot
25
Move Elephant
26
20+ hr 4+ hr
Airstream Job
5 mins
15
1 hr
Binlog Replay on Spark
• Streaming and Batch shares Logic:
Binlog file reader, DDL processor,
transaction processor, DML
processor.
• Idempotent: Log can be replayed
multiple times.
• Schema changes: Full schema
change history.
27
Log Parser
xvid
DML
DDL
HBASE
Architecture
Binlog
Realtime Ingestion &
Interactive Query
Realtime Ingestion and Interactive Query
Liyin Tang and Jingwei Lu
29
HBase
AirStream
Spark
Streaming
Kafka
Query
Engine
Data
Portal
Spark SQL
Hive SQL
Presto SQL
Interactive Query in Superset
30
Summary
Unify Batch and
Streaming Computation
32
33
We are hiring

More Related Content

What's hot

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
 
Streaming Analytics & CEP - Two sides of the same coin?
Streaming Analytics & CEP - Two sides of the same coin?Streaming Analytics & CEP - Two sides of the same coin?
Streaming Analytics & CEP - Two sides of the same coin?Till Rohrmann
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
 
uReplicator: Uber Engineering’s Scalable, Robust Kafka Replicator
uReplicator: Uber Engineering’s Scalable,  Robust Kafka ReplicatoruReplicator: Uber Engineering’s Scalable,  Robust Kafka Replicator
uReplicator: Uber Engineering’s Scalable, Robust Kafka ReplicatorMichael Hongliang Xu
 
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud" Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud" Flink Forward
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeTao Feng
 
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentTemporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentHostedbyConfluent
 
KSQL: Streaming SQL for Kafka
KSQL: Streaming SQL for KafkaKSQL: Streaming SQL for Kafka
KSQL: Streaming SQL for Kafkaconfluent
 
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VRKafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VRconfluent
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormRan Silberman
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per DayAnkur Bansal
 
Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017Jacob Maes
 
Stream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka StreamsStream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka StreamsTim Ysewyn
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzaAbhishek Shivanna
 
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick EvansRealtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick EvansSpark Summit
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkFlink Forward
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafkaconfluent
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...Flink Forward
 

What's hot (20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
 
Streaming Analytics & CEP - Two sides of the same coin?
Streaming Analytics & CEP - Two sides of the same coin?Streaming Analytics & CEP - Two sides of the same coin?
Streaming Analytics & CEP - Two sides of the same coin?
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015 ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015
 
uReplicator: Uber Engineering’s Scalable, Robust Kafka Replicator
uReplicator: Uber Engineering’s Scalable,  Robust Kafka ReplicatoruReplicator: Uber Engineering’s Scalable,  Robust Kafka Replicator
uReplicator: Uber Engineering’s Scalable, Robust Kafka Replicator
 
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud" Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
 
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentTemporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
 
KSQL: Streaming SQL for Kafka
KSQL: Streaming SQL for KafkaKSQL: Streaming SQL for Kafka
KSQL: Streaming SQL for Kafka
 
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VRKafka Summit NYC 2017 Hanging Out with Your Past Self in VR
Kafka Summit NYC 2017 Hanging Out with Your Past Self in VR
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
 
Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017
 
Stream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka StreamsStream Processing Live Traffic Data with Kafka Streams
Stream Processing Live Traffic Data with Kafka Streams
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samza
 
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick EvansRealtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
 

Similar to HBaseCon2017 Data Product at AirBnB

Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a ServiceSteven Wu
 
Large scale stream processing with Apache Flink
Large scale stream processing with Apache FlinkLarge scale stream processing with Apache Flink
Large scale stream processing with Apache FlinkNikolay Stoitsev
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureGyula Fóra
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Khai Tran
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Stephan Ewen
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
Spline: Data Lineage For Spark Structured Streaming
Spline: Data Lineage For Spark Structured StreamingSpline: Data Lineage For Spark Structured Streaming
Spline: Data Lineage For Spark Structured StreamingVaclav Kosar
 
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & GeodePivotalOpenSourceHub
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Robert Metzger
 
How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...
How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...
How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...Capgemini
 
Streams and serverless at DAZN
Streams and serverless at DAZNStreams and serverless at DAZN
Streams and serverless at DAZNYan Cui
 
When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022HostedbyConfluent
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache FlinkAljoscha Krettek
 
What can you do with lambda in 2020
What can you do with lambda in 2020What can you do with lambda in 2020
What can you do with lambda in 2020Yan Cui
 

Similar to HBaseCon2017 Data Product at AirBnB (20)

Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Building Stream Processing as a Service
Building Stream Processing as a ServiceBuilding Stream Processing as a Service
Building Stream Processing as a Service
 
Large scale stream processing with Apache Flink
Large scale stream processing with Apache FlinkLarge scale stream processing with Apache Flink
Large scale stream processing with Apache Flink
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Spline: Data Lineage For Spark Structured Streaming
Spline: Data Lineage For Spark Structured StreamingSpline: Data Lineage For Spark Structured Streaming
Spline: Data Lineage For Spark Structured Streaming
 
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode
 
Zurich Flink Meetup
Zurich Flink MeetupZurich Flink Meetup
Zurich Flink Meetup
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...
How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...
How did we move one of the world’s largest SAP BW HANA landscape to Microsoft...
 
Streams and serverless at DAZN
Streams and serverless at DAZNStreams and serverless at DAZN
Streams and serverless at DAZN
 
When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache Flink
 
What can you do with lambda in 2020
What can you do with lambda in 2020What can you do with lambda in 2020
What can you do with lambda in 2020
 

More from HBaseCon

hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on BeamHBaseCon
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at HuaweiHBaseCon
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in PinterestHBaseCon
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at NeteaseHBaseCon
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践HBaseCon
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台HBaseCon
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comHBaseCon
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architectureHBaseCon
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at HuaweiHBaseCon
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMiHBaseCon
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0HBaseCon
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon
 
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...HBaseCon
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon
 

More from HBaseCon (20)

hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
 
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
HBaseCon2017 Spark HBase Connector: Feature Rich and Efficient Access to HBas...
 
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBaseHBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
HBaseCon2017 Efficient and portable data processing with Apache Beam and HBase
 

Recently uploaded

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Recently uploaded (20)

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

HBaseCon2017 Data Product at AirBnB

  • 1. Data Product at Airbnb LIYIN TANG & JINGWEI LU
  • 3. Event Logs MySQL Dumps Gold Cluster HDFS Hive Kafk Sqoo Silver Cluster Spark Cluster Spark ReAi Airflow Scheduling S3 Presto Cluster AirPal SuperSet Tableau Batch Infrastructure Yarn HDFS Hive Yarn Liyin Tang and Jingwei Lu 3
  • 4. Streaming at Airbnb Liyin Tang and Jingwei Lu 4 Cluster Spark Streaming Airflow Scheduling HBase Sources Kafka S3 HDFS … Sinks Datadog Kafka Dynamo DB Elastic Search …
  • 6. Batch AirStream Hive Spark SQL Lambda Architecture Liyin Tang and Jingwei Lu 6 Streaming Kafka Spark Streaming HBase
  • 7. Liyin Tang and Jingwei Lu Our Foundations • Combine stream with batch • Shared global state store 7
  • 8. Liyin Tang and Jingwei Lu Unified API through AirStream • Declarative job configuration • Computation operator or sink can be shared by stream and batch job. • Stream source vs static source • Single driver execute stream/batch mode job 8
  • 9. AirStream Liyin Tang and Jingwei Lu 9 HBase Tables Spark StreamingSpark StreamingSpark StreamingSpark Streaming Spark BatchSpark BatchSpark BatchSpark Batch Shared Global State Store
  • 10. Shared Global State Store Liyin Tang and Jingwei Lu 10 DataFrame HBase Region 1 Region 2 Region N Re-partition <Region 1, [RowKey, Value]> <Region 2, [RowKey, Value]> <Region N, [RowKey, Value]> … … Puts HFile BulkLoad
  • 11. Shared Global State Store Liyin Tang and Jingwei Lu 11 HBase Tables Spark Streaming/Batch Jobs Multi-Gets Prefix Scan Time Range Scan
  • 13. • Rich API for point-lookups and sequential scan (TimeRange, TTL, Prefix Scan …) • Merged view based on version • Unified API for streaming writes and bulk uploads • Unified API for reading from live table and snapshot table 13 Why HBase
  • 15. Merged Storage Liyin Tang and Jingwei Lu 15 Row Key R1 V100 V100 R1 V99 V99 R1 V01 V01 … … … … Time Streaming Writes Streaming Writes Streaming Writes
  • 16. Merged Storage Liyin Tang and Jingwei Lu 16 Row Key R1 V100 V100 R1 V99 V99 R1 V01 V01 Time Streaming Writes Streaming Writes Streaming Writes R1 V100 100 Batch Bulk Upload
  • 17. Distinct Count Liyin Tang and Jingwei Lu 17 Row Key Prefix _ R1 V102 V102 Prefix _ R2 V101 V101 Prefix _ R4 V01 V01 Prefix _ R3 V100 100 Prefix Scan with TimeRange Prefix Scan with TimeRange Time
  • 18. Moving Average Liyin Tang and Jingwei Lu 18 Row Key R1 V102 102 R1 V101 101 R1 V01 V01 R1 V100 100 Count Difference/ Time Elapsed Count Difference/ Time Elapsed Time … … … … … … Window 1 Window 2
  • 19. Long Window Computation Liyin Tang and Jingwei Lu 19
  • 20. Spark HBase Connector Liyin Tang and Jingwei Lu 20 Zeppelin Spark HBase Connector HBase
  • 21. Presto HBase Connector Schema Mapping Split -> RS Mapping Presto - HBase Connector Liyin Tang and Jingwei Lu 21 Presto HBase
  • 22. Hive - HBase Connector Liyin Tang and Jingwei Lu 22 Hive HBase Hive HBase Connector Table InputFormat Snapshot InputFormat
  • 24. Mysql DB Snapshot Using Binlog Replay
  • 25. • Large amount of data: Multiple large mysql DBs • Realtime-ness: minutes delay/ hours delay • Transaction : Need to keep transaction cross different tables • Schema change: Table schema evolves Database Snapshot 25 Move Elephant
  • 26. 26 20+ hr 4+ hr Airstream Job 5 mins 15 1 hr Binlog Replay on Spark
  • 27. • Streaming and Batch shares Logic: Binlog file reader, DDL processor, transaction processor, DML processor. • Idempotent: Log can be replayed multiple times. • Schema changes: Full schema change history. 27 Log Parser xvid DML DDL HBASE Architecture Binlog
  • 29. Realtime Ingestion and Interactive Query Liyin Tang and Jingwei Lu 29 HBase AirStream Spark Streaming Kafka Query Engine Data Portal Spark SQL Hive SQL Presto SQL
  • 30. Interactive Query in Superset 30
  • 32. Unify Batch and Streaming Computation 32