SlideShare a Scribd company logo
1 of 30
FROM
KAFKA
TO
BigQuery
Ofir Sharony
A Guide For Delivering Billions of Daily Events
Why are we here?
As data engineers, our goal is building GREAT data pipelines.
Delivering data to analysis reliably & efficiently with the ability to scale.
This session is based this post from MyHeritage tech blog:
https://medium.com/myheritage-engineering
Considerations for building data pipelines
● End2End latency
● Data structure
● Data partitioning
● Reprocessing old data
Discover, preserve &
share your
FAMILY HISTORY
Case Study
MyHeritage data pipeline
Reveal the secrets
from your
DNA
100,000,000+
Registered Users
250,000,000+
Web Requests Daily
Billions
Change-data-capture events
User behavior and A/B testing
High level architecture
Database
Database
Database
Change-data-capture events
Requests from multiple sources
User Activities
Google
BigQuery
Apache
Kafka
Apache Kafka in a nutshell
• Distributed streaming platform
• Highly scalable
• Fault tolerant
• Topics and partitions
Producer
Producer
Producer
Consumer
Consumer
Consumer
Kafka Broker
Partition
Partition
Partition
Topic
Google BigQuery in a nutshell
• Cloud data-warehouse
• Pay as you go
• Data sources
• Streaming support
• Daily partitioning feature. Partitioning options:
• By processing time - when the event was observed
• By event time - when the event actually occurred
Our Data Delivery Journey
Take 1 - Batch loading to Google Cloud Storage
• Two steps for loading
• Batch loading to GCS with Secor
• Loading from GCS to BigQuery
Batch loading to Google Cloud Storage
Secor
Google Cloud
Storage
Loading
Script
bq load dataset.table$20171201
gs://bucket/topic/dt=2017–12–01/*
• Partitioning options
• By ingestion date (processing time)
• By record timestamp (event time)
• Concurrency and scaling
• No extra cost for streaming
• Data formats
• End2End latency
Tradeoffs - Batch loading with Secor
Take 2 - Streaming data via BigQuery API
Streaming data via BigQuery API
• No preliminary step
• Comes with a cost and quota limitation
DIY Kafka
Consumer
Streaming
Framework
BigQuery
Streaming
API
Kafka Streams streaming framework
• Streaming applications on top of Kafka
• Simple Java application
• Auto-replicated application state
• Supports both processing time and event time semantics
Basic Kafka Streams application
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "host:port");
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "AppId");
KStreamBuilder builder = new KStreamBuilder();
builder.stream(Serdes.String(), Serdes.String(), "my_topic")
.filter((key, value) -> value.equals("Please, take me to BigQuery"))
.foreach((key, value) -> deliverDataToBigQuery(key, value));
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
}
• Low latency
• Complete control in your hands
• You have to design a robust service
• Handle streaming errors, record batching, etc
Tradeoffs - Streaming with BigQuery API
Take 3 - Streaming data with Kafka Connect
Streaming with Kafka Connect
• Deliver data in and out of Kafka
• Supports many connectors out of the box
• Simple connector creation - we used WePay’s BigQuery connector
• You get redundancy and scaling for free
Kafka
Connector
Kafka
Connector
Kafka
Connector
• Plug and play solution
• Leverages Kafka for scaling and auto-failover
• Auto creation of BigQuery tables
• Partition by processing time
• Process data only from Kafka
• Topic to table mapping
Tradeoffs - Streaming with Kafka Connect
Take 4 - Streaming data with Apache Beam
Streaming data with Apache Beam
• Unified model for batch and streaming
• Portable platform
• Managed Cloud Dataflow runner
Cloud
Dataflow
Worker
Cloud
Dataflow
Worker
Cloud
Dataflow
Worker
Apache
Beam
Streaming to BigQuery with Beam
String getDestination(ValueInSingleWindow<PageView> element) {
String table = element.getValue().isBot() ? "bot_pageviews" : "user_pageviews";
Long eventTimestamp = element.getValue().getEventTimestamp();
String partition = new SimpleDateFormat("yyyyMMdd")
.format(new Date(eventTimestamp));
return String.format("%s:%s.%s$%s", project, dataset, table, partition);
}
Tradeoffs - Streaming with Beam
• Unified model for multiple runners
• Dataflow runner advantages
• Can process data from different sources
• Dynamic destinations
• Extra cost of running managed workers
• Not a part of Kafka ecosystem
• New kid in the block
Batch vs. Stream loading summary
Q&A ?
ofir.sharony@myheritage.com

More Related Content

What's hot

Solving Hybrid Cloud Data Replication with Apache Cassandra
Solving Hybrid Cloud Data Replication with Apache CassandraSolving Hybrid Cloud Data Replication with Apache Cassandra
Solving Hybrid Cloud Data Replication with Apache CassandraAaron Ploetz
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...HostedbyConfluent
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...HostedbyConfluent
 
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, NetflixGoing from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, NetflixHostedbyConfluent
 
Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...
Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...
Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...HostedbyConfluent
 
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of KafkaKafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafkaconfluent
 
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...HostedbyConfluent
 
Serverless Architectures with AWS Lambda and MongoDB Atlas by Sig Narvaez
Serverless Architectures with AWS Lambda and MongoDB Atlas by Sig NarvaezServerless Architectures with AWS Lambda and MongoDB Atlas by Sig Narvaez
Serverless Architectures with AWS Lambda and MongoDB Atlas by Sig NarvaezData Con LA
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...HostedbyConfluent
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...HostedbyConfluent
 
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...HostedbyConfluent
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafkaconfluent
 
INTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINEINTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINESingleStore
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
 
Building Microservices with Apache Kafka by Colin McCabe
Building Microservices with Apache Kafka by Colin McCabeBuilding Microservices with Apache Kafka by Colin McCabe
Building Microservices with Apache Kafka by Colin McCabeData Con LA
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)Eva Tse
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumClement Demonchy
 
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...Michael Stack
 

What's hot (20)

Solving Hybrid Cloud Data Replication with Apache Cassandra
Solving Hybrid Cloud Data Replication with Apache CassandraSolving Hybrid Cloud Data Replication with Apache Cassandra
Solving Hybrid Cloud Data Replication with Apache Cassandra
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
 
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, NetflixGoing from three nines to four nines using Kafka | Tejas Chopra, Netflix
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix
 
Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...
Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...
Live Event Debugging With ksqlDB at Reddit | Hannah Hagen and Paul Kiernan, R...
 
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of KafkaKafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
Kafka Summit NYC 2017 - Venice: A Distributed Database on top of Kafka
 
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae...
 
Serverless Architectures with AWS Lambda and MongoDB Atlas by Sig Narvaez
Serverless Architectures with AWS Lambda and MongoDB Atlas by Sig NarvaezServerless Architectures with AWS Lambda and MongoDB Atlas by Sig Narvaez
Serverless Architectures with AWS Lambda and MongoDB Atlas by Sig Narvaez
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
 
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cl...
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafka
 
INTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINEINTRODUCING: CREATE PIPELINE
INTRODUCING: CREATE PIPELINE
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
 
Building Microservices with Apache Kafka by Colin McCabe
Building Microservices with Apache Kafka by Colin McCabeBuilding Microservices with Apache Kafka by Colin McCabe
Building Microservices with Apache Kafka by Colin McCabe
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debezium
 
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
 

Similar to From Kafka to BigQuery - Strata Singapore

Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakeDatabricks
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Logisland "Event Mining at scale"
Logisland "Event Mining at scale"Logisland "Event Mining at scale"
Logisland "Event Mining at scale"Thomas Bailet
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT_MTL
 
Avoiding Common Pitfalls: Spark Structured Streaming with Kafka
Avoiding Common Pitfalls: Spark Structured Streaming with KafkaAvoiding Common Pitfalls: Spark Structured Streaming with Kafka
Avoiding Common Pitfalls: Spark Structured Streaming with KafkaHostedbyConfluent
 
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...Jamie Kinney
 
AWS as platform for scalable applications
AWS as platform for scalable applicationsAWS as platform for scalable applications
AWS as platform for scalable applicationsRoman Gomolko
 
Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...
Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...
Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...HostedbyConfluent
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemDataWorks Summit
 
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...Lucas Jellema
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...Matt Stubbs
 
Microsoft Azure in der Praxis
Microsoft Azure in der PraxisMicrosoft Azure in der Praxis
Microsoft Azure in der PraxisYvette Teiken
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 

Similar to From Kafka to BigQuery - Strata Singapore (20)

Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
App fabric introduction
App fabric introductionApp fabric introduction
App fabric introduction
 
Logisland "Event Mining at scale"
Logisland "Event Mining at scale"Logisland "Event Mining at scale"
Logisland "Event Mining at scale"
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Avoiding Common Pitfalls: Spark Structured Streaming with Kafka
Avoiding Common Pitfalls: Spark Structured Streaming with KafkaAvoiding Common Pitfalls: Spark Structured Streaming with Kafka
Avoiding Common Pitfalls: Spark Structured Streaming with Kafka
 
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
 
AWS as platform for scalable applications
AWS as platform for scalable applicationsAWS as platform for scalable applications
AWS as platform for scalable applications
 
Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...
Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...
Improving the Reliability of Market Data Subscription Feeds With Ruchir Vani ...
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystem
 
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
Big Data LDN 2017: Look Ma, No Code! Building Streaming Data Pipelines With A...
 
Microsoft Azure in der Praxis
Microsoft Azure in der PraxisMicrosoft Azure in der Praxis
Microsoft Azure in der Praxis
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 

From Kafka to BigQuery - Strata Singapore

  • 1. FROM KAFKA TO BigQuery Ofir Sharony A Guide For Delivering Billions of Daily Events
  • 2. Why are we here? As data engineers, our goal is building GREAT data pipelines. Delivering data to analysis reliably & efficiently with the ability to scale. This session is based this post from MyHeritage tech blog: https://medium.com/myheritage-engineering
  • 3. Considerations for building data pipelines ● End2End latency ● Data structure ● Data partitioning ● Reprocessing old data
  • 4. Discover, preserve & share your FAMILY HISTORY Case Study MyHeritage data pipeline Reveal the secrets from your DNA
  • 5. 100,000,000+ Registered Users 250,000,000+ Web Requests Daily Billions Change-data-capture events
  • 6.
  • 7. User behavior and A/B testing
  • 8.
  • 9.
  • 10. High level architecture Database Database Database Change-data-capture events Requests from multiple sources User Activities Google BigQuery Apache Kafka
  • 11. Apache Kafka in a nutshell • Distributed streaming platform • Highly scalable • Fault tolerant • Topics and partitions Producer Producer Producer Consumer Consumer Consumer Kafka Broker Partition Partition Partition Topic
  • 12. Google BigQuery in a nutshell • Cloud data-warehouse • Pay as you go • Data sources • Streaming support • Daily partitioning feature. Partitioning options: • By processing time - when the event was observed • By event time - when the event actually occurred
  • 13. Our Data Delivery Journey
  • 14. Take 1 - Batch loading to Google Cloud Storage
  • 15. • Two steps for loading • Batch loading to GCS with Secor • Loading from GCS to BigQuery Batch loading to Google Cloud Storage Secor Google Cloud Storage Loading Script bq load dataset.table$20171201 gs://bucket/topic/dt=2017–12–01/*
  • 16. • Partitioning options • By ingestion date (processing time) • By record timestamp (event time) • Concurrency and scaling • No extra cost for streaming • Data formats • End2End latency Tradeoffs - Batch loading with Secor
  • 17. Take 2 - Streaming data via BigQuery API
  • 18. Streaming data via BigQuery API • No preliminary step • Comes with a cost and quota limitation DIY Kafka Consumer Streaming Framework BigQuery Streaming API
  • 19. Kafka Streams streaming framework • Streaming applications on top of Kafka • Simple Java application • Auto-replicated application state • Supports both processing time and event time semantics
  • 20. Basic Kafka Streams application public static void main(String[] args) { Properties config = new Properties(); config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "host:port"); config.put(StreamsConfig.APPLICATION_ID_CONFIG, "AppId"); KStreamBuilder builder = new KStreamBuilder(); builder.stream(Serdes.String(), Serdes.String(), "my_topic") .filter((key, value) -> value.equals("Please, take me to BigQuery")) .foreach((key, value) -> deliverDataToBigQuery(key, value)); KafkaStreams streams = new KafkaStreams(builder, config); streams.start(); }
  • 21. • Low latency • Complete control in your hands • You have to design a robust service • Handle streaming errors, record batching, etc Tradeoffs - Streaming with BigQuery API
  • 22. Take 3 - Streaming data with Kafka Connect
  • 23. Streaming with Kafka Connect • Deliver data in and out of Kafka • Supports many connectors out of the box • Simple connector creation - we used WePay’s BigQuery connector • You get redundancy and scaling for free Kafka Connector Kafka Connector Kafka Connector
  • 24. • Plug and play solution • Leverages Kafka for scaling and auto-failover • Auto creation of BigQuery tables • Partition by processing time • Process data only from Kafka • Topic to table mapping Tradeoffs - Streaming with Kafka Connect
  • 25. Take 4 - Streaming data with Apache Beam
  • 26. Streaming data with Apache Beam • Unified model for batch and streaming • Portable platform • Managed Cloud Dataflow runner Cloud Dataflow Worker Cloud Dataflow Worker Cloud Dataflow Worker Apache Beam
  • 27. Streaming to BigQuery with Beam String getDestination(ValueInSingleWindow<PageView> element) { String table = element.getValue().isBot() ? "bot_pageviews" : "user_pageviews"; Long eventTimestamp = element.getValue().getEventTimestamp(); String partition = new SimpleDateFormat("yyyyMMdd") .format(new Date(eventTimestamp)); return String.format("%s:%s.%s$%s", project, dataset, table, partition); }
  • 28. Tradeoffs - Streaming with Beam • Unified model for multiple runners • Dataflow runner advantages • Can process data from different sources • Dynamic destinations • Extra cost of running managed workers • Not a part of Kafka ecosystem • New kid in the block
  • 29. Batch vs. Stream loading summary