SlideShare a Scribd company logo
1 of 44
Download to read offline
1
Stream All Things
Patterns of Modern Data Integration
2
About Me
• Working @confluent
• Wrote a book or two
• Product manager
• Apache Kafka PMC member
• Used to be an engineer, consultant, DBA
• @gwenshap
• Github.com/gwenshap
3
Evolution of Data Integration
Data
Warehouse
Data Lake Data Stream
• Difficult modeling
• Difficult ingest
• Missing data
• Limited throughput
• Batch only
• Still batch only
• Data quality is
suspect
• Data model left as
exercise for each
developer
• Where is my data?
• Kept the scale
• Reduced latency
• Totally new model
• Data producers are 1st
class participants
• Microservices? Apps?
• Still figuring the whole
thing out
4
Evolution of Data Integration
DWH Data Lake
Stream
Data
ETL
Engineers
Software
Engineers
5
Evolution of IT
2000
Software engineers
Data modelers
DBAs
Sysadmins
Storage admins
Network admins
ETL engineers
QA
….
2017
Software engineers
6
The line between
application development and ETL
is blurring.
7
8
Few Patterns
1. Stream all things (in one place)
2. Keep Compatible and Process On
3. Ridiculously Parallel Data Integration
4. Streaming Data Enrichment
9
Example: Large Hotel Chain
10
12
PageViewEvent
{
sessionId: 676fc8983gu563,
timestamp: 1413215458,
viewType: "propertyView",
propertyId: 7879,
loyaltyId: 6764532
origin: "promotion",
...... lots of metadata....
}
14
15
Pattern #1 – Put all streams in One Kafka
Pattern	
  #2
23
APIs between services are Contracts
In Stream Processing World – Event Schemas ARE the API
…except they stick around for a lot longer
25
Schema Registry
Elastic
Cassandra
Streams
App
Example Consumers
Serializer
App 1
Serializer
App 2
!
Kafka Topic!
Schema Registry
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)
Verify Schema Compatibility at every stage of dev cycle
Supports multi-datacenter environments
26
Pattern #3 – Ridiculously parallel data integration
{
sessionId: 676fc8983gu563,
timestamp: 1413215458,
viewType: "propertyView",
propertyId: 7879,
loyaltyId: 6764532,
origin: "promotion",
cc: “4444—3333-2222-1111”
}
{
sessionId: 676fc8983gu563,
timestamp: 1413215458,
viewType: "propertyView",
propertyId: 7879,
loyaltyId: 6764532,
origin: "promotion",
cc: “xxxx—xxxx-xxxx-1111”
}
27
Hipster Stream Processing
28
Is there a data store involved? KafkaConnect can help
29
Ecosystem of Connectors
Databases Datastore/File Store
Analytics Applications / Other
30
How Connect Works?
Log
Connector
MQTT
Connector
REST API
Logs MQTT
Log Task Log Task
MQTT
Task
MQTT
Task
31
Connect + Single Message Transformations
Mix:
Ridiculously parallel database integration
With
Ridiculously parallel simple transformations
32
Promotion!
We want to send Platinum members
Who are looking at beach properties
An email about
Discount package in new Hawaii hotel
33
Pattern #4 – Enriching events
From Event
sessionID
timestamp
loyaltyID
propertyID
To Enriched Event
sessionID
timestamp
loyaltyID
Level
isPlatium
propertyID
Location
isBeach
40
We use Kafka, Connect and CDC Connectors
To turn state (in DB)
Into stream of events (in Kafka)
And back into state (in application cache)
42
In reality…
• Maintaining the distributed cache isn’t trivial
• How do we persist state? Failover? Recover?
• How do we co-partition for joins?
KafkaStreams makes it easy:
Ktable loyaltyTbl = builder.table(…, “loyalty-cdc-topic”)
PageViewStream.leftJoin(loyaltyTbl, <operation>)
Example:
https://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-
using-kafka-streams/
Few	
  Patterns
1. Stream all things (in one place)
2. Keep Compatible and Process On
3. Ridiculously Parallel Single Message Transformations
4. Streaming Data Enrichment
44
Thank You!

More Related Content

What's hot

Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
Kai Wähner
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 

What's hot (20)

Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform MiddlewareApache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
 
Microservices Architecture - Bangkok 2018
Microservices Architecture - Bangkok 2018Microservices Architecture - Bangkok 2018
Microservices Architecture - Bangkok 2018
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
[Webinar] WSO2 Enterprise Integrator 7.1.0 Release
[Webinar] WSO2 Enterprise Integrator 7.1.0 Release[Webinar] WSO2 Enterprise Integrator 7.1.0 Release
[Webinar] WSO2 Enterprise Integrator 7.1.0 Release
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Application modernization patterns with apache kafka, debezium, and kubernete...
Application modernization patterns with apache kafka, debezium, and kubernete...Application modernization patterns with apache kafka, debezium, and kubernete...
Application modernization patterns with apache kafka, debezium, and kubernete...
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Migrating with Debezium
Migrating with DebeziumMigrating with Debezium
Migrating with Debezium
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 

Similar to Stream All Things—Patterns of Modern Data Integration with Gwen Shapira

SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018
SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018
SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018
Chester Chen
 
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Lviv Startup Club
 

Similar to Stream All Things—Patterns of Modern Data Integration with Gwen Shapira (20)

Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache Kafka
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
batbern43 Stream all Things: Patterns of Data Integration in Event Driven Sys...
batbern43 Stream all Things: Patterns of Data Integration in Event Driven Sys...batbern43 Stream all Things: Patterns of Data Integration in Event Driven Sys...
batbern43 Stream all Things: Patterns of Data Integration in Event Driven Sys...
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018
SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018
SF big Analytics : Stream all things by Gwen Shapira @ Lyft 2018
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache KafkaScylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
 
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...Тарас Кльоба  "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
Тарас Кльоба "ETL — вже не актуальна; тривалі живі потоки із системою Apache...
 
Leveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern AnalyticsLeveraging Mainframe Data for Modern Analytics
Leveraging Mainframe Data for Modern Analytics
 
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4jNeo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
Neo4j GraphTalks Oslo - Graph Your Business - Rik Van Bruggen, Neo4j
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
 
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 

Recently uploaded (20)

AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 

Stream All Things—Patterns of Modern Data Integration with Gwen Shapira

  • 1. 1 Stream All Things Patterns of Modern Data Integration
  • 2. 2 About Me • Working @confluent • Wrote a book or two • Product manager • Apache Kafka PMC member • Used to be an engineer, consultant, DBA • @gwenshap • Github.com/gwenshap
  • 3. 3 Evolution of Data Integration Data Warehouse Data Lake Data Stream • Difficult modeling • Difficult ingest • Missing data • Limited throughput • Batch only • Still batch only • Data quality is suspect • Data model left as exercise for each developer • Where is my data? • Kept the scale • Reduced latency • Totally new model • Data producers are 1st class participants • Microservices? Apps? • Still figuring the whole thing out
  • 4. 4 Evolution of Data Integration DWH Data Lake Stream Data ETL Engineers Software Engineers
  • 5. 5 Evolution of IT 2000 Software engineers Data modelers DBAs Sysadmins Storage admins Network admins ETL engineers QA …. 2017 Software engineers
  • 6. 6 The line between application development and ETL is blurring.
  • 7. 7
  • 8. 8 Few Patterns 1. Stream all things (in one place) 2. Keep Compatible and Process On 3. Ridiculously Parallel Data Integration 4. Streaming Data Enrichment
  • 10. 10
  • 11.
  • 12. 12 PageViewEvent { sessionId: 676fc8983gu563, timestamp: 1413215458, viewType: "propertyView", propertyId: 7879, loyaltyId: 6764532 origin: "promotion", ...... lots of metadata.... }
  • 13.
  • 14. 14
  • 15. 15 Pattern #1 – Put all streams in One Kafka
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 23. 23 APIs between services are Contracts In Stream Processing World – Event Schemas ARE the API …except they stick around for a lot longer
  • 24.
  • 25. 25 Schema Registry Elastic Cassandra Streams App Example Consumers Serializer App 1 Serializer App 2 ! Kafka Topic! Schema Registry Define the expected fields for each Kafka topic Automatically handle schema changes (e.g. new fields) Verify Schema Compatibility at every stage of dev cycle Supports multi-datacenter environments
  • 26. 26 Pattern #3 – Ridiculously parallel data integration { sessionId: 676fc8983gu563, timestamp: 1413215458, viewType: "propertyView", propertyId: 7879, loyaltyId: 6764532, origin: "promotion", cc: “4444—3333-2222-1111” } { sessionId: 676fc8983gu563, timestamp: 1413215458, viewType: "propertyView", propertyId: 7879, loyaltyId: 6764532, origin: "promotion", cc: “xxxx—xxxx-xxxx-1111” }
  • 28. 28 Is there a data store involved? KafkaConnect can help
  • 29. 29 Ecosystem of Connectors Databases Datastore/File Store Analytics Applications / Other
  • 30. 30 How Connect Works? Log Connector MQTT Connector REST API Logs MQTT Log Task Log Task MQTT Task MQTT Task
  • 31. 31 Connect + Single Message Transformations Mix: Ridiculously parallel database integration With Ridiculously parallel simple transformations
  • 32. 32 Promotion! We want to send Platinum members Who are looking at beach properties An email about Discount package in new Hawaii hotel
  • 33. 33 Pattern #4 – Enriching events From Event sessionID timestamp loyaltyID propertyID To Enriched Event sessionID timestamp loyaltyID Level isPlatium propertyID Location isBeach
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40. 40 We use Kafka, Connect and CDC Connectors To turn state (in DB) Into stream of events (in Kafka) And back into state (in application cache)
  • 41.
  • 42. 42 In reality… • Maintaining the distributed cache isn’t trivial • How do we persist state? Failover? Recover? • How do we co-partition for joins? KafkaStreams makes it easy: Ktable loyaltyTbl = builder.table(…, “loyalty-cdc-topic”) PageViewStream.leftJoin(loyaltyTbl, <operation>) Example: https://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events- using-kafka-streams/
  • 43. Few  Patterns 1. Stream all things (in one place) 2. Keep Compatible and Process On 3. Ridiculously Parallel Single Message Transformations 4. Streaming Data Enrichment