(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud-first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases: data is ingested from a wide variety of sources and moves effortlessly between an on-premises datacenter, AWS and Google Cloud. The DataHub has evolved continuously to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding new data sources is relatively painless, and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
3. About Me
• Bob Lehmann
• Started life as an Electrical Engineer, switched to IT 20 years ago
• Have worked with data in many capacities – sensors and controls, manufacturing process data, ERP systems, enterprise data, etc.
• Architect and manage the Enterprise DataHub at Bayer
• Live in St. Louis, MO
8. The Journey Starts Here
Circa 2014…
• Siloed IT org with different tech stacks (Bayer IT org > 4000)
• MANY legacy systems and platforms
• Bayer adopted “cloud first” philosophy
• Embraced open source (finally 🙂)
• Cross-functional team of architects was established to define strategies and architectures
DIRECTIVE: Develop a strategy for cloud-based enterprise-wide analytics
9. Houston, We Have A Data Problem
Legacy:
• Data sprawl
• Data inconsistency
• Difficult to find data
• Can’t propagate changes fast enough
Cloud:
• Increased data sprawl
• Can’t forklift applications to cloud
• Cloud apps need on-prem data and vice versa
The four Vs: Volume, Variety, Velocity, Veracity
10. Let’s Clean Up This Mess!
[Diagram, courtesy Jay Kreps: the familiar “before” picture of point-to-point integration (apps, relational databases, caches and derived stores, ODS/data warehouse via Data Guard, Hadoop fed by CSV dumps, ActiveMQ, Splunk log aggregation, NFS/rsync, ad hoc transforms and loads) contrasted with the “after” picture, where Kafka sits at the center feeding log search, monitoring, real-time analytics, social graph, search, newsfeed, OLAP, security and fraud, Samza apps, key-value storage, Oracle, Hadoop and Teradata.]
11. The Enterprise DataHub – Original Concept
- Kafka clusters on prem and in AWS, connected over a VPN
- Datacenter agnostic
- Establish cross-datacenter connection and replicate across datacenters (MirrorMaker?)
- Apps only interact with the local Kafka cluster
- Use AVRO schemas with the Schema Registry (a producer sketch follows)
- Maybe GCP in the future?
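The deck shows no code, but “use AVRO schemas” with a Schema Registry looks roughly like the following producer sketch using the confluent-kafka Python client. The topic name, schema fields and URLs are hypothetical placeholders, not details from the talk:

# Minimal sketch: produce an Avro-encoded record so the Schema Registry can
# register and enforce the schema. Topic, schema and URLs are placeholders.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.loads("""
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "value",     "type": "double"},
    {"name": "ts",        "type": "long"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "kafka.local:9092",          # local cluster only
        "schema.registry.url": "http://registry.local:8081",
    },
    default_value_schema=value_schema,
)

# The serializer registers/validates the schema with the registry on first use.
producer.produce(topic="sensors.readings",
                 value={"sensor_id": "A1", "value": 21.5, "ts": 1536000000000})
producer.flush()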
13. What a Long, Strange Trip It’s Been!
First Phase – The Launch
Second Phase – Reaching Orbit
Third Phase – Escaping Gravity
Current Phase – To Infinity And Beyond…
(Platform capabilities added along the way: Kafka, documentation site, portal, Kafka Connect, Kafka Streams, KSQL.)
14. First Phase - The Launch
September, 2016
• Confluent 2.0 / Kafka 0.9
• Security via SSL certs – developed a patch to dynamically load broker certs
• Replikant – process built to replace MirrorMaker (a conceptual sketch follows this list)
• Basic platform monitoring
• Most user interaction via command-line tools
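Replikant’s internals are not covered in the talk. At its core, though, any such replicator is a consumer on the source cluster feeding a producer on the target cluster. A minimal conceptual sketch with the confluent-kafka Python client; cluster addresses and topic names are placeholders, and this is not Bayer’s actual tool:

# Conceptual sketch only: replicate one topic from the on-prem cluster to the
# AWS cluster. Addresses, group id and topic name are placeholders.
from confluent_kafka import Consumer, Producer

source = Consumer({
    "bootstrap.servers": "onprem-kafka:9092",
    "group.id": "replikant-demo",
    "auto.offset.reset": "earliest",
})
target = Producer({"bootstrap.servers": "aws-kafka:9092"})

source.subscribe(["topic-1"])
try:
    while True:
        msg = source.poll(1.0)
        if msg is None or msg.error():
            continue
        # Preserve key and payload; partitioning is left to the target cluster.
        target.produce(msg.topic(), key=msg.key(), value=msg.value())
        target.poll(0)  # serve delivery callbacks
finally:
    source.close()
    target.flush()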
[Diagram: on-prem and AWS Kafka clusters, each with topics 1–6, a REST Proxy and a Schema Registry; Replikant processes running on EC2 Container Service replicate topics across a VPN tunnel; core platform monitoring, a documentation portal and Kafka Manager sit alongside.]
15. Phase 1 Results
Users: Java/Scala developers, AWS skills, Linux command line.
Use cases: data movement from on-prem to cloud; Customer 360 (bidirectional movement).
21. Phase 2 Results
Everything from Phase 1, plus:
Users: Python, Node, etc.
Use cases: some analytics, some BI, event sourcing, Company360, Exadata CDC, incremental migration to the cloud.
22. Stage 3 - Leaving Orbit
• Kubernetes / Kafka Connect
• Expansion to Google Cloud
• CDC from SAP using Informatica Data Replication
• Integration with Data Historian
• Detailed training class
[Diagram: the Phase 1 architecture extended: Replikant clusters on both sides of the VPN tunnel; Kubernetes-hosted Kafka Connect workers (JDBC, JMS, S3, Elasticsearch connectors) in each environment; a user self-service portal; consumer and Replikant monitoring with Slack integration; Data Historian ingestion; and EMR with Hive, Spark and Presto over S3/Parquet.]
23. Kafka Connect and Kubernetes
Kafka Connect:
• Code-free, simple (a registration sketch follows this slide)
• Connector universe is expanding rapidly
• Secure – SSL connection
• AVRO support
Connectors in use:
• JDBC (Oracle, Postgres, MySQL, SQL Server, Teradata, Redshift)
• JMS
• S3
• File
• Elasticsearch
Kubernetes:
• Highly scalable
• Cluster in each environment
• Keeps processing local to the environment
• Efficient use of resources
• Increased security
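“Code-free” in practice means a connector is declared as configuration and registered over the Kafka Connect REST API. A hedged sketch: the host names, credentials and table names are placeholders, and while the connector class is Confluent’s standard JDBC source, this is not necessarily the exact configuration Bayer uses:

# Sketch of "code-free" ingestion: a JDBC source connector is just JSON posted
# to the Kafka Connect REST API. Hosts, credentials and tables are placeholders.
import requests

connector = {
    "name": "postgres-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.local:5432/sales",
        "connection.user": "connect",
        "connection.password": "********",
        "table.whitelist": "orders",
        "mode": "incrementing",            # poll only newly inserted rows
        "incrementing.column.name": "id",
        "topic.prefix": "jdbc.sales.",     # rows land on topic jdbc.sales.orders
    },
}

resp = requests.post("http://connect.local:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())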
26. Phase 3 Results
Everything from Phases 1 and 2, plus:
Use cases: data movement between all datacenters on a central platform; LIMS integration; serverless apps in AWS; SAP/Oracle CDC; Product360 in GCP (Go); BI/reporting; analytics platform.
27. Current Phase – To Infinity And Beyond
• Bring stream processing to the masses!
• Data validation across the pipeline
• SQL interface for Kafka (using Presto)
• Improve topic discoverability and reuse
• Expose consumer metrics to end users (see the lag sketch after this list)
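The consumer metric end users ask about most is lag: the broker’s high watermark minus the group’s committed offset, per partition. A minimal sketch with the confluent-kafka Python client; the group name, topic and partition count are placeholders, not details from the talk:

# Report per-partition lag for a consumer group: high watermark minus the
# group's committed offset. Group, topic and partition count are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka.local:9092",
    "group.id": "orders-consumer",   # the group whose lag we want
    "enable.auto.commit": False,
})

partitions = [TopicPartition("jdbc.sales.orders", p) for p in range(6)]
committed = consumer.committed(partitions, timeout=10)

for tp in committed:
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # A negative offset means the group has no committed position yet.
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"partition {tp.partition}: committed={tp.offset} high={high} lag={lag}")

consumer.close()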
[Diagram: the current architecture: Replikant clusters bridging the on-prem and cloud Kafka clusters over the VPN tunnel, each side fronted by a REST Proxy and Schema Registry; Kubernetes in each environment hosting both I/O (File, JDBC, JMS, S3, Elasticsearch connectors) and stream processing (KSQL, Kafka Streams, custom stream processors); Data Historian ingestion; EMR with Hive, Spark and Presto over S3/Parquet; the Presto SQL engine; and the Haystack metadata platform.]
30. Use Case – CDC From SAP to Data Historian
The Schema Splitter converts an input stream with a “generic” schema (many different tables flowing through one topic) into individual table streams with table-specific schemas (sketched below).
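The Schema Splitter itself is not shown in the talk; conceptually it fans one catch-all CDC topic out into per-table topics, routing on a table-name field in each record. A minimal sketch with hypothetical topic and field names, using JSON payloads where the real pipeline uses Avro:

# Conceptual sketch of the Schema Splitter: fan records out from one "generic"
# CDC topic to per-table topics. Topic and field names are placeholders.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka.local:9092",
    "group.id": "schema-splitter",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka.local:9092"})

consumer.subscribe(["sap.cdc.generic"])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        table = record["table_name"]          # hypothetical routing field
        # Each table gets its own topic, so downstream consumers see a
        # table-specific schema instead of the catch-all one.
        producer.produce(f"sap.cdc.{table.lower()}", key=msg.key(), value=msg.value())
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()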
[Diagram: SAP feeds Informatica Data Replication, which lands changes in Oracle (ETL) and then Kafka as one generic topic; a Kubernetes cluster runs the Schema Splitter plus KSQL filters/aggregations and Kafka Streams processors, producing per-DB-schema and per-table topics; Replikant carries the streams over the VPN tunnel to the remote Kafka cluster for Data Historian ingestion (EMR, Hive, Spark, Presto over S3/Parquet) and loading into Teradata staging and final tables; Schema Registries and Zookeepers on both sides.]
31. Topic Reuse
• Not as good as we would like. Why?
• Discoverability
• Developers are not altruistic when creating schemas
32. Metadata Platform
• Haystack is our enterprise metadata platform
• Kafka topic metadata is automatically synced to Haystack (a sketch of the idea follows)
• Haystack links back to the DataHub portal
• Will be able to search for topics in Haystack and immediately find the topic in the DataHub portal
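The talk does not detail the sync mechanism. The basic idea can be sketched as enumerating topics via the Kafka admin API and pushing them to a metadata service; the Haystack endpoint and payload below are hypothetical:

# Sketch of the idea behind the Haystack sync: list topics via the Kafka admin
# API and push them to a metadata service. The endpoint and payload shape are
# hypothetical; Haystack's real API is not described in the talk.
import requests
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka.local:9092"})
metadata = admin.list_topics(timeout=10)

for name, topic in metadata.topics.items():
    if name.startswith("_"):          # skip internal topics
        continue
    requests.put(
        f"http://haystack.local/api/datasets/kafka/{name}",   # hypothetical API
        json={"name": name, "partitions": len(topic.partitions)},
    ).raise_for_status()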
33. Apache Presto
• Presto is being implemented as an enterprise data virtualization solution, not just for the DataHub
• Will also be used to provide data validation across the pipeline via SQL. Example: join a topic in Kafka to a table in Postgres to confirm that all data has transferred correctly (see the sketch below)
• Will also be used to provide a SQL interface in the DataHub portal to allow querying for specific messages
• Developed a patch to the Presto Kafka connector to connect with SSL and deserialize AVRO
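A hedged sketch of the validation example, issued from Python via the presto-python-client. Catalog, schema and table names are placeholders, and it assumes Presto’s Kafka and PostgreSQL connectors are configured (the Kafka connector also needs a table description mapping the topic’s fields to columns):

# Sketch of the slide's validation idea: join a Kafka topic against the
# Postgres table it was loaded from and count rows missing from Kafka.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.local", port=8080, user="datahub",
    catalog="kafka", schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT count(*) AS missing
    FROM postgresql.sales.orders o
    LEFT JOIN kafka.default."jdbc.sales.orders" k
           ON o.id = k.id
    WHERE k.id IS NULL
""")
print("rows not yet in Kafka:", cur.fetchone()[0])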
35. Phase 4 – Current Phase
Everything from Phases 1–3, plus:
Users: data stewards, and everyone else!!
Use cases: global streaming, IoT data, SAP HANA.
36. Future
• Move as much ETL as possible to the streaming layer
• Monitoring and auditing of data flow across the pipeline
• Consumer monitoring and configurable alerting
• Improved data governance
• Integrate with enterprise security platform
37. The Enterprise DataHub is a living, scalable, robust central nervous system for data that facilitates the seamless acquisition, transport and processing of information in real time across multiple datacenter and cloud environments.