(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud-first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases: data is ingested from a wide variety of sources and moves effortlessly between an on-premises datacenter, AWS and Google Cloud. The DataHub has evolved continuously to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding new data sources is relatively painless, and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
3. About Me
• Bob Lehmann
• Started life as an Electrical Engineer, switched to IT 20 years ago
• Have worked with data in many capacities – sensors and controls, manufacturing process data, ERP systems, enterprise data, etc.
• Architect and manage the Enterprise DataHub at Bayer
• Live in St. Louis, MO
8. The Journey Starts Here
Circa 2014…
• Siloed IT org with different tech stacks (Bayer IT org > 4000)
• MANY legacy systems and platforms
• Bayer adopted “cloud first” philosophy
• Embraced open source (finally 🙂)
• Cross-functional team of architects was established to define strategies and architectures
DIRECTIVE: Develop a strategy for cloud-based enterprise-wide analytics
9. Houston, We Have A Data Problem
Legacy:
• Data sprawl
• Data inconsistency
• Difficult to find data
• Can’t propagate changes fast enough
Cloud:
• Increased data sprawl
• Can’t forklift applications to cloud
• Cloud apps need on-prem data and vice versa
The four Vs: Volume, Variety, Velocity, Veracity
10. Let’s Clean Up This Mess!
[Diagram, courtesy Jay Kreps: the familiar “before” picture of point-to-point integration (apps, relational databases, caches and derived stores, ODS/data warehouse via Data Guard, Hadoop fed by CSV dumps, ActiveMQ, Splunk log aggregation, NFS/rsync, ad hoc transforms and loads) contrasted with the “after” picture, where Kafka sits at the center feeding log search, monitoring, real-time analytics, social graph, search, newsfeed, OLAP, security and fraud, Samza apps, key-value storage, Oracle, Hadoop and Teradata.]
11. The Enterprise DataHub – Original Concept
- Kafka clusters on prem and in AWS, connected over a VPN
- Datacenter agnostic
- Establish cross-datacenter connection and replicate across datacenters (MirrorMaker?)
- Apps only interact with the local Kafka cluster
- Use AVRO schemas with the Schema Registry (a producer sketch follows)
- Maybe GCP in the future?
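The deck shows no code, but “use AVRO schemas” with a Schema Registry looks roughly like the following producer sketch using the confluent-kafka Python client. The topic name, schema fields and URLs are hypothetical placeholders, not details from the talk:

# Minimal sketch: produce an Avro-encoded record so the Schema Registry can
# register and enforce the schema. Topic, schema and URLs are placeholders.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.loads("""
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "value",     "type": "double"},
    {"name": "ts",        "type": "long"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "kafka.local:9092",          # local cluster only
        "schema.registry.url": "http://registry.local:8081",
    },
    default_value_schema=value_schema,
)

# The serializer registers/validates the schema with the registry on first use.
producer.produce(topic="sensors.readings",
                 value={"sensor_id": "A1", "value": 21.5, "ts": 1536000000000})
producer.flush()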
13. What a Long, Strange Trip It’s Been!
First Phase – The Launch
Second Phase – Reaching Orbit
Third Phase – Escaping Gravity
Current Phase – To Infinity And Beyond…
(Platform capabilities added along the way: Kafka, documentation site, portal, Kafka Connect, Kafka Streams, KSQL.)
14. First Phase - The Launch
September, 2016
• Confluent 2.0 / Kafka 0.9
• Security via SSL certs – developed a patch to dynamically load broker certs
• Replikant – process built to replace MirrorMaker (a conceptual sketch follows this list)
• Basic platform monitoring
• Most user interaction via command-line tools
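Replikant’s internals are not covered in the talk. At its core, though, any such replicator is a consumer on the source cluster feeding a producer on the target cluster. A minimal conceptual sketch with the confluent-kafka Python client; cluster addresses and topic names are placeholders, and this is not Bayer’s actual tool:

# Conceptual sketch only: replicate one topic from the on-prem cluster to the
# AWS cluster. Addresses, group id and topic name are placeholders.
from confluent_kafka import Consumer, Producer

source = Consumer({
    "bootstrap.servers": "onprem-kafka:9092",
    "group.id": "replikant-demo",
    "auto.offset.reset": "earliest",
})
target = Producer({"bootstrap.servers": "aws-kafka:9092"})

source.subscribe(["topic-1"])
try:
    while True:
        msg = source.poll(1.0)
        if msg is None or msg.error():
            continue
        # Preserve key and payload; partitioning is left to the target cluster.
        target.produce(msg.topic(), key=msg.key(), value=msg.value())
        target.poll(0)  # serve delivery callbacks
finally:
    source.close()
    target.flush()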
[Diagram: on-prem and AWS Kafka clusters, each with topics 1–6, a REST Proxy and a Schema Registry; Replikant processes running on EC2 Container Service replicate topics across a VPN tunnel; core platform monitoring, a documentation portal and Kafka Manager sit alongside.]
15. Phase 1 Results
Users: Java/Scala developers, AWS skills, Linux command line.
Use cases: data movement from on-prem to cloud; Customer 360 (bidirectional movement).
21. Phase 2 Results
Everything from Phase 1, plus:
Users: Python, Node, etc.
Use cases: some analytics, some BI, event sourcing, Company360, Exadata CDC, incremental migration to the cloud.
22. Stage 3 - Leaving Orbit
• Kubernetes / Kafka Connect
• Expansion to Google Cloud
• CDC from SAP using Informatica Data Replication
• Integration with Data Historian
• Detailed training class
[Diagram: the Phase 1 architecture extended: Replikant clusters on both sides of the VPN tunnel; Kubernetes-hosted Kafka Connect workers (JDBC, JMS, S3, Elasticsearch connectors) in each environment; a user self-service portal; consumer and Replikant monitoring with Slack integration; Data Historian ingestion; and EMR with Hive, Spark and Presto over S3/Parquet.]
23. Kafka Connect and Kubernetes
Kafka Connect:
• Code-free, simple (a registration sketch follows this slide)
• Connector universe is expanding rapidly
• Secure – SSL connection
• AVRO support
Connectors in use:
• JDBC (Oracle, Postgres, MySQL, SQL Server, Teradata, Redshift)
• JMS
• S3
• File
• Elasticsearch
Kubernetes:
• Highly scalable
• Cluster in each environment
• Keeps processing local to the environment
• Efficient use of resources
• Increased security
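“Code-free” in practice means a connector is declared as configuration and registered over the Kafka Connect REST API. A hedged sketch: the host names, credentials and table names are placeholders, and while the connector class is Confluent’s standard JDBC source, this is not necessarily the exact configuration Bayer uses:

# Sketch of "code-free" ingestion: a JDBC source connector is just JSON posted
# to the Kafka Connect REST API. Hosts, credentials and tables are placeholders.
import requests

connector = {
    "name": "postgres-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.local:5432/sales",
        "connection.user": "connect",
        "connection.password": "********",
        "table.whitelist": "orders",
        "mode": "incrementing",            # poll only newly inserted rows
        "incrementing.column.name": "id",
        "topic.prefix": "jdbc.sales.",     # rows land on topic jdbc.sales.orders
    },
}

resp = requests.post("http://connect.local:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())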
26. Phase 3 Results
Everything from Phases 1 and 2, plus:
Use cases: data movement between all datacenters on a central platform; LIMS integration; serverless apps in AWS; SAP/Oracle CDC; Product360 in GCP (Go); BI/reporting; analytics platform.
27. Current Phase – To Infinity And Beyond
• Bring stream processing to the masses!
• Data validation across the pipeline
• SQL interface for Kafka (using Presto)
• Improve topic discoverability and reuse
• Expose consumer metrics to end users (see the lag sketch after this list)
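The consumer metric end users ask about most is lag: the broker’s high watermark minus the group’s committed offset, per partition. A minimal sketch with the confluent-kafka Python client; the group name, topic and partition count are placeholders, not details from the talk:

# Report per-partition lag for a consumer group: high watermark minus the
# group's committed offset. Group, topic and partition count are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka.local:9092",
    "group.id": "orders-consumer",   # the group whose lag we want
    "enable.auto.commit": False,
})

partitions = [TopicPartition("jdbc.sales.orders", p) for p in range(6)]
committed = consumer.committed(partitions, timeout=10)

for tp in committed:
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # A negative offset means the group has no committed position yet.
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"partition {tp.partition}: committed={tp.offset} high={high} lag={lag}")

consumer.close()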
[Diagram: the current architecture: Replikant clusters bridging the on-prem and cloud Kafka clusters over the VPN tunnel, each side fronted by a REST Proxy and Schema Registry; Kubernetes in each environment hosting both I/O (File, JDBC, JMS, S3, Elasticsearch connectors) and stream processing (KSQL, Kafka Streams, custom stream processors); Data Historian ingestion; EMR with Hive, Spark and Presto over S3/Parquet; the Presto SQL engine; and the Haystack metadata platform.]
30. Use Case – CDC From SAP to Data Historian
The Schema Splitter converts an input stream with a “generic” schema (many different tables flowing through one topic) into individual table streams with table-specific schemas (sketched below).
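The Schema Splitter itself is not shown in the talk; conceptually it fans one catch-all CDC topic out into per-table topics, routing on a table-name field in each record. A minimal sketch with hypothetical topic and field names, using JSON payloads where the real pipeline uses Avro:

# Conceptual sketch of the Schema Splitter: fan records out from one "generic"
# CDC topic to per-table topics. Topic and field names are placeholders.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka.local:9092",
    "group.id": "schema-splitter",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka.local:9092"})

consumer.subscribe(["sap.cdc.generic"])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        table = record["table_name"]          # hypothetical routing field
        # Each table gets its own topic, so downstream consumers see a
        # table-specific schema instead of the catch-all one.
        producer.produce(f"sap.cdc.{table.lower()}", key=msg.key(), value=msg.value())
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()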
[Diagram: SAP feeds Informatica Data Replication, which lands changes in Oracle (ETL) and then Kafka as one generic topic; a Kubernetes cluster runs the Schema Splitter plus KSQL filters/aggregations and Kafka Streams processors, producing per-DB-schema and per-table topics; Replikant carries the streams over the VPN tunnel to the remote Kafka cluster for Data Historian ingestion (EMR, Hive, Spark, Presto over S3/Parquet) and loading into Teradata staging and final tables; Schema Registries and Zookeepers on both sides.]
31. Topic Reuse
• Not as good as we would like. Why?
• Discoverability
• Developers are not altruistic when creating schemas
32. Metadata Platform
• Haystack is our enterprise metadata platform
• Kafka topic metadata is automatically synced to Haystack (a sketch of the idea follows)
• Haystack links back to the DataHub portal
• Will be able to search for topics in Haystack and immediately find the topic in the DataHub portal
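The talk does not detail the sync mechanism. The basic idea can be sketched as enumerating topics via the Kafka admin API and pushing them to a metadata service; the Haystack endpoint and payload below are hypothetical:

# Sketch of the idea behind the Haystack sync: list topics via the Kafka admin
# API and push them to a metadata service. The endpoint and payload shape are
# hypothetical; Haystack's real API is not described in the talk.
import requests
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka.local:9092"})
metadata = admin.list_topics(timeout=10)

for name, topic in metadata.topics.items():
    if name.startswith("_"):          # skip internal topics
        continue
    requests.put(
        f"http://haystack.local/api/datasets/kafka/{name}",   # hypothetical API
        json={"name": name, "partitions": len(topic.partitions)},
    ).raise_for_status()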
33. Apache Presto
• Presto is being implemented as an enterprise data virtualization solution, not just for the DataHub
• Will also be used to provide data validation across the pipeline via SQL. Example: join a topic in Kafka to a table in Postgres to confirm that all data has transferred correctly (see the sketch below)
• Will also be used to provide a SQL interface in the DataHub portal to allow querying for specific messages
• Developed a patch to the Presto Kafka connector to connect with SSL and deserialize AVRO
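A hedged sketch of the validation example, issued from Python via the presto-python-client. Catalog, schema and table names are placeholders, and it assumes Presto’s Kafka and PostgreSQL connectors are configured (the Kafka connector also needs a table description mapping the topic’s fields to columns):

# Sketch of the slide's validation idea: join a Kafka topic against the
# Postgres table it was loaded from and count rows missing from Kafka.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.local", port=8080, user="datahub",
    catalog="kafka", schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT count(*) AS missing
    FROM postgresql.sales.orders o
    LEFT JOIN kafka.default."jdbc.sales.orders" k
           ON o.id = k.id
    WHERE k.id IS NULL
""")
print("rows not yet in Kafka:", cur.fetchone()[0])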
35. Phase 4 – Current Phase
Everything from Phases 1–3, plus:
Users: data stewards, and everyone else!!
Use cases: global streaming, IoT data, SAP HANA.
36. Future
• Move as much ETL as possible to the streaming layer
• Monitoring and auditing of data flow across the pipeline
• Consumer monitoring and configurable alerting
• Improved data governance
• Integrate with enterprise security platform
37. The Enterprise DataHub is a living, scalable, robust central nervous system for data that facilitates the seamless acquisition, transport and processing of information in real time across multiple datacenter and cloud environments.