As a data integration professional, it’s almost a guarantee that you’ve heard of real-time stream processing of Big Data. The usual players in the open source world are Apache Kafka, used to move data in real time, and Spark Streaming, built for in-flight transformations. But what about relational data? Quite often we forget that products incubated in the Apache Foundation can serve a purpose for “standard” relational databases as well. But how? Well, let’s introduce Oracle GoldenGate and Oracle Data Integrator for Big Data. GoldenGate can extract relational data in real time and produce Kafka messages, ensuring relational data is a part of the enterprise data bus. These messages can then be ingested via ODI through a Spark Streaming process, integrating with additional data sources, such as other relational tables, flat files, etc., as needed. Finally, the output can be sent to multiple locations: on to a data warehouse for analytical reporting, back to Kafka for additional targets to consume, or any number of other targets. Attendees will walk away with a framework on which they can build their data streaming projects, combining relational data with big data and using a common, structured approach via the Oracle Data Integration product stack.
Presented at BIWA Summit 2017.
14. Why GoldenGate with Kafka?
• GoldenGate…
• …is non-invasive
• …has checkpoints for recovery
• …moves data quickly
• …is easy to set up
15. Why Oracle Data Integrator with Spark Streaming?
• Heterogeneous sources and targets
• Built to integrate all data
• Flexibility
• Reusable code templates (Knowledge Modules)
• Reusable Mappings
• ODI can adapt to your data warehouse, not the other way around
• Flow-based mappings
21. Kafka and Oracle Data Integrator
• Create a Model using the Kafka Logical Schema
• Create a Datastore
• Similar to a standard “File” datastore: define the file format and set up the columns
• Currently only CSV is supported
• Future formats may include JSON, Avro, etc.
• Add the Datastore to the mapping
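Because the Kafka datastore currently carries CSV-formatted messages, the generated job ultimately has to split each message back into the columns defined on the datastore. Here is a minimal sketch of that parsing step in plain Python; the column names and sample record are hypothetical, purely for illustration:

```python
import csv
import io

# Hypothetical column layout for a CSV-formatted Kafka message;
# in ODI these would be the columns defined on the Kafka datastore.
COLUMNS = ["op_type", "op_ts", "customer_id", "customer_name", "credit_limit"]

def parse_csv_message(message, columns=COLUMNS):
    """Split one CSV message into a dict keyed by the datastore columns."""
    values = next(csv.reader(io.StringIO(message)))
    return dict(zip(columns, values))

record = parse_csv_message('I,2017-01-31 09:15:02,101,"Acme, Corp",5000')
# Using the csv module (rather than message.split(",")) keeps quoted
# commas intact, so record["customer_name"] comes back as 'Acme, Corp'.
```

A real pySpark mapping would apply a function like this inside a map over the stream; the point here is only the per-message column split.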
22. Spark Streaming and Oracle Data Integrator
• Create a Spark Data Server and Physical / Logical Schemas
• Set the Hadoop Data Server
• Add properties, such as checkpointing, asynchronous execution mode, etc.
• Additional properties can be added: http://spark.apache.org/docs/latest/configuration.html
• The Spark Server is set up as the staging location
• Source Datastore from Kafka, Oracle DB, etc.
• Target Datastore is Cassandra, Oracle DB, etc.
• Code generated by the KM is pySpark
• pySpark code can be added to filters, joins, and other components for transformations
• Additional languages (Scala, Java) may be coming soon
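As a rough illustration, the data server properties might look something like the fragment below. The checkpoint directory entry is an assumed, illustrative property name; the last two are standard Spark settings documented on the configuration page referenced above:

```properties
# Hypothetical ODI Spark data server properties (names are illustrative)
spark.checkpointingBaseDir=hdfs:///user/odi/checkpoints

# Standard Spark settings, per http://spark.apache.org/docs/latest/configuration.html
spark.executor.memory=2g
spark.streaming.backpressure.enabled=true
```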