Stream based Data Integration

Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
OOW2018 – Data Integration
Modern Stream-based Data Integration
Product Development
October, 2018
Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Oracle OpenWorld 2018

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, timing and pricing of any
features or functionality described for Oracle’s products may change and remains at the
sole discretion of Oracle Corporation.

Data Democratization
Data Monetization
Self-service IT
Open-Data
Regulatory Governance
Digital Transformation
Market Disruption
Customer 360
WHAT DOES THE BUSINESS WANT?
BUSINESS IMPERATIVES

DATA INTEGRATION
MAKES IT POSSIBLE
Applications
Polyglot Data
Databases
Data Services
Analytics & Data Warehouse
Data Lake & Data Science
DATA INTEGRATION & GOVERNANCE

MODERN DATA “PLUMBING”
IS CRITICALLY IMPORTANT FOR MODERN DATA ARCHITECTURE
“Integration tops the list of challenges in the world of data
and analytics today. […] Co-located data is not the same as
integrated data. […] You have to have something to connect
the dots.”
https://www.cio.com/article/3269012/analytics/why-data-analytics-initiatives-still-fail.html
Why Data Analytics Initiatives Still Fail
April, 2018

WHY IS THE DATA “PLUMBING” SO IMPORTANT?
#1 - Feedback Loops Need to be Quicker
Business demand for faster, data-driven decisions
#2 - More Data from More Sources
Data inputs happening faster and from more devices
#3 - New Regulatory Pressure for Transparency
Every industry and geography under more scrutiny

OLD-STYLE DATA PLUMBING NOT GOOD ENOUGH ANYMORE
Traditional Approaches
• Batch processing, hub-and-spoke Kimball-style solution
• Data storage is a mix of file system, database, Hadoop
• Data governance is ad hoc and inconsistent
Modern Approaches
• Data workloads are in motion
• Data at rest is in object storage
• All data inventory is in a catalog

Hadoop is dead
Hub-and-
Nope!
ETL tools are toast
EDW is a dinosaur

Streaming
Pipelines
Data Sushi
(HIPSTER)
Serverless

What is Modern?
#1 – Ingest happens any time data is available
#2 – Processing can happen at any latency
#3 – Data is available in the consumer’s format
#4 – Infrastructure is Serverless
#5 – Data at rest is on Object Storage
Source Data: Application Data, Polyglot Data, SQL & NoSQL Data
Raw Data Ingest
Stream Processing
Batch Processing & Long Term Storage
Serving Layer
Consumers: Data Lake & Data Science, Data Services for Applications, EDW & Analytics for Reporting

What is Modern Data Ingestion?
#1 – Ingest happens any time data is available
Raw Data Ingest
Copying data is bad, but overloading the source application is worse
• Physical co-location is usually first step in a Data Lake
• Optimize for minimal impact on source systems
• Replication of changed data is usually best option for databases
Event driven is mandatory in new world order
• Move data the moment it can, don’t wait for a job scheduler
• Some data is process-bound to batch
Support key styles of data movement and virtualization
• Parallel Data Copy for File-centric Data
• Database Replication for Relational Data
• Bulk Extraction or Storage Replication for full copies
• Data Federation/Virtualization is a “nice to have” but most source systems can’t take the pain
Initial loads subject to laws of physics and speed of light
• Terabytes take time
• Optimize for database bulk copy programs (BCP for block level unloaders) or big data copy programs (DCP is highly parallel)
• There is no such thing as magic
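The point above about replicating changed data, rather than re-copying whole tables, can be sketched in a few lines. This is a minimal illustration only: the change-record shape (`op`, `key`, `row`) and the in-memory dict used as the target are assumptions made for the example, not the format of GoldenGate or any other replication product.

```python
from typing import Any, Dict, Iterable, Optional, TypedDict


class ChangeRecord(TypedDict):
    """Hypothetical change record; the field names are illustrative only."""
    op: str                         # "I" = insert, "U" = update, "D" = delete
    key: int                        # primary key of the affected row
    row: Optional[Dict[str, Any]]   # new column values (None for deletes)


def apply_changes(target: Dict[int, Dict[str, Any]],
                  changes: Iterable[ChangeRecord]) -> Dict[int, Dict[str, Any]]:
    """Apply change records in order so the target converges on the source."""
    for change in changes:
        if change["op"] == "D":
            target.pop(change["key"], None)
        elif change["op"] == "I":
            target[change["key"]] = dict(change["row"] or {})
        elif change["op"] == "U":
            # Assumption: updates carry only the changed columns, so merge them.
            target.setdefault(change["key"], {}).update(change["row"] or {})
        else:
            raise ValueError(f"unknown op: {change['op']!r}")
    return target


replica = {1: {"name": "Ada", "city": "London"}}
log = [
    {"op": "I", "key": 2, "row": {"name": "Grace", "city": "NYC"}},
    {"op": "U", "key": 1, "row": {"city": "Paris"}},
    {"op": "D", "key": 2, "row": None},
]
apply_changes(replica, log)
print(replica)  # {1: {'name': 'Ada', 'city': 'Paris'}}
```

Only three small records cross the wire here, regardless of how large the source table is, which is why changed-data replication keeps the load on the source system low.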
Bulk Copy Utilities
Source Data: Application Data, Polyglot Data, SQL & NoSQL Data

GoldenGate is Modern
https://www.oracle.com/us/products/middleware/data-integration/oracle-goldengate-innovations-wp-5093027.pdf
NON-RELATIONAL DATA
KERNEL INTEGRATIONS
REMOTE CAPTURE
SIMPLIFICATION
MICROSERVICES
CONTAINERS
MONITORING
STREAM ANALYTICS
CLOUD
SUBSCRIPTIONS
GOLDENGATE FOR STREAMING BIG DATA:

What is Modern Data Processing?
#2 – Processing can happen at any latency
Stream Processing
Batch Processing
Pipeline Editor
Object Storage
Streaming Use Cases
• Clickstream Analytics
• Recommendation Engines
• Fraud Detection / Alerting
Pipeline “Logic” Layer
• Specify data pipelines, rules and embed Machine Learning
• Architecture decoupling: independent from the engines
• Improve usability for analysts
Streaming Engines
• Oracle Preferred: Flink for true streaming, Spark for micro-batching and ML use cases
• Others: Storm, Kafka Streams, Samza, or Vendor Proprietary
Batch Use Cases
• ETL Offloading
• Data Lake Loading
• Large Scale Analytics
Batch (MPP) Processing
• Engines: Spark for most cases, Hive or Flink in special cases
• Storage: Object Storage
Interactive Data Access
• SQL for direct query of very large data sets:
  • Hive SQL (basic)
  • Spark SQL (basic)
  • Sparkline OLAP (advanced)
• Machine Learning & Graph
  • MLlib
  • GraphX
* In your own data center you can have mixed workloads run from the same physical clusters (i.e., Kappa-style), but in the Cloud you should only pay for what you use and not care how the infrastructure is managed…
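The idea of a pipeline “logic” layer that stays independent from the engines can be illustrated with a small sketch. Everything here is hypothetical: the rule, the threshold, and the two toy runners merely stand in for a true streaming engine and a micro-batching engine, and none of it uses the Flink or Spark APIs.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Event = dict
Logic = Callable[[Event], Optional[Event]]


def flag_large_payments(event: Event) -> Optional[Event]:
    """Pipeline logic: keep payments over a threshold and tag them."""
    if event["amount"] > 1000:  # threshold is an arbitrary example value
        return {**event, "alert": True}
    return None


def run_streaming(logic: Logic, events: Iterable[Event]) -> Iterator[Event]:
    """Per-event execution: a result is emitted as soon as its event arrives."""
    for event in events:
        out = logic(event)
        if out is not None:
            yield out


def run_micro_batch(logic: Logic, events: Iterable[Event],
                    batch_size: int) -> Iterator[List[Event]]:
    """Micro-batch execution: events are buffered and processed in small groups."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield [o for o in map(logic, batch) if o is not None]
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield [o for o in map(logic, batch) if o is not None]


events = [{"id": i, "amount": a} for i, a in enumerate([50, 5000, 20, 1200, 900])]
streamed = list(run_streaming(flag_large_payments, events))
batched = [o for group in run_micro_batch(flag_large_payments, events, 2) for o in group]
print(streamed == batched)  # True
```

The same `flag_large_payments` function runs unchanged under both runners; only the latency at which results appear differs, which is the decoupling the slide describes.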

What is a Modern Serving Layer?
#3 – Data is available in the consumer’s format
Serving Layer
Streaming API for Real-Time Data
• Publish and Subscribe (Kafka), with Apps-friendly REST APIs
• REST-based means that API Gateways can be used for Secure ACLs
• HTTPS transmission and no file system access to data
• Data redaction at API / contract level
SQL-based Access for Interactive Reporting
• Bulk data movement (ETL) out to external data warehouses/marts
• Direct SQL access to data stored in the Data Lake (Spark OLAP)
• Most widely used data manipulation language
• Numerous LDAP-based client security patterns for data privacy
Direct Access to Raw Data for Specialists
• Native access to data buckets (object storage) or HDFS (file system)
• Especially useful for Machine Learning / AI programs
• Direct access to large data sets without unnecessary movement
• Identity for local access is granted at object level (e.g., a data file)
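Data redaction at the API / contract level, mentioned under the streaming API above, might look roughly like the following. This is a sketch under stated assumptions: the consumer names, field names, and the `CONTRACTS` table are invented for illustration and do not describe any Oracle or API gateway feature.

```python
from typing import Any, Dict, Set

# Hypothetical contracts: the fields each consumer may see in clear text.
CONTRACTS: Dict[str, Set[str]] = {
    "marketing": {"customer_id", "country"},
    "billing": {"customer_id", "country", "card_last4"},
}

MASK = "***"


def redact(record: Dict[str, Any], consumer: str) -> Dict[str, Any]:
    """Return a copy of the record with fields outside the contract masked."""
    allowed = CONTRACTS.get(consumer, set())  # unknown consumers see nothing
    return {k: (v if k in allowed else MASK) for k, v in record.items()}


record = {
    "customer_id": 42,
    "country": "DE",
    "card_last4": "1234",
    "email": "a@example.com",
}
print(redact(record, "marketing"))
print(redact(record, "billing"))
```

Because the masking is applied where the record leaves the serving layer, the stored data stays intact and each consumer only ever receives the view its contract allows.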
Consumers
SQL

What is Modern Infrastructure?
#4 – Infrastructure is Serverless
#5 – Data at rest is on Object Storage
OCI Compute
Ingest
Stream
Batch
Serve
Public Cloud Infrastructure is setting the standard for low cost and high performance
• Pay only for what you use; the serverless style of operation takes operational burdens away from IT consumers
• Fast compute with flat network runs 5x faster than Amazon (https://blogs.oracle.com/cloud-infrastructure/oracle-tests-better-in-performance-than-amazon-web-services)
• Very low cost storage that is practically infinite, with 99.999999% reliability
OCI Object Store
Source Data
Application Data
Polyglot Data
SQL & NoSQL Data
Consumers

What is Modern?
[Diagram: Source Data (Application Data, Polyglot Data, SQL & NoSQL Data); Raw Data Ingest (Bulk Copy Utilities); Stream Processing; Batch Processing & Long Term Storage (Pipeline Editor, Object Storage); Serving Layer (SQL); Consumers]
ANY DATA ANY LATENCY ANY FORMAT ANYWHERE

What is Modern? Oracle Cloud
[Diagram: Source Data (Application Data, Polyglot Data, SQL & NoSQL Data); Raw Data Ingest (Bulk Copy Utilities); Stream Processing; Batch Processing & Long Term Storage (Pipeline Editor, Object Storage); Serving Layer (SQL); Consumers]
[Oracle Cloud labels on the diagram: Oracle Data Pipelines, Oracle Data Integration, Oracle Big Data, Oracle Events, Oracle Database]
ANY DATA ANY LATENCY ANY FORMAT ANYWHERE

Original Architecture Stream Data Platform
https://www.confluent.io/blog/stream-data-platform-1/

Microservices-based GoldenGate data service
19 different enterprise applications
Enterprise Data Lake

✓ Fully managed, replicates >30 TB per day, low latency
✓ Real-time streaming data platform built with Oracle GoldenGate, Kafka, Flink and Kubernetes
✓ Provide shared, curated, “private” streams and stream processing computation running on eBay cloud
✓ Dynamic stream endpoint discovery
✓ Standardized data format & stream catalog
✓ Secure stream access control
✓ Data movement across security zones over secure connections
✓ Comprehensive monitoring, alerting and remediation
https://www.ebayinc.com/stories/blogs/tech/rheos/

https://medium.com/netflix-techblog/netflix-billing-migration-to-aws-part-iii-7d94ab9d1f59
On Premise Billing Application
Cloud Data Platform
“GoldenGate stood out in terms of features it offered which aligned very well with our use case.”

Enterprise Data Lake

Financial Data Warehouse
Oracle Data Integration Platform
ETL

ORACLE UNIQUE TECHNOLOGY
COLLABORATIVE, COMMON PANE OF GLASS
PUSHDOWN OPTIMIZER FOR ETL IN STREAMS, BATCH OR DB
HYBRID AGENT-BASED ARCHITECTURE
GOLDENGATE-BASED STREAMING DATA INGESTION
STREAMING ANALYTICS BEST-IN-CLASS USER EXPERIENCE

GoldenGate for Kappa / Streaming
[Diagram layers: Raw Data Layer, Apps Layer, Speed Layer, Batch Layer, Serving Layer]
[GoldenGate (GG) path: User Updates, DBMS Updates, Capture, Trail, Pump, Route, Deliver, SSL/HTTPS]
[Formats and payloads: JSON, ORC, CSV, Parquet, XML, DDL, Events, Prepared Data]
[Consumers: Application, REST APIs, Analytics Tools, Data Science, Data Marts]
eBay runs 200 billion transactions per day; more than 25 TB of changed data per day via GoldenGate and less than 2 seconds of end-to-end latency (Flink)
LinkedIn operates GoldenGate on >200 databases across 5 global data centers (Samza for processing)
Quickbooks.com runs GG on Oracle, SQL Server and DB2 hosted on AWS (GG+Kafka is the enterprise data fabric that feeds their data science/ML platform)
Apple iTunes uses GG+Kafka to ingest transactions into a 5,000+ node Data Lake (for data science)
General Motors uses GG+Kafka and GG+S3 to move transactions from 600+ databases into their Data Lake
Maersk uses GG+Kafka for real-time IoT tracking of global shipments and ports of entry (customs tracking)
MGM is shifting its entire IT data architecture to GG+Kafka for streaming
Validation by forward-leaning customers: Kappa ETL architecture
• We see GG customers using various Stream Processing engines: Spark Streaming, Flink, Kafka Streams, Apex/DataTorrent, Samza, Kinesis Firehose, Storm, etc.

GoldenGate now Includes Stream Analytics
ETL
Services
Dimensional
Data
Cubes
Ingest Database Events
Any GoldenGate event is included free; Kafka-native events require a full-use license
Select Processing Patterns
Rich set of pre-built patterns can dramatically improve developer efficiency and time-to-value
Build Event Pipelines
Tool can easily leverage geo-fencing, machine learning, and other lookup data within the data stream
Serve Data Downstream
Data can be delivered out to Kafka, databases, or easily staged for downstream ETL jobs
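Geo-fencing is one of the stream patterns named above. As a rough sketch of what such a pattern computes, the following code emits ENTER and EXIT events when a device crosses a circular fence. It is not Oracle Stream Analytics code: the event fields, the circular fence shape, and the coordinates are all assumptions chosen for the example.

```python
import math
from typing import Dict, Iterable, Iterator

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geofence_events(positions: Iterable[dict], fence_lat: float,
                    fence_lon: float, radius_km: float) -> Iterator[dict]:
    """Emit ENTER/EXIT events when a device crosses the circular fence."""
    inside: Dict[str, bool] = {}  # last known state per device
    for p in positions:
        now_inside = haversine_km(p["lat"], p["lon"], fence_lat, fence_lon) <= radius_km
        was_inside = inside.get(p["device"], False)
        if now_inside != was_inside:
            yield {"device": p["device"], "event": "ENTER" if now_inside else "EXIT"}
        inside[p["device"]] = now_inside


# Arbitrary example positions for one hypothetical device.
positions = [
    {"device": "truck-1", "lat": 37.900, "lon": -122.401},  # far outside
    {"device": "truck-1", "lat": 37.785, "lon": -122.400},  # inside the fence
    {"device": "truck-1", "lat": 37.786, "lon": -122.402},  # still inside
    {"device": "truck-1", "lat": 37.700, "lon": -122.401},  # outside again
]
crossings = list(geofence_events(positions, 37.784, -122.401, radius_km=1.0))
print(crossings)
```

The function keeps one boolean per device, so it reacts to state changes rather than to every position report; a production pattern would also need to handle late or out-of-order events.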
connect

Oracle Data Integration Platform
Common Framework for Data Integration Use Cases
Database Migrations
Database Replication
Data Warehouse Automation
Data Lake Automation
Data Governance
Oracle Cloud
Non-Oracle Data Centers
Application Data
Polyglot Data
SQL & NoSQL Data

Data Integration Platform - Built for Collaboration
Many Personas Need to Work Together for Data Integration Solutions
• DBA: builds replication/ingest pipelines; works mostly with databases (data models, SQL, PL/SQL)
• Data Engineer: builds pipelines for Data Lake; works mostly with code (Java, Scala, Python, SQL)
• ETL Developer: builds pipelines for DW/Marts; works mostly with tools (data models, SQL, mappings)
• Data Steward: manages policies & cleansing rules (for pipelines and data at rest)
• Data Analyst: domain expert, prepares data
Oracle Data Integration Platform
Oracle Databases, Non-Oracle SQL and Polyglot Data, Apps & SaaS, Logs

Data Integration Platform - Data Lake Solutions
Data Ingest: Best-in-class streaming or batch data ingestion/loading
Data Preparation: Deduplicate, Enhance, Link and Consolidate Enterprise Data
Data Catalog: Scan and inventory data from across all locations
Stream Analytics: Apply governed business rules across many sources
Data Lake Builder: Quickly create and load a policy-driven data lake
Data Pipelines: Organize, cleanse and process any data in the Lake

Agent Execution Runs from Anywhere
Oracle Data Integration Platform Cloud:
1. Design the Solution
2. Administer the Deployments
3. Meter the Subscriptions
Control Plane, Firewall (<https>), Data Plane
Support on-prem use cases! Customer data can stay in the “Data Plane” only.
Corporate data center
AWS: Amazon EC2, Amazon Redshift
Azure: Azure VM, Azure HDInsight

Machine-assisted ETL Optimizer for DW Loading
Unified Editor (DAG) to Model End-to-End Pipeline (Data Engineer, Data Analyst)
Intelligent Optimizer to Execute ETL in Most Optimal Engine
Oracle Cloud Infrastructure
1. Log Capture
2. SQL Filter
3. Bulk Unload
4. Parallel Copy
5. Spark ETL / ML
6. Flink ETL
7. Direct Replication
8. Direct Copy
9. Block Load
10. SQL ETL
Figure callouts: steps 1, 3, 4, 7, 8, 9 in one group; steps 2, 5, 6, 10 in the other
Sources and targets: Streaming Sources, Batch Sources, Files, Device, Object Store, Staging (Batch), Staging (streaming), Lookup Table, Reporting Tables

Pattern for Logical Data Zones & Topic Types
Stream Data Producers: Apps & DBs
Bulk Data Producers
Events are Pushed, Batching
Topic types (1 Topic : 1 Table): Raw Data (LCR), Schema Events (DDL), Prepared Data Topics, Master Data Topics
Data zones: Staging, Trusted, Master (with ETL steps)
Bucket 1, Bucket 2, Bucket 3
Data Consumers: Applications, Analytics, ODS (Data Store), Data Marts, Data Warehouses, Data Science
Interactive Queries, OLAP SQL
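The “1 Topic : 1 Table” pattern, with schema events (DDL) kept on their own topic, can be sketched as a simple routing function. The topic naming convention (`raw.<schema>.<table>`, `raw.schema-events`) and the record fields are hypothetical choices for this example; the slide does not prescribe them, and no Kafka client is involved.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def topic_for(record: Record, zone: str = "raw") -> str:
    """Map a change record to its topic: one topic per table, DDL kept separate."""
    if record["kind"] == "ddl":
        return f"{zone}.schema-events"
    return f"{zone}.{record['table'].lower()}"


def route(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by destination topic, preserving arrival order per topic."""
    topics: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        topics[topic_for(record)].append(record)
    return dict(topics)


records = [
    {"kind": "dml", "table": "SALES.ORDERS", "op": "I"},
    {"kind": "dml", "table": "SALES.CUSTOMERS", "op": "U"},
    {"kind": "ddl", "table": "SALES.ORDERS",
     "statement": "ALTER TABLE ORDERS ADD (NOTE VARCHAR2(100))"},
    {"kind": "dml", "table": "SALES.ORDERS", "op": "D"},
]
routed = route(records)
print(sorted(routed))
# ['raw.sales.customers', 'raw.sales.orders', 'raw.schema-events']
```

Keeping each table on its own topic preserves per-table ordering for consumers, while the separate schema-event topic lets downstream jobs react to DDL without scanning every data topic.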

Data Integration Programming – FOCUS ON DOC LINK
Presentations, Demo Kiosks and Hands-on Labs for: Oracle Enterprise Data Quality, Oracle GoldenGate, Oracle Data Integrator, Oracle Data Integration Platform Cloud, Oracle Stream Analytics, Oracle Data Catalog Cloud
Hands-on Labs:
• HOL6277 - Introduction to Data Integration Platform Cloud
• HOL6278 - Operational Data Stores, Enterprise Data Warehouses, and Data Marts in the Cloud
• HOL6282 - Faster Oracle GoldenGate Deployments in the Cloud Using Microservices
• HOL6286 - Analyzing Oracle GoldenGate Streams with Oracle Data Integration Platform Cloud
Demo Kiosks:
• The Exchange - Integration Area - Moscone South
• The Exchange - Analytics & Big Data Area - Moscone West
• The Exchange - Data Management Area - Moscone South

Data Integration Programming – FOCUS ON DOC LINK
Monday, October 22
• PRM4229 - Oracle’s Data Platform Roadmap: Oracle GoldenGate, Oracle Data Integrator, Governance
• PRO4230 - Oracle’s Data Platform in the Cloud: The Foundation for Your Data
• PRO4231 - Oracle’s Data Platform in the Cloud: Powered by Oracle GoldenGate/Oracle Stream Analytics
Tuesday, October 23
• PRO4232 - Oracle’s Data Platform in the Cloud Deep Dive
• PRM4061 - Oracle’s Data Platform: Strategy and Roadmap for Oracle Data Integrator
• PRO4233 - Actionable Business Insights with Oracle Stream Analytics
• HOL6277 - Introduction to Data Integration Platform Cloud
• HOL6278 - Operational Data Stores, Enterprise Data Warehouses, and Data Marts in the Cloud
• PRO4234 - Stream Processing Enterprise Data with Oracle GoldenGate and Oracle Stream Analytics
• PRM4235 - Oracle’s Data Platform in the Cloud: Roadmap for Oracle Enterprise Data Quality
Wednesday, October 24
• CAS4060 - Oracle’s Data Platform: Customer Panel
• HOL6282 - Faster Oracle GoldenGate Deployments in the Cloud Using Microservices
• HOL6286 - Analyzing Oracle GoldenGate Streams with Oracle Data Integration Platform Cloud
• PRM4239 - Oracle Data Platform: Strategy and Vision for Data Catalog
Thursday, October 25
• PRO4238 - Oracle’s Data Platform: Oracle GoldenGate for Big Data
• PRM4236 - Oracle’s Data Platform in the Cloud: Strategy and Roadmap for Oracle GoldenGate
• PRO4557 - Loading Application Data in a Data Warehouse and a Data Lake in Batch and Real Time
• TIP4240 - Oracle’s Data Platform: Easily Load, Manage, Govern, and Secure a Data Lake

Connect with Oracle Integration
@OracleDI
Blogs.oracle.com/DataIntegration/
@OracleIntegrate
Blogs.oracle.com/Integration/

