Dealing With Drift - Building an Enterprise Data Lake

Speakers
Nathan Swetye, Sr. Manager of Platform Engineering, Cox Automotive
Michael Gay, Lead Technical Architect, Cox Automotive
Pat Patterson, Community Champion, StreamSets

Cox Automotive
25 (and growing) companies dealing with the automotive space
Spans the full vehicle ownership lifecycle
Data perceived as the integration point for all companies

Enterprise Data DNA
Commercial Customers Across Verticals
150,000 downloads
40 of the Fortune 100
Doubling each quarter
Strong Partner Ecosystem
Open Source Success
Mission: empower enterprises to harness their data in motion.
StreamSets Overview

StreamSets Data Collector™
Instrumented, open source UI and engine to build any-to-any dataflows.
StreamSets Dataflow Performance Manager (DPM™)
Cloud Service to map, measure and master dataflow operations.
DATAFLOW LIFECYCLE
StreamSets Enterprise
DEVELOP: Developers, Scientists, Architects
OPERATE: Operators, Stewards, Architects
EVOLVE (Proactive)
REMEDIATE (Reactive)

StreamSets Data Collector
StreamSets Data Collector is a complete IDE for building and executing any-to-any ingest pipelines.
EFFICIENCY
Intent Driven Flows
Batch & Streaming Ingest
In-stream Sanitization
CONTROL
Fine-grained Stage & Flow Metrics
Drift Handling
Lineage and Impact Analysis Capture
AGILITY
Flexible deployment
Exception Handling
Seamless Evolution

StreamSets Dataflow Performance Manager (DPM)
StreamSets DPM provides a single pane of glass to map, measure and master your dataflow operations.
MAP
Dataflow Lineage
Live Data Architecture
MEASURE
Any Path
Any Time
MASTER
Availability & Accuracy
Proactive Remediation

Data Drift
Change is the New Normal
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
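As a rough illustration of structure drift only, the sketch below compares an incoming record with the fields a pipeline expects. It is not StreamSets code: the function name, the schema representation (field name mapped to a Python type) and the sample record are all assumptions made for this example.

```python
def detect_structure_drift(expected, record):
    """Compare one record against an expected schema.

    `expected` maps field names to Python types (a hypothetical
    representation chosen for this sketch).
    """
    expected_fields = set(expected)
    record_fields = set(record)
    # Fields the source started sending that the pipeline does not know about.
    added = sorted(record_fields - expected_fields)
    # Fields the pipeline expects that the source stopped sending.
    missing = sorted(expected_fields - record_fields)
    # Fields present on both sides whose value type no longer matches.
    retyped = sorted(
        name
        for name in expected_fields & record_fields
        if record[name] is not None and not isinstance(record[name], expected[name])
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# Hypothetical schema and record, for illustration only.
expected_schema = {"vin": str, "price": float, "listed": str}
incoming = {"vin": "VIN123", "price": "19999", "dealer_id": 42}
report = detect_structure_drift(expected_schema, incoming)
```

Here `report` flags `dealer_id` as added, `listed` as missing and `price` as retyped. Semantic and infrastructure drift are not covered by a check like this.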

Example: Data Loss and Corrosion
Chart: SQL on Hadoop (Hive), Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis

Data Drift and Scale
At the micro level, data drift leads to breakage and errors
At the macro level, data drift brings your system to a grinding halt!

The Problem of Data Exchange at Scale
Everyone wants each other’s data, but it is often difficult to acquire
A tangled mess of data flow
A source of anguish and sorrow

The Problem of Data Exchange at Scale
Enter the Data Lake
The central store for valuable data
Mission: Data Lake, not Data Swamp

Great. A Data Lake. But how do you Populate it?
Problem: $$ Cost – a Question of Scale
• 25 Companies
• 9+ Source Types, mostly DBs
• 1-Many Schemas per Database
• Many Tables per Schema
Example:
• AutoTrader -> Oracle -> ATM1: ~1600 Tables

We’ve ingested about that much

Cox Automotive’s StreamSets Architecture
Sources: Databases, Amazon S3, Files, FTP (data comes from a variety of sources)
StreamSets Acquisition: pipelines are established for each source
StreamSets RPC and Load Balancer: scaling, secure, load-balanced; ingestion back pressure
StreamSets Ingestion: actual ingestion activities
Targets: Hadoop Filesystem, Big Data SQL, Amazon S3 (on-premises and cloud big data systems)
Data Pipelines
Separates Acquisition from Ingestion
Dynamic Error Handling
Encrypted Data in Transit
Data standards applied automatically:
• Compression
• File Formats
• Partitioning Schemes
• Row-level Watermarks
• Time-stamping
Ingestion farm scales with demand
Auto-creates schemas en route
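To make the data standards on this slide more concrete, here is a minimal sketch of time-stamping, a row-level watermark, a date-based partitioning scheme and compression applied to a batch of records. It is not taken from the Cox Automotive pipelines: the field names `_ingested_at` and `_watermark`, the choice of a SHA-256 hash as the watermark, the `year=/month=/day=` path layout and gzip-compressed JSON lines are all assumptions for illustration.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone


def apply_standards(record, source, ingested_at):
    """Return a copy of the record with a time stamp and a row-level watermark."""
    stamped = dict(record)
    stamped["_ingested_at"] = ingested_at.isoformat()
    # Assumed watermark: a hash of the source name and the original row content.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    stamped["_watermark"] = hashlib.sha256(
        source.encode("utf-8") + b"|" + payload
    ).hexdigest()
    return stamped


def partition_path(source, ingested_at):
    """Build a date-based partition path (a hypothetical layout)."""
    return f"{source}/year={ingested_at:%Y}/month={ingested_at:%m}/day={ingested_at:%d}"


def write_batch(records):
    """Serialize records as JSON lines and compress them with gzip."""
    lines = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(lines.encode("utf-8"))


# Hypothetical source name and row, for illustration only.
ts = datetime(2016, 9, 1, tzinfo=timezone.utc)
row = apply_standards({"vin": "VIN123"}, "atm1.vehicles", ts)
blob = write_batch([row])
path = partition_path("atm1.vehicles", ts)
```

Applying such rules in one shared step, rather than per source, is one way to read "applied automatically" on the slide; the actual mechanism is not described in the deck.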

Acquisition Deployment Model
Participants: Ingestion Team Member, DevOps Team Member
Components: Ingest Form, Virtual Host Deployment, StreamSets Pipeline Deployment, StreamSets Acquisition Pipeline
Workflow steps: submit form, start workflow, build virtual host, start workflow, deploy data pipeline
Data path: Enterprise Data Sources to Enterprise Data Lake
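The workflow labels on this slide (submit form, start workflow, build virtual host, deploy data pipeline) can be read as a simple sequence. The sketch below models that sequence in plain Python. Every class, function and naming pattern in it is hypothetical; the deck does not say which tooling drives the real workflow, and the ordering shown is an assumption drawn from the diagram labels.

```python
from dataclasses import dataclass, field


@dataclass
class IngestRequest:
    """What an ingest form might capture (hypothetical fields)."""
    source_name: str
    source_type: str
    steps_done: list = field(default_factory=list)


def build_virtual_host(request):
    """Stand-in for provisioning a host; only records the step and returns a name."""
    request.steps_done.append("build virtual host")
    return f"acq-{request.source_name}"


def deploy_pipeline(request, host):
    """Stand-in for deploying an acquisition pipeline onto the host."""
    request.steps_done.append("deploy data pipeline")
    return f"{host}:{request.source_type}-acquisition"


def run_workflow(request):
    """Run the steps in the order assumed from the slide."""
    request.steps_done.append("submit form")
    request.steps_done.append("start workflow")
    host = build_virtual_host(request)
    return deploy_pipeline(request, host)


# Hypothetical request, for illustration only.
request = IngestRequest(source_name="atm1", source_type="oracle")
pipeline_id = run_workflow(request)
```

Keeping each step as its own function mirrors the separation between host deployment and pipeline deployment shown on the slide.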

Throughput!
Chart: Monthly Ingestion Requests, January to September (y-axis 0 to 400)
Annotations: StreamSets, 7x

Where Do We Go from Here?
• Amazon Web Services
• StreamSets Dataflow Performance Manager
• Acquire/Ingest decision point: Centralized, Federated, or Democratized?
• Quality
• Streamline access to sources
• Change data capture
• Integration with enterprise data catalogs
• Ingestion post-processing
