Dealing with Drift
Building an Enterprise Data Lake
Speakers
Nathan Swetye
Sr. Manager of Platform Engineering
Cox Automotive
Michael Gay
Lead Technical Architect
Cox Automotive
Pat Patterson
Community Champion
StreamSets
25 (and growing) companies dealing with the automotive space
Spans the full vehicle ownership lifecycle
Data perceived as the integration point for all companies
Cox Automotive
Enterprise Data DNA
Commercial Customers Across Verticals
150,000 downloads
40 of the Fortune 100
Doubling each quarter
Strong Partner Ecosystem Open Source Success
Mission: empower enterprises to harness their data in motion.
StreamSets Overview
StreamSets Data Collector™: Instrumented, open source UI and engine to build any-to-any dataflows.
StreamSets Dataflow Performance Manager (DPM™): Cloud Service to map, measure and master dataflow operations.
DATAFLOW LIFECYCLE
Developers
Scientists
Architects
StreamSets Enterprise
EVOLVE (Proactive)
REMEDIATE (Reactive)
DEVELOP OPERATE
Operators
Stewards
Architects
EFFICIENCY
Intent Driven Flows
Batch & Streaming Ingest
In-stream Sanitization
CONTROL
Fine-grained Stage & Flow Metrics
Drift Handling
Lineage and Impact Analysis Capture
AGILITY
Flexible deployment
Exception Handling
Seamless Evolution
StreamSets Data Collector is a complete IDE for building and executing any-to-any ingest pipelines.
StreamSets Data Collector
StreamSets DPM provides a single pane of glass to map, measure and master your dataflow operations.
MASTER
Availability & Accuracy
Proactive Remediation
MEASURE
Any Path
Any Time
MAP
Dataflow Lineage
Live Data Architecture
StreamSets
Dataflow Performance Manager (DPM)
Data Drift
Change is the New Normal
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
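To make structure drift concrete, here is a minimal sketch of how a pipeline stage might notice it. It is not StreamSets code: the function name `detect_structure_drift`, the field names, and the sample record are all hypothetical, and it assumes records arrive as flat dictionaries.

```python
from typing import Any


def detect_structure_drift(
    known_fields: dict[str, type], record: dict[str, Any]
) -> dict[str, list[str]]:
    """Compare one incoming record against the fields seen so far.

    Hypothetical helper: reports fields that appeared, fields that
    disappeared, and fields whose Python type no longer matches.
    """
    added = sorted(name for name in record if name not in known_fields)
    missing = sorted(name for name in known_fields if name not in record)
    retyped = sorted(
        name
        for name, value in record.items()
        if name in known_fields
        and value is not None
        and not isinstance(value, known_fields[name])
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# Assumed schema and record, for illustration only.
known = {"vin": str, "price": int}
drift = detect_structure_drift(
    known, {"vin": "1HGCM82633A004352", "price": "18500", "dealer_id": 42}
)
```

Here `drift` reports `dealer_id` as added and `price` as retyped, which is the kind of signal a pipeline could use to evolve a target schema or route the record to error handling instead of failing silently.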
SQL on Hadoop (Hive) Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
Example: Data Loss and Corrosion
Data Drift and Scale
At the micro level, data drift leads to breakage and errors
At the macro level, data drift brings your system to a grinding halt!
The Problem of Data Exchange at Scale
Everyone wants each other's data, but it is often difficult to acquire
A tangled mess of data flow
A source of anguish and sorrow
The Problem of Data Exchange at Scale
Enter the Data Lake
The central store for valuable data
Mission: Data Lake, not Data Swamp
Data Lake
Great. A Data Lake. But how do you Populate it?
Problem: $$ Cost – a Question of Scale
• 25 Companies
• 9+ Source Types, mostly DBs
• 1-Many Schemas per Database
• Many Tables per Schema
Example:
• AutoTrader -> Oracle -> ATM1: ~1600 Tables
We’ve ingested about that much
Back to Square 0
Cox Automotive’s StreamSets Architecture
Sources: Databases, Amazon S3, Files, FTP
StreamSets Acquisition
StreamSets Ingestion
Targets: Hadoop Filesystem, Big Data SQL, Amazon S3
Data Pipelines
Separates Acquisition from Ingestion
Dynamic Error Handling
Encrypted Data in Transit
Data standards applied automatically:
• Compression
• File Formats
• Partitioning Schemes
• Row-level Watermarks
• Time-stamping
Ingestion farm scales with demand
Auto-creates schemas en route
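As an illustration of what applying such standards automatically could look like, here is a minimal sketch. It is not Cox Automotive's implementation: the function names, the `_ingest_ts` and `_watermark` field names, the source name `autotrader`, the choice of a SHA-256 hash as the row-level watermark, the date-based partition layout, and gzip-compressed JSON lines as the file format are all assumptions made for the example.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone


def apply_standards(record: dict, source: str, now: datetime) -> dict:
    """Return a copy of the record with a timestamp and a row-level watermark."""
    payload = json.dumps(record, sort_keys=True)
    stamped = dict(record)
    stamped["_ingest_ts"] = now.isoformat()  # time-stamping
    # Assumed watermark: a hash tying the row to its source and content.
    stamped["_watermark"] = hashlib.sha256(
        f"{source}|{payload}".encode("utf-8")
    ).hexdigest()
    return stamped


def partition_path(source: str, now: datetime) -> str:
    """Assumed partitioning scheme: one directory per source and day."""
    return f"{source}/year={now:%Y}/month={now:%m}/day={now:%d}"


def encode_batch(records: list[dict]) -> bytes:
    """Assumed file format and compression: gzip-compressed JSON lines."""
    lines = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(lines.encode("utf-8"))


now = datetime(2016, 9, 1, 12, 0, tzinfo=timezone.utc)
row = apply_standards({"vin": "1HGCM82633A004352"}, "autotrader", now)
path = partition_path("autotrader", now)
blob = encode_batch([row])
```

Keeping these steps in one shared stage is what lets every source inherit the same conventions without each acquisition pipeline reimplementing them.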
Data comes from a variety of sources
Pipelines are established for each source
Ingestion Back Pressure
Scaling, Secure, load-balanced
Actual ingestion activities
On-premises and Cloud Big Data Systems
StreamSets RPC
LoadBalancer
Acquisition Deployment Model
Ingest Form
StreamSets Pipeline Deployment
Virtual Host Deployment
Ingestion Team Member
StreamSets Acquisition Pipeline
Enterprise Data Lake
start workflow
submit form
start workflow
build virtual host
deploy data pipeline
Enterprise Data Sources
DevOps Team Member
Throughput!
[Chart: Monthly Ingestion Requests, Jan to Sept; vertical axis 0 to 400; annotation: StreamSets 7x]
Live Environment
Where do we go from Here?
• Amazon Web Services
• StreamSets Dataflow Performance Manager
• Acquire/Ingest decision point: Centralized, Federated, or Democratized?
• Quality
• Streamline access to sources
• Change data capture
• Integration with enterprise data catalogs
• Ingestion post-processing
Questions

Editor's Notes

  • #5  StreamSets was founded in 2015 to address the pain of building and operating data movement. We were founded by former leaders at Informatica and Cloudera and have key talent with experience at big data open source vendors as well as leading-edge practitioners (Square, FCBK). They recognized that big data fundamentally broke traditional data integration systems, and that the low-level frameworks that were being used instead, like Sqoop and Flume, were brittle and opaque. We have seen tremendous open source adoption, with over 150,000 downloads in our 15 months since launch. Our solution is general purpose, and we have commercial customers across many industries using us for a broad range of projects, from data warehouse replatforming to specific applications in the areas of IoT, cybersecurity and website personalization.
  • #6 We look at dataflows as having a life-cycle. First you develop your logic and place it into operation. Over time you encounter problems that need to be remediated and you evolve your data flows to take advantage of new functionality - say Spark machine learning or support new business needs. Our StreamSets Enterprise service spans the full life-cycle through two products. First there is StreamSets Data Collector, which is open source data ingestion software. You use it to develop, test and run individual pipelines. Then for managing complex ingestion projects, we have StreamSets Dataflow Performance Manager. It acts as a single point of management across dozens or hundreds of data collectors.
  • #7 StreamSets Data Collector: open source software for building and running data pipelines that accelerates time to analysis. Efficiency: a visual IDE to easily connect sources to destinations; light on schema specification, batch and streaming, on edge and in cluster. Agility: data exception handling, data flow evolution, and data infrastructure modernization with minimal downtime. Control: built-in data cleansing plus the ability to adapt to data drift improves downstream data quality; monitor and alert on data KPIs in real time.
  • #8 Think of DPM as your comprehensive control panel: an operational console for all of your data movement. Your unit of measure here is not a pipeline but a dataflow topology that includes all of the interconnected pipelines that feed back to an application or support a business process. Of course you can also drill down to the individual pipelines within a topology. We talk about DPM in terms of 3 Ms: map, measure and master your dataflow operations. Map - you have a live and self-updating map of your dataflow topologies; you can manage releases and track changes in topologies over time. Measure - measure and establish baselines for end-to-end and point-in-flow KPIs for data availability and accuracy. Master - you can master your dataflow operations by creating Data SLAs to detect and remediate violations.
  • #10 Storing unconstrained and drifting data in Big Data stores leads to a whole slew of undetected data consistency and correctness problems. This is the ticking time bomb that most enterprises are facing. The Googles, Facebooks and LinkedIns of the world can put armies of people on this. Most enterprises cannot. Here’s a simple example of a real-world customer that Arvind saw at Cloudera. Log data is put into Hadoop in order to be analyzed with SQL. Data is coming from a number of different data centers, a few of which upgraded from IPv4 to IPv6. The manual data ingest process did not take into account the unforeseen IPv6 format for IP addresses. The end result is that the business metric (service request rate) is overstated (a false positive), causing harm to the business.
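
The IPv4-to-IPv6 example in note #10 can be sketched as follows. This is an illustration only, not the customer's pipeline or StreamSets code: the function names and sample addresses are hypothetical. It uses Python's standard `ipaddress` module to accept both address formats and to count unparseable values explicitly, so that a format change shows up as a visible error count rather than as a silently skewed metric.

```python
import ipaddress


def classify_client_ip(raw: str) -> str:
    """Return 'ipv4', 'ipv6', or 'error' for a raw log field."""
    try:
        return f"ipv{ipaddress.ip_address(raw.strip()).version}"
    except ValueError:
        # Unparseable values are surfaced instead of being guessed at.
        return "error"


def count_requests(log_ips: list[str]) -> dict[str, int]:
    """Tally requests per address family, keeping bad records visible."""
    counts = {"ipv4": 0, "ipv6": 0, "error": 0}
    for raw in log_ips:
        counts[classify_client_ip(raw)] += 1
    return counts


# Sample values from the documentation address ranges, for illustration.
counts = count_requests(["192.0.2.10", "2001:db8::1", "not-an-ip"])
```

A rising `error` count after a data center upgrade would be an early warning that the input format has drifted.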