
Dealing with Drift: Building an Enterprise Data Lake


Presented at San Diego Predictive Analytics Meetup, Dec 5, 2017

Cox Automotive comprises more than 25 companies dealing with different aspects of the car ownership lifecycle, with data as the common language they all share. The challenge for Cox was to create an efficient engine for the timely and trustworthy ingest of an unknown but large number of data assets from practically any source. Discover how their big data engineering team overcame data drift and is now populating a data lake, giving analysts easy access to data from the subsidiary companies and producing new data assets unique to the industry.

Published in: Software


  1. Dealing with Drift: Building an Enterprise Data Lake. Pat Patterson, Community Champion, @metadaddy
  2. Speaker: Pat Patterson, Community Champion, StreamSets, @metadaddy
  3. Credit is Due… Nathan Swetye, Sr. Manager of Platform Engineering, Cox Automotive; Michael Gay, Lead Technical Architect, Cox Automotive
  4. Agenda: Cox Automotive and StreamSets; Data Drift; Building the Cox Data Lake
  5. Cox Automotive: 25 (and growing) companies operating across the automotive space, spanning the full vehicle ownership lifecycle, with data seen as the integration point for all of them
  6. StreamSets Overview: enterprise data DNA; commercial customers across verticals; strong partner ecosystem; open source success. Mission: empower enterprises to harness their data in motion. 500,000+ downloads; 50+ of the Fortune 100; doubling each quarter
  7. StreamSets Enterprise: StreamSets Data Collector™, an instrumented, open source UI and engine to build any-to-any dataflows; StreamSets Dataflow Performance Manager (DPM™), a cloud service to map, measure and master dataflow operations. The dataflow lifecycle spans develop and evolve (proactive: developers, scientists, architects) and operate and remediate (reactive: operators, stewards, architects)
  8. StreamSets Data Collector (SDC) is a complete IDE for building and executing any-to-any ingest pipelines. Efficiency: intent-driven flows, batch & streaming ingest, in-stream sanitization. Control: fine-grained stage & flow metrics, drift handling, lineage and impact analysis capture. Agility: flexible deployment, exception handling, seamless evolution
  9. StreamSets Dataflow Performance Manager: DPM provides a single pane of glass to map, measure and master your dataflow operations. Map: dataflow lineage, live data architecture. Measure: any path, any time. Master: availability & accuracy, proactive remediation
  10. Data Drift: Change is the New Normal. The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data. Three kinds: structure drift, semantic drift, infrastructure drift
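The "structure drift" case above can be sketched in a few lines of code: a drift-tolerant consumer absorbs newly appearing fields and reports them, instead of breaking when a producer changes its output. This is an illustrative sketch only; the function and field names are invented, not StreamSets code.

```python
# Illustrative sketch of tolerating structure drift: when an upstream system
# adds a field, the consumer widens its known field set and reports what
# changed rather than failing. All names here are invented for illustration.

def absorb_drift(known_fields, record):
    """Return (updated field set, list of newly seen field names)."""
    new_fields = [name for name in record if name not in known_fields]
    return known_fields | set(new_fields), new_fields

known = {"vin", "price"}
known, added = absorb_drift(
    known, {"vin": "1HGCM82633A004352", "price": 8500, "color": "red"}
)
print(added)  # ['color']
```

A real pipeline would typically also route the drifted records for inspection and propagate the widened schema downstream, which is the hard part the slides are describing.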
  11. Example: Data Loss and Corrosion. SQL on Hadoop (Hive); Y/Y click-through rate. 80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
  12. Data Drift and Scale: at the micro level, data drift leads to breakage and errors; at the macro level, it brings your system to a grinding halt!
  13. The Problem of Data Exchange at Scale: everyone wants each other's data, but it is often difficult to acquire; a tangled mess of data flow; "a source of anguish and sorrow"
  14. The Solution to Data Exchange at Scale: enter the data lake, the central store for valuable data. Mission: Data Lake, not Data Swamp
  15. Great. A Data Lake. But How Do You Populate It? Problem: $$ cost, a question of scale. 25 companies; 9+ source types, mostly databases; one to many schemas per database; many tables per schema. Example: AutoTrader -> Oracle -> ATM1: ~1600 tables
  16. (Same slide, continued.) They'd ingested about that much
  17. The Initial Solution
  18. Back to Square 0
  19. Back to Square 0 (continued)
  20. Cox Automotive's StreamSets Architecture (diagram). Sources (databases, Amazon S3, files, FTP) feed a StreamSets acquisition tier; data flows over StreamSets RPC through a load balancer into a StreamSets ingestion farm, which writes to targets (Hadoop filesystem, Big Data SQL, Amazon S3) across on-premises and cloud big data systems. Key points: data comes from a variety of sources, with pipelines established for each source; acquisition is separated from ingestion; dynamic error handling; data encrypted in transit; data standards applied automatically (compression, file formats, partitioning schemes, row-level watermarks, time-stamping); the ingestion farm is secure, load-balanced, scales with demand, applies ingestion back pressure, and auto-creates schemas en route
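"Auto-creates schemas en route" could look something like the following sketch: infer column types from a sample record and emit Hive-style DDL for the target table, so a newly acquired table can land in the lake without hand-written DDL. The type mapping, table name, and function names are assumptions for illustration, not Cox's or StreamSets' actual implementation.

```python
# Hedged sketch of schema auto-creation: derive a Hive-style CREATE TABLE
# statement from a sample record. The mapping and names are hypothetical.

HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}

def ddl_for(table, record):
    """Build CREATE TABLE DDL from one sample record's field names/values."""
    cols = ",\n  ".join(
        f"`{name}` {HIVE_TYPES.get(type(value), 'STRING')}"
        for name, value in record.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)"

sample = {"vin": "1HGCM82633A004352", "price": 8500.0, "certified": True}
print(ddl_for("lake.listings", sample))
```

A production version would sample many records, reconcile conflicting types, and evolve existing tables (ALTER TABLE) when drift adds columns, rather than inferring from a single row.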
  21. Acquisition Deployment Model (diagram). An ingestion team member submits an ingest form, starting a workflow; a DevOps team member builds a virtual host and deploys the StreamSets acquisition pipeline, which moves data from enterprise data sources into the enterprise data lake
  22. Throughput! Chart of monthly ingestion requests, January through September (scale 0 to 400): a 7x increase after StreamSets was introduced
  23. Live Environment
  24. Where did they go from here? • Amazon Web Services • StreamSets Dataflow Performance Manager • Acquire/ingest decision point • Centralized, federated, or democratized? • Quality • Streamline access to sources • Change data capture • Integration with enterprise data catalogs • Ingestion post-processing
  25. Questions
  26. Thank You! Pat Patterson, Community Champion, @metadaddy