Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data


This talk will present how to build data pipelines with no code using the open-source, Apache 2.0, Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.

Published in: Technology

  1. Building Data Pipelines with Cask Hydrator. Jon Gray, CEO, Cask. July 9th, 2016
  2. PROPRIETARY & CONFIDENTIAL. Web Analytics and Reporting Use Case. The Challenge: load web log data from S3 into a Hadoop cluster every hour for backup, and also perform analytics and enable realtime reporting of metrics such as the number of successful and failed responses, the most popular web page, etc.
 ✦Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts
 ✦Not enough personnel with expertise in all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka)
 ✦Hard to debug and validate, resulting in frequent failures in the production environment
  3. Demo Example: Load Log Files from S3 to HDFS and perform aggregations/analysis
 •Start with web access logs stored in Amazon S3
 •Store the raw logs into HDFS Avro files
 •Parse the access log lines into individual fields
 •Calculate the total number of requests by IP and status code
 •Find the IPs that received the most successful status codes and the most error codes
 Sample web access log (Combined Log Format): - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
 Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer URI, Client Info
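The parse and aggregate steps of the demo can be sketched in plain Python. This is only an illustration of the logic, not how Hydrator implements it: the regular expression, the function names and the sample lines are assumptions, and the IP addresses are made-up values from a documentation range because the sample line in the slide has its IP removed.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "method uri protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Split one access log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_by_ip_and_status(records):
    """Total number of requests per (ip, status) pair."""
    return Counter((r["ip"], r["status"]) for r in records)


def top_ip(counts, predicate):
    """IP with the most requests whose status satisfies predicate, as (ip, count)."""
    totals = Counter()
    for (ip, status), n in counts.items():
        if predicate(status):
            totals[ip] += n
    return totals.most_common(1)[0] if totals else None


# Hypothetical sample data (IPs are placeholders, not from the talk).
SAMPLE_LINES = [
    '192.0.2.10 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "" "Mozilla/5.0"',
    '192.0.2.10 - - [08/Feb/2015:04:36:41 +0000] "GET /index.html HTTP/1.1" 200 1024 "" "Mozilla/5.0"',
    '192.0.2.20 - - [08/Feb/2015:04:36:42 +0000] "GET /missing HTTP/1.1" 404 0 "" "Mozilla/5.0"',
    '192.0.2.20 - - [08/Feb/2015:04:36:43 +0000] "POST /login HTTP/1.1" 500 12 "" "Mozilla/5.0"',
    '192.0.2.20 - - [08/Feb/2015:04:36:44 +0000] "GET /index.html HTTP/1.1" 200 1024 "" "Mozilla/5.0"',
]

records = [r for r in map(parse_line, SAMPLE_LINES) if r is not None]
counts = count_by_ip_and_status(records)
most_successful = top_ip(counts, lambda status: 200 <= status < 300)
most_errors = top_ip(counts, lambda status: status >= 400)
```

Treating 2xx as success and 4xx/5xx as errors is a choice made for this sketch; the slide does not define the boundary.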
  4. ✦INGEST any data from any source in real-time and batch
 ✦BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop
 ✦EGRESS any data to any destination in real-time and batch
 A Data Pipeline provides the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations on it, writing it to one or more data sinks and deriving/
  5. Stack of Data Enablers
  6. Hydrator Studio
 ✦Drag-and-drop GUI for visual Data Pipeline creation
 ✦Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases
 ✦Separation of pipeline creation from execution framework - MapReduce, Spark, Spark Streaming etc.
 ✦Hadoop-native and Hadoop Distro agnostic
  7. Hydrator Data Pipeline
 ✦ Captures metadata, audit and lineage info, visualized using Cask Tracker
 ✦ Notification, centralized metrics and log collection for ease of operability
 ✦ Simple Java API to build your own sources, transforms and sinks with complete class loading isolation
 ✦ SparkML based plugins, Python transforms for data scientists
  8. Out of the box Integrations
 ✦ Elasticsearch, SFTP, Cassandra, Kafka, JMS and many more sources and sinks
  9. Custom Plugins
 ✦ Implement your own batch (or realtime) source, transform and sink plugins using a simple Java API
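The Java plugin API itself is not shown in this transcript, so the sketch below only illustrates the three plugin roles (source, transform, sink) and how they chain together. It is written in Python for brevity, and every name in it (`list_source`, `error_filter`, `ListSink`, `run_pipeline`) is hypothetical and not part of the CDAP or Hydrator API.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Transform = Callable[[Iterator[Record]], Iterator[Record]]


def list_source(rows: Iterable[Record]) -> Iterator[Record]:
    """Source role: emit records one at a time (here from an in-memory list)."""
    for row in rows:
        yield dict(row)


def error_filter(records: Iterator[Record]) -> Iterator[Record]:
    """Transform role: keep only records whose HTTP status is 400 or above."""
    for record in records:
        if int(record["status"]) >= 400:
            yield record


class ListSink:
    """Sink role: collect records in memory instead of writing to a real store."""

    def __init__(self) -> None:
        self.written: List[Record] = []

    def write(self, records: Iterator[Record]) -> None:
        self.written.extend(records)


def run_pipeline(source: Iterator[Record], transforms: List[Transform], sink: ListSink) -> None:
    """Chain the source through each transform in order, then hand the result to the sink."""
    stream = source
    for transform in transforms:
        stream = transform(stream)
    sink.write(stream)


# Hypothetical input records (placeholder IPs).
sink = ListSink()
run_pipeline(
    list_source([
        {"ip": "192.0.2.10", "status": 200},
        {"ip": "192.0.2.20", "status": 500},
    ]),
    [error_filter],
    sink,
)
```

Because each stage only consumes and produces an iterator of records, a new transform can be dropped into the list without changing the source or the sink, which is the property a plugin model relies on.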
  10. Pipeline Implementation (diagram labels: Logical, Physical, MR/Spark Executions, Planner, CDAP)
 ✦ Planner converts logical pipeline to a physical execution plan
 ✦ Optimizes and bundles functions into one or more MR/Spark jobs
 ✦ CDAP is the runtime environment where all the components of the data pipeline are executed
 ✦ CDAP provides centralized log and metrics collection, transaction, lineage and audit information
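To make the idea of bundling logical stages into jobs concrete, here is a toy planner in Python. It is not Hydrator's planning algorithm, which the slides do not describe. The rule it applies (at most one aggregator per job, so a second aggregator starts a new job) is an assumption chosen for illustration, and the stage names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    kind: str  # "source", "transform", "aggregator" or "sink"


def plan_jobs(stages: List[Stage]) -> List[List[str]]:
    """Group an ordered list of logical stages into jobs.

    Toy rule (an assumption, not Hydrator's): each job may contain at most
    one aggregator, so a second aggregator closes the current job.
    """
    jobs: List[List[str]] = []
    current: List[str] = []
    has_aggregator = False
    for stage in stages:
        if stage.kind == "aggregator" and has_aggregator:
            jobs.append(current)
            current, has_aggregator = [], False
        current.append(stage.name)
        has_aggregator = has_aggregator or stage.kind == "aggregator"
    if current:
        jobs.append(current)
    return jobs


# Hypothetical logical pipeline loosely based on the demo use case.
LOGICAL_PIPELINE = [
    Stage("S3Source", "source"),
    Stage("LogParser", "transform"),
    Stage("CountByIpAndStatus", "aggregator"),
    Stage("TopIps", "aggregator"),
    Stage("HdfsSink", "sink"),
]

physical_plan = plan_jobs(LOGICAL_PIPELINE)
# Two jobs: the first ends with CountByIpAndStatus, the second starts at TopIps.
```

The point of the sketch is the separation the slide describes: the pipeline author only lists logical stages, and a planner decides how many jobs are needed to run them.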

  11. Pipeline Implementation
  12. CASK DATA APPLICATION PLATFORM: Integrated Framework for Building and Running Data Applications on Hadoop
 ✦Integrates the Latest Big Data Technologies
 ✦Supports All Major Hadoop Distributions
 ✦Fully Open Source and Highly Extensible
  13. Hadoop ecosystem, 50 different projects; Top 6 Hadoop distributions
  14. Abstraction and Integration Layer: Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360; Hydrator, Tracker; Hadoop ecosystem, 50 different projects; Top 6 Hadoop distributions
  15. Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360; Hydrator, Tracker; CASK DATA APP PLATFORM; Hadoop ecosystem, 50 different projects; Top 6 Hadoop distributions
  16. Self-Service Data Ingestion and ETL for Data Lakes
 ✦Built for Production on CDAP
 ✦Rich Drag-and-Drop User Interface
 ✦Open Source & Highly Extensible
  17. Hydrator Roadmap
 ✦ Join across multiple data sources (CDAP-5588)
 ✦ Macro substitutions
 ✦ Pre-actions in pipelines, similar to post-run notifications
 ✦ Spark Streaming support for realtime pipelines
  18. Thank You! Questions? Twitter: @CaskData
  19. Hydrating your Data Lake
 ✦Data Lake: Enterprise-wide data management platforms for analyzing disparate sources of data in its native format (Gartner)
 ✦Hydrator: Self-service, Hadoop-native, drag-and-drop open source framework to develop, run and operate data
  20. Data Lake Challenges (diagram labels: Data Reservoir, Data Pond, Data Lake)
 ✦Manual processes requiring hand-coding and reliance on command-line tools
 ✦Hard to find data and its lineage for data discovery and exploration
 ✦Coupling of ingestion and processing drives architecture decisions
 ✦Operationalizing processes for production and to maintain SLAs
 ✦Ensuring data is in canonical forms with a shared schema usable by others
 ✦Coding or filing tickets often required to perform new ingestion and processing tasks
 ✦Multiple architectures and technologies used by different teams on different clusters
 ✦Guaranteeing compliance in a system that is designed for schema-on-read and raw data
 ✦Sharing infrastructure in a multi-tenant environment without low-level QoS support
  21. Data Lakes on CDAP (diagram labels: Data Reservoir, Data Pond, Data Lake)
 ✦Hydrator framework with templates and plugins enables production workflows in minutes
 ✦Never lose data by ensuring all ingested data is tracked with metadata and lineage
 ✦Separation of ingestion and processing to support any type, format and rate
 ✦Operationalize workflows using scheduling and SLA monitoring with time / partition awareness
 ✦Using common transformations and a shared system for defining and exposing schema
 ✦Reference architecture ensures a common platform across teams, orgs, ops and security
 ✦Multi-tenant namespacing provides data and app isolation, tying together infrastructure
 ✦Ensure compliance by requiring the use of specific transformations and validation
 ✦Self-service access through Cask Hydrator for the discovery, ingest and exploration of data