Building Data Pipelines with Cask Hydrator
Jon Gray
CEO, Cask
July 9th, 2016
PROPRIETARY & CONFIDENTIAL
Web Analytics and Reporting Use Case
The Challenge: Load web log data from S3 into a Hadoop cluster every hour for backup, run analytics on it, and enable real-time reporting of metrics such as the number of successful and failed responses and the most popular web pages.

✦ Hadoop ETL pipeline stitched together from brittle, hard-to-maintain scripts

✦ Not enough personnel with expertise across all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka)

✦ Hard to debug and validate, resulting in frequent failures in the production environment
Demo Example
Load Log Files from S3 to HDFS and perform aggregations/analysis
• Start with web access logs stored in Amazon S3
• Store the raw logs in HDFS as Avro files
• Parse the access log lines into individual fields
• Calculate the total number of requests by IP and status code
• Find the IPs that received the most successful status codes and the most error codes
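The two aggregation steps can be sketched in a few lines of plain Python. This is only an illustration of the logic, not how Hydrator executes it: the `requests` list is made-up data standing in for already-parsed log records, `top_ip` is a hypothetical helper, and treating 2xx as "successful" and 4xx/5xx as "error" is an assumption.

```python
from collections import Counter

# Hypothetical, already-parsed (ip, status) pairs; not taken from real logs.
requests = [
    ("69.181.160.120", 200),
    ("69.181.160.120", 200),
    ("69.181.160.120", 404),
    ("10.0.0.7", 500),
    ("10.0.0.7", 500),
    ("10.0.0.7", 200),
]

# Step 1: total number of requests per (IP, status code) pair.
counts = Counter(requests)

def top_ip(counts, predicate):
    """Return (ip, total) for the IP with the most requests whose
    status code satisfies predicate, or None if nothing matches."""
    per_ip = Counter()
    for (ip, status), n in counts.items():
        if predicate(status):
            per_ip[ip] += n
    return per_ip.most_common(1)[0] if per_ip else None

# Step 2: IPs with the most successful and the most error responses.
# Assumption: 2xx counts as success, 4xx and 5xx count as errors.
top_success = top_ip(counts, lambda s: 200 <= s < 300)
top_error = top_ip(counts, lambda s: s >= 400)
```

In a pipeline, the first step corresponds to a group-by aggregation and the second to a filter followed by a top-N selection.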
Sample Web access log (Combined Log Format):
69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer URI, Client Info
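The sample Combined Log Format line can be split into its fields with a regular expression. The sketch below is a minimal standalone parser, not the log parser plugin used in the demo; the group names (`ip`, `ts`, `referrer`, and so on) and the function name `parse_access_log` are illustrative choices.

```python
import re
from datetime import datetime

# Regular expression for the Combined Log Format.
# Group names are illustrative, not taken from any Hydrator plugin.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)

def parse_access_log(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED_LOG.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the typed fields; a size of "-" means no body was sent.
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

sample = (
    '69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] '
    '"GET /ajax/planStatusHistory HTTP/1.1" 200 508 '
    '"http://builds.cask.co/log" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
    'AppleWebKit Chrome/38.0.2125.122 Safari/537.36"'
)
record = parse_access_log(sample)
```

Returning `None` for lines that do not match lets a caller route malformed records to an error path instead of failing the whole run.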
INGEST: any data from any source, in real-time and batch
BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS: any data to any destination, in real-time and batch
Data Pipeline
provides the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations on it, writing it to one or more data sinks and deriving/
Stack of Data Enablers
Hydrator Studio
✦ Drag-and-drop GUI for visual data pipeline creation

✦ Rich library of pre-built sources, transforms and sinks for data ingestion and ETL use cases

✦ Separation of pipeline creation from the execution framework (MapReduce, Spark, Spark Streaming, etc.)

✦ Hadoop-native and Hadoop-distribution agnostic
Hydrator Data Pipeline
✦ Captures metadata, audit and lineage information, visualized using Cask Tracker

✦ Notifications, centralized metrics and log collection for ease of operability

✦ Simple Java API to build your own sources, transforms and sinks with complete class loading isolation

✦ SparkML-based plugins and Python transforms for data scientists
Out of the box Integrations
✦ ElasticSearch, SFTP, Cassandra, Kafka, JMS and many more sources and sinks
Custom Plugins
✦ Implement your own batch (or realtime) source, transform and sink plugins using a simple Java API
Pipeline Implementation
Diagram labels: Logical, Planner, Physical, MR/Spark Executions, CDAP

✦ Planner converts the logical pipeline to a physical execution plan

✦ Optimizes and bundles functions into one or more MR/Spark jobs

✦ CDAP is the runtime environment where all the components of the data pipeline are executed

✦ CDAP provides centralized log and metrics collection, transactions, lineage and audit information
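To make the idea of bundling logical stages into jobs concrete, here is a toy planner in Python. It is not Hydrator's planner and does not reflect its actual rules: it assumes a linear pipeline and a single made-up rule, namely that one job can hold at most one aggregation (shuffle) stage, roughly as in a MapReduce job. The stage names and the `plan` function are hypothetical.

```python
def plan(stages):
    """Group a linear list of (name, kind) stages into jobs.

    Toy rule (an assumption, not Hydrator's real logic): a job may
    contain at most one "aggregate" stage, so a second aggregate
    forces a new job to start.
    """
    jobs, current, has_shuffle = [], [], False
    for name, kind in stages:
        if kind == "aggregate" and has_shuffle:
            # The current job already has its aggregation; close it.
            jobs.append(current)
            current, has_shuffle = [], False
        current.append(name)
        if kind == "aggregate":
            has_shuffle = True
    if current:
        jobs.append(current)
    return jobs

# Hypothetical logical pipeline for the web log demo.
logical = [
    ("S3Source", "source"),
    ("LogParser", "transform"),
    ("CountByIPAndStatus", "aggregate"),
    ("TopIPs", "aggregate"),
    ("HDFSSink", "sink"),
]
physical = plan(logical)
# physical == [["S3Source", "LogParser", "CountByIPAndStatus"],
#              ["TopIPs", "HDFSSink"]]
```

The point of the separation is that the logical description stays the same whether the resulting jobs run on MapReduce or Spark; only the planning step changes.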

Pipeline Implementation
CASK DATA APPLICATION PLATFORM
Integrated Framework for Building and Running Data Applications on Hadoop
Integrates the Latest Big Data Technologies
Supports All Major Hadoop Distributions
Fully Open Source and Highly Extensible
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Abstraction and Integration Layer
Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360
Hydrator, Tracker
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360
Hydrator, Tracker
CASK DATA APP PLATFORM
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Self-Service Data Ingestion and ETL for Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible
Hydrator Roadmap
✦ Join across multiple data sources (CDAP-5588)

✦ Macro substitutions

✦ Pre-actions in pipelines, similar to post-run notifications

✦ Spark Streaming support for realtime pipelines
Thank You!
cdap-user@googlegroups.com
Twitter @CaskData
Questions?
Hydrating your Data Lake
Data Lake
Enterprise-wide data management platforms for analyzing disparate sources of data in its native format (Gartner)
Hydrator
Self-service, Hadoop-native, drag-and-drop open source framework to develop, run and operate data
Data Lake Challenges
Manual processes requiring hand-coding and reliance on command-line tools
Hard to find data and its lineage for data discovery and exploration
Coupling of ingestion and processing drives architecture decisions
Operationalizing processes for production and to maintain SLAs
Ensuring data is in canonical forms with a shared schema usable by others
Coding or filing tickets often required to perform new ingestion and processing tasks
Multiple architectures and technologies used by different teams on different clusters
Guaranteeing compliance in a system that is designed for schema-on-read and raw data
Sharing infrastructure in a multi-tenant environment without low-level QoS support
Figure labels: Data Reservoir, Data Pond, Data Lake
Data Lakes on CDAP
Hydrator framework with templates and plugins enables production workflows in minutes
Never lose data by ensuring all ingested data is tracked with metadata and lineage
Separation of ingestion and processing to support any type, format and rate
Operationalize workflows using scheduling and SLA monitoring with time / partition awareness
Using common transformations and a shared system for defining and exposing schema
Reference architecture ensures a common platform across teams, orgs, ops and security
Multi-tenant namespacing provides data and app isolation, tying together infrastructure
Ensure compliance by requiring the use of specific transformations and validation
Self-service access through Cask Hydrator for the discovery, ingest and exploration of data
Figure labels: Data Reservoir, Data Pond, Data Lake

Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data
