Building Data Pipelines with Cask Hydrator
Gokul Gunasekaran
Software Engineer, Cask Data
June 15, 2016
Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
cask.co
INGEST: any data from any source, in real-time and batch
BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS: any data to any destination, in real-time and batch
Data Pipeline
Provides the ability to automate complex workflows that involve fetching data, performing non-trivial transformations, and deriving and serving insights from the data
Web Analytics and Reporting Use Case
Fetch web access logs from S3 every hour, load them into a Hadoop cluster for backup, run analytics, and enable real-time reporting of the number of successful and failed responses and of client browser info.
Challenge:
✦ Hadoop ETL pipeline(s) stitched together using hard-to-maintain, brittle scripts
✦ Few developers with expertise in Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka)
✦ Hard to debug and validate, resulting in frequent failures in the production environment
Demo
Load Log Files from S3 into HDFS and perform aggregations/analysis
• Start with web access logs stored in Amazon S3
• Store the raw logs in HDFS as Avro files
• Parse the access log lines into individual fields
• Find out distribution of status codes
• Find out the most commonly used client browser
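The last two steps of the demo amount to group-by counts. A minimal sketch in plain Python, not Hydrator code; the record layout (`status` and `browser` keys) and the sample values are assumptions made for illustration:

```python
from collections import Counter

# Hypothetical parsed records; in the demo these would come from the log-parsing step.
records = [
    {"status": 200, "browser": "Chrome"},
    {"status": 200, "browser": "Firefox"},
    {"status": 404, "browser": "Chrome"},
    {"status": 500, "browser": "Chrome"},
]

def status_distribution(rows):
    """Count how many responses were seen for each HTTP status code."""
    return Counter(row["status"] for row in rows)

def most_common_browser(rows):
    """Return the browser name that appears most often, or None for empty input."""
    counts = Counter(row["browser"] for row in rows)
    return counts.most_common(1)[0][0] if counts else None

print(status_distribution(records))
print(most_common_browser(records))
```

In a pipeline, each of these would typically be its own aggregation stage fed by the same parsed records.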
S3 Input
69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer URI, Client Browser
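Splitting such a line into its fields can be sketched with a regular expression. This is an illustrative plain-Python parser, not the Hydrator log parser plugin; the pattern and the field names (`ip`, `referrer`, `user_agent`, and so on) are assumptions based on the sample line above:

```python
import re

# Assumed pattern for the access log line shown on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = (
    '69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] '
    '"GET /ajax/planStatusHistory HTTP/1.1" 200 508 '
    '"http://builds.cask.co/log" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit '
    'Chrome/38.0.2125.122 Safari/537.36"'
)
parsed = parse_line(sample)
print(parsed["status"], parsed["uri"])
```

Returning `None` for lines that do not match lets a caller count or route bad records instead of failing the whole run.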
Hydrator Studio
✦ Drag-and-drop GUI for visual Data Pipeline creation
✦ Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases
✦ Separation of pipeline creation from execution framework - MapReduce, Spark, Spark Streaming etc.
✦ Hadoop-native and Hadoop Distro agnostic
Hydrator Data Pipeline
✦ Captures metadata, audit, and lineage info, which can be visualized using Cask Tracker
✦ Post-run notifications, centralized metrics and log collection for ease of operability
✦ Simple Java API to build your own sources, transforms, and sinks with class loader isolation
✦ SparkML-based plugins, Python transforms for data scientists
Out of the box Integrations
✦ ElasticSearch, Cassandra, Kafka, SFTP, JMS and many more sources and sinks
✦ De-duplicate, Group By Aggregation, Row Denormalizer and other transforms
Custom Plugins
✦ Implement your own batch (or realtime) source, transform, sink plugins using simple Java API
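The idea of pluggable stages can be sketched as follows. Hydrator's actual plugin API is Java and is not shown in these slides; the class names, the `transform` method, and `run_pipeline` below are hypothetical and only illustrate the source, transform, sink contract in plain Python:

```python
from abc import ABC, abstractmethod

class Transform(ABC):
    """Hypothetical transform contract: one record in, zero or more records out."""

    @abstractmethod
    def transform(self, record):
        ...

class DropMissingStatus(Transform):
    """Filter out records that carry no HTTP status."""

    def transform(self, record):
        if "status" in record:
            yield record

class StatusClassifier(Transform):
    """Tag each record as a success (status below 400) or a failure."""

    def transform(self, record):
        outcome = "success" if record["status"] < 400 else "failure"
        yield {**record, "outcome": outcome}

def run_pipeline(source, transforms, sink):
    """Push every source record through the transforms in order, then into the sink."""
    for record in source:
        batch = [record]
        for stage in transforms:
            batch = [out for rec in batch for out in stage.transform(rec)]
        sink.extend(batch)
    return sink

result = run_pipeline(
    [{"status": 200}, {"uri": "/x"}, {"status": 503}],
    [DropMissingStatus(), StatusClassifier()],
    [],
)
print(result)
```

Letting a transform emit zero or more records per input covers both filtering and fan-out with a single method.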
[Diagram: use cases (Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360); Hydrator and Tracker; Cask Data App Platform; Hadoop ecosystem (50 different projects); top 6 Hadoop distributions]
Pipeline Implementation
[Diagram: Logical Pipeline → Planner → Physical Workflow → MR/Spark Executions, all running on CDAP]
✦ Planner converts the logical pipeline to a physical execution plan
✦ Optimizes and bundles functions into one or more MR/Spark jobs
✦ CDAP is the runtime environment where all the components of the data pipeline are executed
✦ CDAP provides centralized log and metrics collection, transactions, lineage and audit information
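Bundling stages into jobs can be sketched as follows. The slides do not state the planner's rules, so the rule used here (a job holds at most one aggregation stage, since an aggregation needs a shuffle) is an assumption, and the stage names and `plan` function are hypothetical:

```python
def plan(stages):
    """Group an ordered list of (name, kind) stages into jobs.

    Assumed rule: a job may contain at most one 'aggregate' stage, so a
    second aggregation starts a new job.
    """
    jobs, current, has_aggregate = [], [], False
    for name, kind in stages:
        if kind == "aggregate" and has_aggregate:
            jobs.append(current)
            current, has_aggregate = [], False
        current.append(name)
        has_aggregate = has_aggregate or kind == "aggregate"
    if current:
        jobs.append(current)
    return jobs

# Hypothetical linear logical pipeline.
logical = [
    ("S3", "source"),
    ("LogParser", "transform"),
    ("GroupByStatus", "aggregate"),
    ("Deduplicate", "aggregate"),
    ("HDFS", "sink"),
]
print(plan(logical))
```

A real planner also has to handle branching pipelines; this sketch covers only a linear chain.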

Upcoming capabilities
✦ Join across multiple data sources (CDAP-5588)
✦ Pipeline preview
✦ Macro substitutions
✦ Pre-Actions in pipelines similar to post run notifications
✦ Spark streaming support for Realtime pipelines
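Macro substitution can be pictured as filling placeholders in a pipeline's configuration from runtime arguments. The slides give no syntax, so the `${name}` placeholder form, the `substitute` function, and the sample configuration below are assumptions for illustration only:

```python
import re

# Assumed placeholder syntax: ${name}
MACRO = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.-]*)\}")

def substitute(config, arguments):
    """Return a copy of config with ${name} placeholders in string values replaced."""
    def replace(match):
        key = match.group(1)
        if key not in arguments:
            raise KeyError(f"no runtime argument for macro '{key}'")
        return str(arguments[key])

    return {
        name: MACRO.sub(replace, value) if isinstance(value, str) else value
        for name, value in config.items()
    }

# Hypothetical stage configuration and runtime arguments.
config = {"path": "s3n://${bucket}/logs/${date}", "retries": 3}
resolved = substitute(config, {"bucket": "weblogs", "date": "2016-06-15"})
print(resolved["path"])
```

Failing on an unknown macro, rather than leaving the placeholder in place, surfaces a missing argument before any data is read.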
Thank You!
cdap-user@googlegroups.com

@CaskData



github.com/caskdata/cdap
github.com/caskdata/hydrator-plugins

Questions?
Self-Service Data Ingestion and ETL for Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible