Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Big Data Day 2016
July 9, 2016
Los Angeles

Link to video recording: https://www.youtube.com/watch?v=13qi5UIObCs

Abstract:

Efficiently creating and managing an enterprise Data Lake typically requires substantial effort to ingest, process, store, secure, and manage data from a variety of sources. Hydrator is an open source framework and self-service user interface for creating data lakes that simplifies building and managing production data pipelines on Spark, MapReduce, Spark Streaming and Tigon.

The goal of this talk is to demonstrate broad, self-service access to Hadoop while maintaining the controls and monitoring necessary within the enterprise. Hydrator provides these abilities to the enterprise and to all of the end users who program, access, and manage enterprise data.
Some of the features that will be demonstrated:
* Supports ingestion, ETL, aggregations and machine learning, in both real-time and batch. Supports major distros and cloud providers. Built to allow enterprises to enable self-service while maintaining enterprise requirements for security and governance.

The Hydrator open source ecosystem contains an extensive library of plugins to enable batch and real-time ingestion from traditional and modern databases, cloud services and other common data sources. There are dozens of community plugins for machine learning and analytics as well as pre-built pipelines for common end-to-end use cases.

* Drag-and-drop user interface where you build data ingestion and data processing pipelines from included, community and custom-built plugins as well as custom MapReduce and Spark jobs. Pipelines and plugins support versioning and are configured with JSON.

* Operate pipelines with a management interface. Schedule and monitor pipelines through the UI or REST APIs. Powerful metadata capabilities: automatically captures complete audit and lineage information. Integrates with security and MDM systems.

* Customize and limit access to data sources, sinks and any other plugins to provide simplified and controlled usage by non-technical users.

Published in: Technology

Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016

  1. Hydrator: Code-free Data Pipelines for Hadoop, Spark, and HBase. Jonathan Gray, CEO @ Cask. Big Data Day LA, July 9th, 2016. Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
  2. About Me
  3. Hadoop Enables New Apps and Patterns.
     ENTERPRISE DATA LAKES: Batch and Realtime Data Ingestion (any type of data from any type of source in any volume); Batch and Streaming ETL (code-free self-service creation and management of pipelines); SQL Exploration and Data Science (all data is automatically accessible via SQL and client SDKs); Data as a Service (easily expose generic or custom REST APIs on any data).
     BIG DATA ANALYTICS: 360° Customer View (integrate data from any source and expose through queries and APIs); Realtime Dashboards (perform realtime OLAP aggregations and serve them through REST APIs); Time Series Analysis (store, process and serve massive volumes of time-series data); Realtime Log Analytics (ingestion and processing of high-throughput streaming log events).
     PRODUCTION DATA APPS: Recommendation Engines (build models in batch using historical data and serve them in realtime); Anomaly Detection Systems (process streaming events and predictably compare them in realtime to historical data); NRT Event Monitoring (reliably monitor large streams of data and perform defined actions within a specified time); Internet of Things (ingestion, storage and processing of events that is highly-available, scalable and consistent).
  4. Web Analytics and Reporting Use Case. Transform web log data from S3 every hour to a Hadoop cluster for backup, as well as perform analytics and enable realtime reporting of metrics such as the number of successful/failed responses, most popular pages, etc. The Challenges:
 ✦ Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts
 ✦ Not enough people with expertise in all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka), or a general lack of expertise
 ✦ Hard to debug and validate, resulting in frequent failures in the production environment
 ✦ Difficult to integrate into SQL / BI reporting solutions for business users
 ✦ As use cases advance into Data Science, Machine Learning, and Predictive Analytics, you need to include scientists and advanced ML programmers
  5. The Many Faces of Hadoop.
     Developer: Advanced Programming, Focused on App Logic.
     Data Scientist: Basic Dev & Complex Analytics, Focused on Data & Algorithms.
     IT Pro / Ops: Configuring & Monitoring, Focused on Infrastructure & SLAs.
     LOB / Product: Decision Making & Driving Revenue, Focused on Apps & Insights.
     Challenge: The tools are missing to connect these users and take apps from prototype to production.
  6. Enter Cask. Key Customers and Partners. Named a Gartner Cool Vendor 2016. Founded in 2011 by early Hadoop engineers from Facebook and Yahoo!
  7. Introducing the Data Application Platform. Deployment Models: On-premises, Hybrid, Cloud. Governance; Operations; Pre-packaged Integrations; Orchestration/Automation/Workflows; Core Application and Data Integration; Role-based User Experience (Developer, Data Scientist, IT/Ops).
  8. Introducing the Cask Data App Platform. Open Source, Integrated Framework for Building and Running Data Applications on Hadoop and Spark. • Supports all major Hadoop distros • Integrates the latest Big Data technologies • 100% open source and highly extensible
  9. What’s in CDAP? A self-service, re-configurable, code-free framework to build, run and operate real-time or batch data pipelines in the cloud or on-premises. A self-service tool for tracking the flow of data in and out of the Data Lake: track, index and search technical, business and operational metadata of applications and pipelines. An integration platform that integrates and abstracts the underlying Hadoop technologies, on which you build data analytics solutions in the cloud or on-premises. The platform is powerful and versatile enough for you to build, publish and manage operational self-service analytics applications (your apps).
  10. A self-service, code-free framework to build, run and operate data pipelines on Apache Hadoop and Spark. Built for Production on CDAP. Rich Drag-and-Drop User Interface. Open Source & Highly Extensible.
  11. INGEST any data from any source in real-time and batch. BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop. EGRESS any data to any destination in real-time and batch. Hydrator Data Pipelines provide the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations and aggregations on the data, writing it to one or more data sinks, and making it available for applications and analytics.
  12. Stack of Data Enablers
  13. Hydrator Studio ✦ Drag-and-drop GUI for visual Data Pipeline creation
 ✦ Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases
 ✦ Separation of pipeline creation from execution framework - MapReduce, Spark, Spark Streaming etc.
 ✦ Hadoop-native and Hadoop Distro agnostic
  14. Hydrator Data Pipeline ✦ Captures Metadata, Audit, Lineage info, discovered and visualized using Cask Tracker
 ✦ Notifications, scheduling, and monitoring with centralized metrics and log collection for ease of operability
 ✦ Simple Java API to build your own sources, transforms, and sinks with class loading isolation
 ✦ JavaScript and Python transforms
 ✦ Include arbitrary Spark jobs
  15. Out of the Box Integrations ✦ Elastic, SFTP, Cassandra, Kafka, RDBMS, EDW and many more sources and sinks ✦ Parse/Encode/Hash, Distinct/Group By, Custom JavaScript/Python Transforms
  16. Custom Plugins ✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API (see the sketch below)
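For a concrete sense of what that Java API involves, the sketch below shows a minimal batch transform plugin. It assumes the CDAP 3.x ETL API (co.cask.cdap.etl.api.Transform, Emitter, StructuredRecord); the plugin name and the "body" field it upper-cases are hypothetical, chosen only for illustration, and packaging/deployment details are covered in the CDAP plugin documentation.

    import co.cask.cdap.api.annotation.Description;
    import co.cask.cdap.api.annotation.Name;
    import co.cask.cdap.api.annotation.Plugin;
    import co.cask.cdap.api.data.format.StructuredRecord;
    import co.cask.cdap.api.data.schema.Schema;
    import co.cask.cdap.etl.api.Emitter;
    import co.cask.cdap.etl.api.Transform;

    // Minimal sketch of a custom Hydrator transform plugin (assumes the CDAP 3.x ETL API).
    // It upper-cases the hypothetical "body" field and passes all other fields through unchanged.
    @Plugin(type = Transform.PLUGIN_TYPE)
    @Name("UpperCaseBody")
    @Description("Upper-cases the 'body' field of each record.")
    public class UpperCaseTransform extends Transform<StructuredRecord, StructuredRecord> {

      @Override
      public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter) throws Exception {
        StructuredRecord.Builder builder = StructuredRecord.builder(input.getSchema());
        for (Schema.Field field : input.getSchema().getFields()) {
          Object value = input.get(field.getName());
          if ("body".equals(field.getName()) && value instanceof String) {
            builder.set(field.getName(), ((String) value).toUpperCase());
          } else {
            builder.set(field.getName(), value);
          }
        }
        emitter.emit(builder.build());
      }
    }

Once packaged and deployed as a plugin artifact, a transform like this would appear in the Hydrator Studio palette alongside the built-in plugins.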
  17. Pipeline Implementation (Logical Pipeline → Planner → Physical Workflow → MR/Spark Executions, running on CDAP) ✦ Planner converts logical pipeline to a physical execution plan
 ✦ Optimizes and bundles functions into one or more MR/Spark jobs
 ✦ CDAP is the runtime environment where all the components of the data pipeline are executed
 ✦ CDAP provides centralized log and metrics collection, transaction, lineage and audit information
  18. Pipeline Implementation
  19. 3.5 - Latest Features
     Security — Authentication and Authorization: Support for fine-grained role-based authorization of entities in CDAP; integration with Sentry and Ranger
     Security — Impersonation and Encryption: Ability to run CDAP and CDAP Apps as specified users, and ability to encrypt/decrypt sensitive configuration
     Tracker — Data Usage Analytics: Learn how datasets are being used and the top applications accessing them
     Metadata Taxonomy: Support for annotating business metadata based on a business-specified taxonomy
     Hydrator — Spark Streaming: Build and run Hydrator real-time pipelines using Spark Streaming
     Hydrator — Preview Mode: Ability to preview pipelines with real or injected data before deploying (Standalone)
     Hydrator — Join & Action: Capability to join multiple streams (inner & outer) and ability to configure actions, allowing one to run binaries on designated nodes
     Hydrator — Plugins: Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML
  20. Hydrator Roadmap ✦ Join across multiple data sources (CDAP-5588)
 ✦ Live Debug/Preview of pipelines in build mode
 ✦ Macro substitutions for configuration/properties
 ✦ Custom Actions anywhere in pipeline
 ✦ Spark streaming support for real-time pipelines
  21. Use case mapping
     • Build operational analytics applications
     • Micro-service Enablement
     • Self-Service Data Analytics / Data Science
     • Data-As-A-Service
     • Empower developers to easily build solutions on Hadoop
     • Abstract technologies, future proof
     • Ingestion, Transformation, Blending (complex joins) and Lookup
     • Machine Learning, Aggregation and Reporting
     • Realtime and Batch data pipelines
     • DW Offloading (Netezza, Teradata, etc.)
     • Painless and fast ingest into Impala, operationalized
     • Data Ingestion from varied sources
     • Easy way to catalog application and pipeline level metadata
     • Search across technical, business and operational metadata
     • Track Lineage and Provenance
     • Track across non-Hadoop integrations
     • Usage Analytics of cluster data
     • Data Quality Measure
     • Integration with other MDM systems including Navigator
  22. Demo Example: Load log files from S3 to HDFS and perform aggregations/analysis.
     • Start with web access logs stored in Amazon S3
     • Store the raw logs into HDFS Avro files
     • Parse the access log lines into individual fields
     • Calculate the total number of requests by IP and status code
     • Find the IPs with the most successful status codes and the most error codes
     Sample web access log (Combined Log Format): 69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
     Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer URI, Client Info
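Outside of Hydrator itself, the core of this demo, parsing Combined Log Format lines and counting requests per IP and status code, can be sketched as the standalone Java program below. The regex and field handling are illustrative only and are not the actual demo pipeline configuration, where these steps would instead map to pipeline stages (an S3 source, a log-parsing transform, a group-by aggregation, and an HDFS/Avro sink).

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Standalone sketch (not the actual Hydrator pipeline) of the demo's aggregation:
    // parse Combined Log Format lines and count requests per (IP address, status code).
    public class AccessLogAggregation {

      // Captures: IP, timestamp, method, URI, status, response size.
      private static final Pattern LOG_LINE = Pattern.compile(
          "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+).*$");

      public static void main(String[] args) {
        String sample = "69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "
            + "\"GET /ajax/planStatusHistory HTTP/1.1\" 200 508 "
            + "\"http://builds.cask.co/log\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
            + "AppleWebKit Chrome/38.0.2125.122 Safari/537.36\"";

        Map<String, Long> countsByIpAndStatus = new HashMap<>();
        // In the real pipeline this loop would run over every log line read from S3.
        for (String line : new String[] { sample }) {
          Matcher m = LOG_LINE.matcher(line);
          if (!m.matches()) {
            continue; // skip malformed lines
          }
          String key = m.group(1) + " " + m.group(5); // IP address + HTTP status code
          countsByIpAndStatus.merge(key, 1L, Long::sum);
        }
        // Prints "69.181.160.120 200 -> 1" for the sample line above.
        countsByIpAndStatus.forEach((k, count) -> System.out.println(k + " -> " + count));
      }
    }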
  23. Thanks! Jonathan Gray @jgrayla Download CDAP w/ Hydrator: http://cask.co/downloads/
