
Strata+Hadoop NY 2015: Hydrate a data lake in days with CDAP


Strata+Hadoop World | New York, NY | Sept 29-Oct 1, 2015

About the talk:

Data lakes represent a powerful new data architecture, providing enterprises with the scale and flexibility required for big data: unbounded storage for unbounded questions. Hadoop is the de facto standard for implementing data lakes, but significant expertise, time, and effort are still required for organizations to deliver one. Today, enterprises building their own data lakes on Hadoop are effectively implementing their own internal platforms from a collection of individual open source technologies.

The many projects provided by open source and commercial Hadoop distributions must be integrated with each other, integrated with the existing environment, and operationalized into new and existing processes. With no established best practices or standards, each organization is left to find their own way and rely on expensive, external experts. Data lake proof of concepts can take months.

This talk introduces Cask Hydrator, a new open source data lake framework included in the latest release of the Cask Data App Platform (CDAP). Hydrator is a self-service data ingestion and ETL framework with a drag-and-drop user interface and JSON-based pipeline configurations. Enforcing best practices and providing out-of-the-box functionality, Hydrator enables enterprises to build data lakes in a matter of days. Integrations are included with open source and traditional data sources, from Kafka and Flume to Oracle and Teradata. Completely open source, Cask Hydrator is highly extensible and can be easily integrated with new data sources and sinks, and extended with custom transformations and validations.
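Hydrator pipelines are defined as JSON documents that can be assembled in the drag-and-drop UI or written by hand. As a rough sketch only (the exact configuration schema depends on the CDAP release, and the plugin names and properties below are illustrative assumptions, not authoritative), a batch pipeline reading from Kafka and writing Avro files into the lake might look like:

```json
{
  "name": "logs-to-lake",
  "config": {
    "stages": [
      { "name": "kafkaSource", "plugin": { "name": "Kafka", "type": "batchsource",
          "properties": { "topic": "weblogs" } } },
      { "name": "parse", "plugin": { "name": "Projection", "type": "transform" } },
      { "name": "avroSink", "plugin": { "name": "TPFSAvro", "type": "batchsink" } }
    ],
    "connections": [
      { "from": "kafkaSource", "to": "parse" },
      { "from": "parse", "to": "avroSink" }
    ]
  }
}
```

The key idea is that a pipeline is just a DAG described as plain text: a list of named stages, each backed by a plugin, plus the connections between them, which makes pipelines easy to generate, version, and review.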

Attendees will learn about data lakes, the different approaches and architectures enterprises are utilizing, the benefits and challenges associated with them, and how Cask Hydrator can enable the rapid creation of data lakes and dramatically decrease the complexity in operationalizing them.

This session is sponsored by Cask

Speaker Bio:

Jonathan Gray, founder and CEO of Cask, is an entrepreneur and software engineer with a background in startups, open source, and all things data. Prior to founding Cask, Jonathan was a software engineer at Facebook where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production.

An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and for refocusing the company's open source strategy. Prior to Facebook, Jonathan founded a startup where he became an early adopter of Hadoop and HBase; he is now a core contributor and active committer in the community.

Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.


  1. PROPRIETARY & CONFIDENTIAL. Why Cask? @jgrayla
  2. SIMPLE ACCESS TO POWERFUL TECHNOLOGY. Cask’s goal is to enable every developer and enterprise to quickly and easily build and run modern data applications using open source big data technologies like Hadoop.
  3. Introduction to Data Lakes. James Dixon (Pentaho): data streams in from sources to fill the lake, and various users of the lake can come to examine, dive in, or sample (Hadoop World NYC 2010). Gartner: enterprise-wide data management platforms for analyzing disparate sources of data in its native format. Hortonworks: collect everything, dive in anywhere, give flexible access; maximum scale and insight with the lowest possible friction and cost. Cloudera (Data Hub): a centralized, unified data source that can quickly provide diverse business users with the information they need to do their jobs.
  4. The Journey to Data Lakes is not Easy. Our customers are some of the most advanced users of Hadoop and have years invested in their journeys. The goal of CDAP is to provide a framework and set of abstractions to avoid the pitfalls and long timelines that plague Hadoop projects. CDAP drastically accelerates your adoption and utilization of big data.
  5. Types of Water… er, Data. Raw (a.k.a. Level 0): data that has been left in its native form, without any transformation. Defined (a.k.a. Level 1): data that has a defined schema and has been wrangled and cleansed. Refined (a.k.a. Level 2): data that has been aggregated from the source records, like counts or models.
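The three data levels above can be made concrete with a small sketch. This is illustrative Python, not CDAP code; the log format and field names are assumptions:

```python
# Level 0 (Raw): a record kept in its native form, untouched.
RAW_LOG = '127.0.0.1 - GET /index.html 200'  # hypothetical web-server log line

def define(record):
    """Level 1 (Defined): apply a schema and basic cleansing to a raw record."""
    ip, _, method, path, status = record.split()
    return {'ip': ip, 'method': method, 'path': path, 'status': int(status)}

def refine(records):
    """Level 2 (Refined): aggregate defined records, e.g. request counts per path."""
    counts = {}
    for r in records:
        counts[r['path']] = counts.get(r['path'], 0) + 1
    return counts
```

A lake typically stores all three: the raw line for reprocessing, the defined record for querying, and the refined aggregate for dashboards.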
  6. Types of Data Users. Analysts: vertical expertise; use BI tools; no programming; need a UI for access. Scientists: mixed expertise; use Python/R/SQL/etc.; basic programming; need tools for access. Developers: horizontal expertise; use Java/scripting; advanced programming; code for access.
  7. Data Lake Architectures. Data Pond: raw data copied from existing internal data stores and pulled from external data sources. Data Lake: raw + defined data pushed from other systems into a centralized, shared storage cluster. Data Reservoir: raw + defined data which is governed and audited to ensure compliance and security.
  8. Data Pond: SME / Enterprise Line of Business. Raw data copied from existing internal data stores and pulled from external data sources. Example: a customer 360° view, which brings together siloed datasets, combines them with external data sources, and lets you ask new questions and find unknown unknowns.
  9. Data Lake: Web Startup Company. Raw + defined data pushed from other systems into a centralized, shared storage cluster. Example: log storage and analytics, with ingestion of data from multiple sources, transformation and processing of data, and centralized storage and analytics of log data.
  10. Data Reservoir: Fortune 500 Enterprise. Raw + defined data which is governed and audited to ensure compliance and security enforcement. Example: an enterprise data hub providing storage and processing for all enterprise data, centralized auditing and enforcement, and any data available while ensuring compliance.
  11. Data Lake Challenges: manual processes requiring hand-coding and reliance on command-line tools; difficulty finding data and its lineage for discovery and exploration; operationalizing processes for production and maintaining SLAs; coupling of ingestion and processing driving architecture decisions; ensuring data is in canonical forms with a shared schema usable by others; sharing infrastructure in a multi-tenant environment without low-level QoS support; multiple architectures and technologies used by different teams on different clusters; guaranteeing compliance in a system designed for schema-on-read and raw data; and coding or filing tickets often being required to perform manual ingestion and processing tasks.
  12. CASK DATA APPLICATION PLATFORM: an integrated framework for building and running data applications on Hadoop. Integrates the latest big data technologies; supports all major Hadoop distributions; fully open source and highly extensible.
  13. CASK DATA APPLICATION PLATFORM: Key Features. Infrastructure INTEGRATION: provide an integrated product experience with out-of-the-box capabilities. Architecture STANDARDS: define a reference architecture to standardize support for mixed infrastructure. Programming ABSTRACTIONS: utilize abstraction layers to encapsulate complex patterns and insulate developers. Production SERVICES: provide development tools and runtime services to enable production apps and data.
  15. Self-Service Ingestion and ETL for Hadoop Data Lakes. Built for production on CDAP; rich drag-and-drop user interface; open source and highly extensible.
  16. DISCOVER data using user- and machine-generated metadata. INGEST any data from any source, in real time and batch. BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop. EGRESS any data to any destination, in real time and batch.
  17. Data Lakes on CDAP. The Hydrator framework, with templates and plugins, enables production workflows in minutes. Never lose data: all ingested data is tracked with metadata and lineage. Operationalize workflows using scheduling and SLA monitoring with time/partition awareness. Separate ingestion from processing to support any type, format, and rate. Use common transformations and a shared system for defining and exposing schema. Multi-tenant namespacing provides data and app isolation while tying together infrastructure. A reference architecture ensures a common platform across teams, orgs, ops, and security. Ensure compliance by requiring the use of specific transformations and validation. Self-service access through Cask Hydrator for the discovery, ingest, and exploration of data.
  18. Demo
  19. CDAP Community: 100% open source (ASL2). Website: Mailing List: IRC: #cdap. CDAP Enterprise: 100% commercially supported. Website: Contact Sales: Contact Me: or @jgrayla. Accelerate Your Data Lake Journey. Tap In @
  20. Thank You! Jonathan Gray @jgrayla. Questions?