
Strata+Hadoop NY 2015: Use case examples of building applications on Hadoop with CDAP

Strata+Hadoop World | New York, NY | Sept 29-Oct 1, 2015

About the talk:

Hadoop is often used as a cost-effective augmentation of an enterprise data warehouse, or as a new business intelligence tool. A greater opportunity exists if a broader range of developers can harness the power of Hadoop analytics, not just for BI but also for operational applications. The challenges developers face in building solutions on Hadoop, however, are significant.

In this presentation, we will provide a brief background on the Cask Data Application Platform (CDAP), and then delve into three customer case studies. The case studies will show how organizations can accelerate Hadoop solution development and deployment across a range of use cases and vertical markets. In all of the examples, we will show how a data application platform can help organizations more effectively use existing talent, minimizing the cost of specialized Hadoop developers or consultants.

Each of these customers dealt with a common set of challenges driven by the evolution of Hadoop. Tremendous innovation around Hadoop and NoSQL technologies has resulted in a proliferation of data storage and processing options that require unique skills. The tight coupling of data, logic, and infrastructure makes reuse difficult. Finally, any operational systems require life cycle management, which is largely absent from the current Hadoop ecosystem.

Much like application servers such as WebLogic opened the door for application innovation on relational databases, the Hadoop ecosystem will progress faster with a data application platform. It has been about a year since CDAP was contributed to open source to address the needs of developers. Since then, thousands of developers have downloaded the platform or cloned the GitHub repo, highlighting the project’s role as the leading platform for developers on Hadoop.

Speaker Bio:

Jonathan Gray, founder and CEO of Cask, is an entrepreneur and software engineer with a background in startups, open source, and all things data. Prior to founding Cask, Jonathan was a software engineer at Facebook where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production.

An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase and is now a core contributor and active committer in the community.

Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.

http://cask.co/



  1. PROPRIETARY & CONFIDENTIAL Why Cask? @jgrayla
  2. SIMPLE ACCESS TO POWERFUL TECHNOLOGY: Cask’s goal is to enable every developer and enterprise to quickly and easily build and run modern data applications using open source big data technologies like Hadoop.
  3. Introduction to Data Applications. Data applications, also known as operational analytics, integrate analytics into applications to drive actionable intelligence, turning insights into action. These are apps that use data insights to enhance the customer experience, achieve a business objective, improve a business process, or enable new products or lines of business.
  4. The Path to Data Apps.
 ENTERPRISE DATA LAKES: Batch and Realtime Data Ingestion (any type of data from any type of source in any volume); Batch and Streaming ETL (code-free, self-service creation and management of pipelines); SQL Exploration and Data Science (all data is automatically accessible via SQL and client SDKs); Data as a Service (easily expose generic or custom REST APIs on any data).
 BIG DATA ANALYTICS: 360° Customer View (integrate data from any source and expose it through queries and APIs); Realtime Dashboards (perform realtime OLAP aggregations and serve them through REST APIs); Time Series Analysis (store, process, and serve massive volumes of time-series data); Realtime Log Analytics (ingestion and processing of high-throughput streaming log events).
 PRODUCTION DATA APPS: Recommendation Engines (build models in batch using historical data and serve them in realtime); Anomaly Detection Systems (process streaming events and compare them in realtime to historical data); NRT Event Monitoring (reliably monitor large streams of data and perform defined actions within a specified time); Internet of Things (highly available, scalable, and consistent ingestion, storage, and processing of events).
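The "Realtime Dashboards" item above rests on a simple idea: pre-aggregate counters for every combination of dimension values as events arrive, so any dashboard slice is a direct lookup rather than a scan. CDAP ships an OLAP Cube dataset for this; the class below is only a toy sketch of the underlying technique, not CDAP's actual Cube API, and all names in it are illustrative.

```python
from itertools import combinations

class ToyCube:
    """Toy OLAP cube: counts events for every subset of dimensions.

    Illustrates the pre-aggregation idea behind realtime dashboards;
    this is NOT CDAP's Cube dataset API.
    """
    def __init__(self, dimensions):
        self.dimensions = dimensions
        self.counts = {}

    def add(self, event):
        # Increment one counter per subset of dimensions, so any slice
        # (by country, by country+device, or the grand total) is O(1) to read.
        for r in range(len(self.dimensions) + 1):
            for dims in combinations(self.dimensions, r):
                key = tuple((d, event[d]) for d in dims)
                self.counts[key] = self.counts.get(key, 0) + 1

    def query(self, **filters):
        key = tuple(sorted(filters.items()))
        return self.counts.get(key, 0)

cube = ToyCube(["country", "device"])
cube.add({"country": "US", "device": "mobile"})
cube.add({"country": "US", "device": "desktop"})
cube.add({"country": "DE", "device": "mobile"})

print(cube.query(country="US"))     # 2
print(cube.query(device="mobile"))  # 2
print(cube.query())                 # 3 (grand total)
```

The write amplification (2^dimensions counters per event) is the classic cube trade-off: more work at ingest time buys constant-time dashboard reads.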
  5. Data App Customer Examples. E-Commerce: mature company with an existing dependence on SaaS services and legacy apps, moving to a Data Lake and custom apps. Enterprise: large global enterprise with multiple lakes and apps, attempting to centralize the platform and extend it to lines of business. SaaS: technically advanced company with product teams focused on minimizing time-to-market for apps.
  6. Data App in SaaS. Deep technical talent with product focus; goal: reduce time-to-market. NRT Event Monitoring: high-throughput, real-time event ingestion; consistent processing of events with persistence; scalable and guaranteed NRT event processing.
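The NRT event-monitoring requirement in the SaaS case ("perform defined actions within a specified time") reduces to evaluating a rule over a sliding window of recent events. A toy sketch of that core loop follows; the window logic and threshold rule are illustrative assumptions, not the customer's actual pipeline:

```python
from collections import deque

class WindowMonitor:
    """Toy near-real-time monitor: fires `action` when more than
    `threshold` events arrive within the last `window_secs` seconds.
    Illustrative only; not the customer's actual system.
    """
    def __init__(self, window_secs, threshold, action):
        self.window_secs = window_secs
        self.threshold = threshold
        self.action = action
        self.times = deque()

    def observe(self, timestamp):
        self.times.append(timestamp)
        # Evict events that have fallen out of the window.
        while self.times and timestamp - self.times[0] > self.window_secs:
            self.times.popleft()
        if len(self.times) > self.threshold:
            self.action(len(self.times))

alerts = []
mon = WindowMonitor(window_secs=60, threshold=2, action=alerts.append)
for t in (0, 10, 20, 200):  # three events within one minute, then a quiet gap
    mon.observe(t)
print(alerts)  # [3] -- fired once, when the third event landed in the window
```

The "scalable and guaranteed" part of the slide is exactly what this toy omits: in production the window state must be persisted and the events processed exactly once, which is the infrastructure burden the talk argues a platform should absorb.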
  7. Data App in E-Commerce. Unfamiliar technical talent with a shift to DIY; goal: address skill gaps. Consumer Intelligence: ingestion of multiple customer data sources; real-time analytics and machine learning; targeting and optimization for the consumer web.
  8. Data App in the Enterprise. Highly variable talent with organizational gaps and executive pressure; goal: provide a reference architecture and abstractions. Enterprise Data Hub: centralized data storage and management in Data Lakes; tools and APIs that extend data into lines of business; multi-tenant PaaS for governance and accessibility.
  9. Big Data Application Example
  10. Big Data Technology Explosion. 2006: Core Hadoop (HDFS, MR). 2008: + HBase, ZooKeeper. 2009: + Hive, Pig, Mahout. 2010: + Sqoop, Whirr, Avro. 2011: + Flume, Bigtop, Oozie, MRUnit, HCatalog. 2012: + Spark, Impala, Solr, Kafka. Present: + Sentry, Tez, Parquet, Ranger, Knox, YARN. “Hadoop Alone = 47+ projects across 6 distros” — Merv Adrian, Gartner Analyst
  11. Data App Challenges. E-Commerce: emphasis and time spent drastically skewed toward ops and integration logic; difficult for existing Java devs to use APIs like HBase and build apps without dev lifecycle tools; significant time and effort to bring applications from development into production. Enterprise: lack of standards and best practices leads to divergent architectures across the org; gaps between IT and LoB make it difficult to cooperate and separate areas of concern; multiple OSS projects and Hadoop distributions make ops and governance a challenge. SaaS: real-time, scale-out data ingestion into HDFS and HBase means building infrastructure; difficult to provide guarantees such as high availability and data consistency; tight coupling of the ingestion pipeline and the processing pipeline creates operational complexity.
  12. The Power of Abstraction. Hadoop is a diverse collection of open source infrastructure projects, and the universe of open source infrastructure continues to expand and accelerate. This stuff is already too hard. There is a dire need for abstraction: to provide standardization and encapsulation, and to enable accessibility, flexibility, and reusability.
  13. Analogy of Hadoop to RDBMS. (Chart: market size, 1970–2030. The RDBMS is created around 1970 and the BI market grows to $30–$40bn; over time, value accrues up the stack from infrastructure ($) to middleware & management to applications ($$$).)
  14. Analogy of Hadoop to RDBMS. (Chart: market maturity, 2007–2016. Hadoop created; pioneers adopt and extend Hadoop; early adopters adopt; fast followers. Still early days…)
  15. CASK DATA APPLICATION PLATFORM: Integrated Framework for Building and Running Data Applications on Hadoop. Integrates the Latest Big Data Technologies. Supports All Major Hadoop Distributions. Fully Open Source and Highly Extensible.
  16. Key Features of the CASK DATA APPLICATION PLATFORM. Infrastructure INTEGRATION: provide an integrated product experience with out-of-the-box capabilities. Architecture STANDARDS: define a reference architecture to standardize support for mixed infrastructure. Programming ABSTRACTIONS: utilize abstraction layers to encapsulate complex patterns and insulate developers. Production SERVICES: provide development tools and runtime services to enable production apps and data.
  17. • Application Container Architecture • Reusable Programming Abstractions • Global User and Machine Metadata. (Diagram: Applications contain Programs (MapReduce, Spark, Tigon, Workflow, Service, Worker) and Datasets (Table, Avro, Parquet, Timeseries, OLAP Cube, Geospatial, ObjectStore), each carrying Metadata.)
  18. (Image-only slide.)
  19. Self-Service Ingestion and ETL for Hadoop Data Lakes. Built for Production on CDAP. Rich Drag-and-Drop User Interface. Open Source & Highly Extensible.
  20. DISCOVER data using user- and machine-generated metadata. INGEST any data from any source, in real-time and batch. BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop. EGRESS any data to any destination, in real-time and batch.
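The INGEST step above, like most CDAP operations, is exposed over HTTP: a client POSTs event bodies to a stream endpoint. As a rough sketch only, the helper below builds such a request; the `/v3/namespaces/.../streams/...` path and the default router port are assumptions based on CDAP 3.x-era REST conventions, not something stated in the slides.

```python
import urllib.request

def stream_ingest_request(host, stream, body, namespace="default", port=10000):
    """Build a POST request that sends one event to a CDAP stream.

    ASSUMPTION: the v3 namespaced stream path and the router port 10000
    follow CDAP 3.x REST conventions; check your deployment's docs.
    """
    url = f"http://{host}:{port}/v3/namespaces/{namespace}/streams/{stream}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")

# One log line into a (hypothetical) stream named "events":
req = stream_ingest_request("localhost", "events",
                            '{"level": "ERROR", "msg": "disk full"}')
# urllib.request.urlopen(req) would deliver it to a running CDAP instance.
```

Because ingestion is just HTTP, any source that can issue a POST (an app server, a load balancer log shipper, a cron job) can feed the lake without linking against Hadoop client libraries.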
  21. Data Apps on CDAP. E-Commerce: CDAP is an integrated platform that accelerates onboarding and keeps focus on business logic; Dataset Patterns and App Templates provide higher-level, domain-specific APIs; tools and runtime services enable a true dev lifecycle and faster time to production. Enterprise: CDAP defines a set of standards and embeds best practices to drive a common architecture; tools, reusable abstractions, and self-service interfaces enable non-IT users to access Hadoop; CDAP provides portability across many versions of major distros and is going beyond Hadoop. SaaS: the CDAP Streams capability provides scale-out, highly available real-time data ingest; the Tephra transaction engine integrates with Datasets and Programs for data consistency; Streams separate ingestion from processing and combine support for real-time, batch, and SQL.
  22. Demo
  23. CDAP Community: 100% Open Source (ASL2). Website: http://cdap.io. Mailing lists: cdap-user@googlegroups.com, cdap-dev@googlegroups.com. IRC: #cdap on freenode.net. CDAP Enterprise: 100% Commercially Supported. Website: http://cask.co. Contact Sales: sales@cask.co. Contact me: jon@cask.co or @jgrayla. Accelerate Your Big Data Journey. Tap In @ cask.co. Download now or learn more at cask.co.
  24. Thank You! Jonathan Gray, jon@cask.co, @jgrayla. Questions?
