Big Data on Tap
August 10th 2016
Nitin Motgi
CTO, Cask
WEBINAR
What’s New in CDAP 3.5
nmotgi
2
Release Date
3.5
August 19th, 2016
* May be released earlier, depending on customer testing
3
What’s in CDAP?
A self-service, re-configurable, code-free framework to build, run, and operate real-time or batch data pipelines in the cloud or on-premises.
A self-service tool for tracking the flow of data in and out of the Data Lake: track, index, and search technical, business, and operational metadata of applications and pipelines.
An integration platform that integrates and abstracts the underlying Hadoop technologies, for building data analytics solutions in the cloud or on-premises.
A powerful, versatile platform for building, publishing, and managing operational self-service analytics applications.
Your Apps
Your Apps
4
Infrastructure
Hadoop Distribution
Integration & Middleware
Applications
Simplification
Java Developers
Hadoop Engineer
Data Scientists /
Analysts
On-Premise Hadoop Distribution or Cloud-based Hadoop as a Service
Integrations
CDAP-native applications solving data ingestion and other use cases such as Fraud Detection, Data Quality, etc.
Abstractions providing simple APIs for developers to build data and data-science solutions.
Collection of 47+ OSS or proprietary components, well tested and packaged.
Where does CDAP fit?
5
Use case mapping
• Build operational analytics
applications
• Micro-service Enablement
• Self-Service Data Analytics / Data
Science
• Data-As-A-Service
• Empower developers to easily build solutions on Hadoop
• Abstract technologies, future-proof
• Ingestion, Transformation,
Blending (complex joins) and
Lookup.
• Machine Learning, Aggregation
and Reporting
• Realtime and Batch data pipelines
• DW Offloading (Netezza,
Teradata, etc)
• Painless, fast, operationalized ingest into Impala
• Data Ingestion from varied
sources
• Easy way to catalog application and
pipeline level metadata
• Search across technical, business
and operational metadata
• Track Lineage and Provenance
• Track across non-Hadoop
integrations
• Usage Analytics of cluster data
• Data Quality Measure
• Integration with other MDM systems
including Navigator
6
CDAP Timeline
Reactor 1.0
• Reactor Core
• Real-time Engine - Flow
• Dataset 1.0
• Operations UI
Reactor 2.0
• Ad-hoc SQL support (HIVE)
• Perimeter Security
• Stream support
• Dataset 2.0
• Workflow
• Resource View
• Metrics Explorer
• Tested on Apache Hadoop,
CDH & HDP
CDAP 3.0.0
• Spark
• New UI
• Operational Dashboard
• Namespace
• Kafka Integration
Cask Hydrator 1.0 Released
CDAP 2.5.0 (f.k.a. Reactor)
• OSS CDAP
• Application Templates (later known as Cask Hydrator) released
• ETL Batch & ETL Realtime
• Queryable Datasets
• Perimeter-Level Security
• Tigon Release
• History
Reactor 1.5
• Cloud Sandbox
• YARN Integration via Weave
• Dataset 1.5
• Transaction Support
• Service Support
• New Application API
• Tested on Apache Hadoop, HDP
& CDH
2011
2012 2014 2016
2015
CDAP 3.5
• Metadata, Lineage &
Properties
• Artifacts
Cask Tracker 1.0 Released
Tephra-Phoenix Integration
Apache Tephra Incubation
Apache Twill TLP
Apache Beam Collaboration
2013
7
3.5 - Release Highlights
Security — Authentication and Authorization: Support for fine-grained role-based authorization of entities in CDAP; integration with Sentry and Ranger
Security — Impersonation and Encryption: Ability to run CDAP and CDAP Apps as specified users, and ability to encrypt/decrypt sensitive configuration
Hydrator — Preview Mode: Ability to preview pipelines with real or injected data before deploying (Standalone)
Tracker — Data Usage Analytics: Learn how datasets are being used and the top applications accessing them
Metadata Taxonomy: Support for annotating business metadata based on a business-specified taxonomy
Hydrator — Spark Streaming: Build and run Hydrator real-time pipelines using Spark Streaming
Hydrator — Join & Action: Capability to join multiple streams (inner & outer), and ability to configure actions allowing one to run binaries on designated nodes
Hydrator — Plugins: Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML
8
9
Authentication and Authorization
• Isolation of data and operations from users unless access has been explicitly granted
• Granular support for role-based authorization on Namespace, Datasets and Applications
• Ability to control and enforce precise levels of access for Data and Compute
• Manage Roles and Permissions: CDAP REST API / CLI, CM-HUE (CDH), and Ranger UI (HDP)*
• Pluggable extension for integrating with Apache Sentry, Apache Ranger* and LDAP*
• Authorization is enforced for both data access, as well as administration and
management of entities
• Ability to test authorization in CI environment using in-memory implementation
* Not included in packaged release, but available upon request
10
Encryption
• Secure store for users to safely store sensitive data like passwords and access keys
• Accessed by CDAP Programs / Hydrator via run-time arguments, and data can be
referenced by key names
• Data is access-controlled and uses authorization to enforce privileges.
• Only authorized users can access the data in secure store
• Configurable at the key level, controllable through Apache Sentry 3.5
• InMemory and Standalone modes use JCEKS storage provider (file based)
• Distributed mode uses Hadoop KMS as storage provider
// Put, get, delete, and list entries in the secure store:
getContext().getAdmin().putSecureData(namespace, KEY, new String(value), "", new HashMap<String, String>());
String value = getContext().getSecureData(namespace, KEY).get();
getContext().getAdmin().deleteSecureData(namespace, KEY);
getContext().listSecureData(namespace);
11
Access across Namespaces
• A Namespace is a logical grouping of applications and data that partitions a CDAP instance
• Previously, applications in a namespace could access only resources in the same namespace
• Now a Program can access a Dataset in another Namespace, provided the owner has authorized the user to access the dataset
context.addInput(Input.ofDataset(dataset).fromNamespace(otherNamespace));
JavaPairRDD<Long, String> backlinkURLs = sc.fromStream(otherNamespace, stream, String.class);
12
Impersonation and Namespace Mapping
• No more global ‘cdap’ user
• Secured Impersonation allows a superuser to perform operations on behalf of other users
• Programs (Compute) can be submitted to cluster to be run as any configured user
• Dataset operations are performed as the user who started the Programs
• Support to map Namespace, Application, Program or Schedule to a Kerberos principal
• Access controls are pushed down to lower layers so they cannot be circumvented by external means
• Namespace creation can map to existing resources on cluster (e.g. HDFS, HBase, etc)
• Once mapped, cannot be modified
• REST APIs for creating mapped Namespace
13
Demo
14
15
Join
• Data is usually normalized across multiple sources in order to minimize data redundancy
• Normalization divides the data into multiple tables (e.g., Customer Order and Customer Info)
• Join capability allows users to join data from multiple datasets (external or internal)
• Support for Inner and Outer Joins
• Support for executing Join in MapReduce and Spark
• Supports only Equality and AND comparison on composite keys
• Join available only in Batch
• Automatic generation of output schema based on Join configuration
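The inner/outer distinction on a composite key can be sketched in plain Java as below. This is an illustrative toy, not the Hydrator Joiner plugin API, and the field names (customer id, region) are hypothetical; the real plugin runs the same idea at scale in MapReduce or Spark.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: equality join on a composite key (the key is a
// List, so matching is an AND across all its components).
public class JoinSketch {
    static List<String> join(Map<List<String>, String> orders,
                             Map<List<String>, String> info,
                             boolean outer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<List<String>, String> e : orders.entrySet()) {
            String right = info.get(e.getKey());   // equality match on composite key
            if (right != null) {
                out.add(e.getValue() + "|" + right);
            } else if (outer) {
                out.add(e.getValue() + "|null");   // left-outer keeps unmatched rows
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, String> orders = new LinkedHashMap<>();
        orders.put(List.of("c1", "us"), "order-42");
        orders.put(List.of("c2", "eu"), "order-43");
        Map<List<String>, String> info = Map.of(List.of("c1", "us"), "Alice");
        System.out.println(join(orders, info, false)); // inner: [order-42|Alice]
        System.out.println(join(orders, info, true));  // outer: [order-42|Alice, order-43|null]
    }
}
```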
16
Action
• Support for combining Data Flow and Control Flow in the same pipeline
• Actions are part of Control Flow
• Actions can be added at the Start or End of Data Flow
• Actions support running arbitrary code on any desired machine as part of the pipeline
• Control Flow can be Fork and Join
• SSH Action, DB Query and HDFS Action are currently supported
• A pipeline can consist of control flow only, i.e., only Actions in the pipeline
• Actions are able to pass information to Data Flow using Macro variables
17
Realtime Pipelines - Spark Streaming
• Drag and Drop UI for creating pipelines that run Spark Streaming - Real-time
• Capability to transform each record in the pipeline
• Expose easily configurable windowing capability
• Support for computing aggregates across the keys
• Support for loading Machine Learning models to predict, label, and classify
• Easily extendable APIs for building your own plugins
• Support for exactly-once semantics
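The windowed-aggregation idea behind the configurable windows above can be sketched in plain Java. This is an assumption-laden toy over a fixed list, not Spark Streaming code; a real pipeline applies the same sliding-window computation to an unbounded stream of micro-batches.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: sum events in sliding windows of `size` elements,
// advancing the window start by `slide` elements each step.
public class WindowSketch {
    static List<Integer> slidingSums(List<Integer> events, int size, int slide) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= events.size(); start += slide) {
            int sum = 0;
            for (int i = start; i < start + size; i++) {
                sum += events.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // windows: [1,2,3], [2,3,4], [3,4,5]
        System.out.println(slidingSums(List.of(1, 2, 3, 4, 5), 3, 1)); // [6, 9, 12]
    }
}
```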
18
Macros
• Shorthand notation for retrieving values, resolved against workflow tokens and runtime arguments (in that order of precedence)
• Allows re-use of pipelines that previously required re-deploying
• Macros specified at configure time are substituted at runtime on a per-run basis
• Macros support ‘Property Lookups’ and ‘Macro Functions’
• Supports combining multiple macros, nested macros, and recursive macros
• Programmatic API allowing plugin developers to use this capability
${macro-name}
${macro-function(arg1, arg2)}
${host}/${path}:${port}
${hostname${host-suffix}}
jdbc:mysql://${hostname}/${database}
${${escaped-macro-literal}}
${logicalStartTime(timeformat, offset)}
${secure(database-password)}
jdbc:microsoft:sqlserver://${host}:${port};DatabaseName=${database}
jdbc:postgresql://${host}:${port}/${database};user=joltie,password=${secure(password)}
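The substitution the examples above rely on can be sketched with a minimal recursive resolver. This is illustrative only, not CDAP's actual macro engine (it handles plain property lookups, nested macros, and recursive macros, but not macro functions like `secure(...)`), and the property names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of ${...} macro resolution, innermost-first.
public class MacroSketch {
    static String resolve(String s, Map<String, String> props) {
        int start = s.lastIndexOf("${");          // last opener is always innermost
        if (start < 0) return s;                  // no macros left: done
        int end = s.indexOf('}', start);
        if (end < 0) throw new IllegalArgumentException("unbalanced macro in: " + s);
        String val = props.getOrDefault(s.substring(start + 2, end), "");
        // Substitute, then keep resolving: this makes recursive macros work,
        // since a substituted value may itself contain ${...}.
        return resolve(s.substring(0, start) + val + s.substring(end + 1), props);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("host", "db01");
        props.put("port", "5432");
        props.put("database", "sales");
        System.out.println(resolve("jdbc:postgresql://${host}:${port}/${database}", props));
        // prints: jdbc:postgresql://db01:5432/sales
    }
}
```

Resolving innermost macros first is what lets a nested form like `${hostname${host-suffix}}` work: the inner macro's value completes the name of the outer property before the outer lookup happens.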
19
Preview Mode
• Support functional verification of pipeline during development
• Enables fast iterations during development of pipeline
• Pipeline doesn’t need to be deployed to CDAP to achieve this
• Insight into inputs and outputs of each node in pipeline; see transformations at each
stage in pipeline
• Ability to read data from the actual source
• Supported only in Standalone mode
ALPHA
20
New Plugins
• Source
  • COBOL Copybook
  • XML
  • Excel
  • FTP
• Transform and Science
  • Value Mapper
  • XML to JSON
  • XML Parser
  • Row Denormalizer
  • Normalizer
  • Order By*
  • Logistic Regression - Trainer and Scorer*
  • Tokenizer*
  • Gradient Descent Boosting Tree*
  • Random Forest*
• Sink
  • Solr
* Available post-release
21
Demo
22
23
Tracker - Data Usage Analytics
• Provides insight into what data is being accessed, how it’s being used, what’s popular in your cluster, the top Datasets being accessed, etc.
• Supports aggregated analytics based on time intervals
• Aggregations are based on Audit Log events
• Analytics are available at the individual dataset level and globally at the cluster level
• Dataset Tracker-Meter
• A measure of data quality based on profiling and social metrics
• Available for every Dataset in the cluster
• Scale from 1 to 100 measuring quality: 1 = bad Dataset, 100 = good Dataset
BETA
24
Tracker - Tag Annotation & Metadata Taxonomy
• Support for adding Business Tags to a Dataset (previously, tags could only be viewed)
• Support for adding User Properties to a Dataset
• Preferred Tags are standardized taxonomy of business tags
• Manage Preferred Tags
• Promote Regular Tags to Preferred Tags
• Upload Preferred Tags
• Annotate Dataset with Preferred Tags
• Auto-completion of Tags while annotating
25
Demo
26
Thank you all for your time
nmotgi
CDAP 3.5 RC Available
http://cask.co/downloads
info@cask.co