Big Data on Tap
August 10th 2016
Nitin Motgi
CTO, Cask
WEBINAR
What’s New in CDAP 3.5
nmotgi
2
Release Date
3.5
August 19th, 2016
* May be released earlier, depending on customer testing
3
What’s in CDAP?
A self-service, re-configurable, code-free framework to build, run, and operate real-time or batch data pipelines in the cloud or on-premises.
A self-service tool for tracking the flow of data in and out of the Data Lake: track, index, and search technical, business, and operational metadata of applications and pipelines.
An integration platform that integrates and abstracts the underlying Hadoop technologies, for building data analytics solutions in the cloud or on-premises.
A powerful, versatile platform for building, publishing, and managing operational self-service analytics applications.
Your Apps
Your Apps
4
Infrastructure
Hadoop Distribution
Integration & Middleware
Applications
Simplification
Java Developers
Hadoop Engineer
Data Scientists /
Analysts
On-Premise Hadoop Distribution or Cloud-based Hadoop as a Service
Integrations
CDAP-native applications solving data ingestion and other use cases such as Fraud Detection, Data Quality, etc.
Abstractions providing simple APIs for developers to build data and data-science solutions.
Collection of 47+ OSS or proprietary components, well tested and packaged.
Where does CDAP fit?
5
Use case mapping
• Build operational analytics
applications
• Micro-service Enablement
• Self-Service Data Analytics / Data
Science
• Data-As-A-Service
• Empower developers to easily build solutions on Hadoop
• Abstract technologies, future-proof
• Ingestion, Transformation,
Blending (complex joins) and
Lookup.
• Machine Learning, Aggregation
and Reporting
• Realtime and Batch data pipelines
• DW Offloading (Netezza,
Teradata, etc)
• Painless, fast, operationalized ingest into Impala
• Data Ingestion from varied
sources
• Easy way to catalog application and
pipeline level metadata
• Search across technical, business
and operational metadata
• Track Lineage and Provenance
• Track across non-Hadoop
integrations
• Usage Analytics of cluster data
• Data Quality Measure
• Integration with other MDM systems
including Navigator
6
CDAP Timeline
Reactor 1.0
• Reactor Core
• Real-time Engine - Flow
• Dataset 1.0
• Operations UI
Reactor 2.0
• Ad-hoc SQL support (HIVE)
• Perimeter Security
• Stream support
• Dataset 2.0
• Workflow
• Resource View
• Metrics Explorer
• Tested on Apache Hadoop,
CDH & HDP
CDAP 3.0.0
• Spark
• New UI
• Operational Dashboard
• Namespace
• Kafka Integration
Cask Hydrator 1.0 Released
CDAP 2.5.0 (f.k.a. Reactor)
• OSS CDAP
• Application Templates (later known as Cask Hydrator) released
• ETL Batch & ETL Realtime
• Queryable Datasets
• Perimeter-Level Security
• Tigon Release
• History
Reactor 1.5
• Cloud Sandbox
• YARN Integration via Weave
• Dataset 1.5
• Transaction Support
• Service Support
• New Application API
• Tested on Apache Hadoop, HDP
& CDH
2011
2012 2014 2016
2015
CDAP 3.5
• Metadata, Lineage &
Properties
• Artifacts
Cask Tracker 1.0 Released
Tephra-Phoenix Integration
Apache Tephra Incubation
Apache Twill TLP
Apache Beam Collaboration
2013
7
3.5 - Release Highlights
Security — Authentication and Authorization: Support for fine-grained role-based authorization of entities in CDAP; integration with Sentry and Ranger
Security — Impersonation and Encryption: Ability to run CDAP and CDAP Apps as specified users, and ability to encrypt/decrypt sensitive configuration
Hydrator — Preview Mode: Ability to preview pipelines with real or injected data before deploying (Standalone)
Tracker — Data Usage Analytics: Learn how datasets are being used and the top applications accessing them
Metadata Taxonomy: Support for annotating business metadata based on a business-specified taxonomy
Hydrator — Spark Streaming: Build and run Hydrator real-time pipelines using Spark Streaming
Hydrator — Join & Action: Capability to join multiple streams (inner & outer), and ability to configure actions allowing one to run binaries on designated nodes
Hydrator — Plugins: Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML
8
9
Authentication and Authorization
• Isolation of data and operations from users unless access has been explicitly granted
• Granular support for role-based authorization on Namespace, Datasets and Applications
• Ability to control and enforce precise levels of access for Data and Compute
• Manage Roles and Permissions: CDAP REST API / CLI, CM-HUE (CDH), and Ranger UI (HDP)*
• Pluggable extension for integrating with Apache Sentry, Apache Ranger* and LDAP*
• Authorization is enforced for both data access, as well as administration and
management of entities
• Ability to test authorization in CI environment using in-memory implementation
* Not included in packaged release, but available upon request
10
Encryption
• Secure store for users to safely store sensitive data like passwords and access keys
• Accessed by CDAP Programs / Hydrator via run-time arguments, and data can be
referenced by key names
• Data is access-controlled and uses authorization to enforce privileges.
• Only authorized users can access the data in secure store
• Configurable at the key level, controllable through Apache Sentry 3.5
• InMemory and Standalone modes use JCEKS storage provider (file based)
• Distributed mode uses Hadoop KMS as storage provider
// Put, get, delete, and list entries in the secure store:
getContext().getAdmin().putSecureData(namespace, KEY, new String(value), "", new HashMap<String, String>());
String value = getContext().getSecureData(namespace, KEY).get();
getContext().getAdmin().deleteSecureData(namespace, KEY);
getContext().listSecureData(namespace);
11
Access across Namespaces
• A Namespace is a logical grouping of applications and data that partitions a CDAP instance
• Previously, applications in a namespace could access only resources in the same namespace
• Now a Program can access a Dataset in another Namespace, provided the owner has authorized the user to access the dataset
context.addInput(Input.ofDataset(dataset).fromNamespace(otherNamespace));
JavaPairRDD<Long, String> backlinkURLs = sc.fromStream(otherNamespace, stream, String.class);
12
Impersonation and Namespace Mapping
• No more global ‘cdap’ user
• Secured Impersonation allows a superuser to perform operations on behalf of other users
• Programs (Compute) can be submitted to cluster to be run as any configured user
• Dataset operations are performed as the user who started the Programs
• Support to map Namespace, Application, Program or Schedule to a Kerberos principal
• Access controls are pushed down to lower layers so they cannot be circumvented by external means
• Namespace creation can map to existing resources on cluster (e.g. HDFS, HBase, etc)
• Once mapped, cannot be modified
• REST APIs for creating mapped Namespace
13
Demo
14
15
Join
• Data is usually normalized across multiple sources in order to minimize data redundancy
• Normalization divides the data into multiple tables (e.g., Customer Order and Customer Info)
• Join capability allows users to join data from multiple datasets (external or internal)
• Support for Inner and Outer Joins
• Support for executing Join in MapReduce and Spark
• Supports only Equality and AND comparison on composite keys
• Join available only in Batch
• Automatic generation of output schema based on Join configuration
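The inner/outer distinction on a composite key can be sketched in plain Java as below. This is an illustrative toy, not the Hydrator Joiner plugin API, and the field names (customer id, region) are hypothetical; the real plugin runs the same idea at scale in MapReduce or Spark.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: equality join on a composite key (the key is a
// List, so matching is an AND across all its components).
public class JoinSketch {
    static List<String> join(Map<List<String>, String> orders,
                             Map<List<String>, String> info,
                             boolean outer) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<List<String>, String> e : orders.entrySet()) {
            String right = info.get(e.getKey());   // equality match on composite key
            if (right != null) {
                out.add(e.getValue() + "|" + right);
            } else if (outer) {
                out.add(e.getValue() + "|null");   // left-outer keeps unmatched rows
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, String> orders = new LinkedHashMap<>();
        orders.put(List.of("c1", "us"), "order-42");
        orders.put(List.of("c2", "eu"), "order-43");
        Map<List<String>, String> info = Map.of(List.of("c1", "us"), "Alice");
        System.out.println(join(orders, info, false)); // inner: [order-42|Alice]
        System.out.println(join(orders, info, true));  // outer: [order-42|Alice, order-43|null]
    }
}
```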
16
Action
• Support for combining Data Flow and Control Flow in the same pipeline
• Actions are part of Control Flow
• Actions can be added at the Start or End of Data Flow
• Actions support running arbitrary code on any desired machine as part of the pipeline
• Control Flow can be Fork and Join
• SSH Action, DB Query and HDFS Action are currently supported
• A pipeline can consist of control flow only, i.e., only Actions in the pipeline
• Actions are able to pass information to Data Flow using Macro variables
17
Realtime Pipelines - Spark Streaming
• Drag and Drop UI for creating pipelines that run Spark Streaming - Real-time
• Capability to transform each record in the pipeline
• Expose easily configurable windowing capability
• Support for computing aggregates across the keys
• Support for loading Machine Learning models to predict, label, and classify
• Easily extendable APIs for building your own plugins
• Support for exactly-once semantics
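The windowed-aggregation idea behind the configurable windows above can be sketched in plain Java. This is an assumption-laden toy over a fixed list, not Spark Streaming code; a real pipeline applies the same sliding-window computation to an unbounded stream of micro-batches.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: sum events in sliding windows of `size` elements,
// advancing the window start by `slide` elements each step.
public class WindowSketch {
    static List<Integer> slidingSums(List<Integer> events, int size, int slide) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= events.size(); start += slide) {
            int sum = 0;
            for (int i = start; i < start + size; i++) {
                sum += events.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // windows: [1,2,3], [2,3,4], [3,4,5]
        System.out.println(slidingSums(List.of(1, 2, 3, 4, 5), 3, 1)); // [6, 9, 12]
    }
}
```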
18
Macros
• Shorthand notation for retrieving values, resolved against workflow tokens and runtime arguments (in that order of precedence)
• Allows re-use of pipelines that previously required re-deploying
• Macros specified at configure time are substituted at runtime on a per-run basis
• Macros support ‘Property Lookups’ and ‘Macro Functions’
• Supports combining multiple macros, nested macros, and recursive macros
• Programmatic API allowing plugin developers to use this capability
${macro-name}
${macro-function(arg1, arg2)}
${host}/${path}:${port}
${hostname${host-suffix}}
jdbc:mysql://${hostname}/${database}
${${escaped-macro-literal}}
${logicalStartTime(timeformat, offset)}
${secure(database-password)}
jdbc:microsoft:sqlserver://${host}:${port};DatabaseName=${database}
jdbc:postgresql://${host}:${port}/${database};user=joltie,password=${secure(password)}
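The substitution the examples above rely on can be sketched with a minimal recursive resolver. This is illustrative only, not CDAP's actual macro engine (it handles plain property lookups, nested macros, and recursive macros, but not macro functions like `secure(...)`), and the property names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of ${...} macro resolution, innermost-first.
public class MacroSketch {
    static String resolve(String s, Map<String, String> props) {
        int start = s.lastIndexOf("${");          // last opener is always innermost
        if (start < 0) return s;                  // no macros left: done
        int end = s.indexOf('}', start);
        if (end < 0) throw new IllegalArgumentException("unbalanced macro in: " + s);
        String val = props.getOrDefault(s.substring(start + 2, end), "");
        // Substitute, then keep resolving: this makes recursive macros work,
        // since a substituted value may itself contain ${...}.
        return resolve(s.substring(0, start) + val + s.substring(end + 1), props);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("host", "db01");
        props.put("port", "5432");
        props.put("database", "sales");
        System.out.println(resolve("jdbc:postgresql://${host}:${port}/${database}", props));
        // prints: jdbc:postgresql://db01:5432/sales
    }
}
```

Resolving innermost macros first is what lets a nested form like `${hostname${host-suffix}}` work: the inner macro's value completes the name of the outer property before the outer lookup happens.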
19
Preview Mode
• Support functional verification of pipeline during development
• Enables fast iterations during development of pipeline
• Pipeline doesn’t need to be deployed to CDAP to achieve this
• Insight into inputs and outputs of each node in pipeline; see transformations at each
stage in pipeline
• Ability to read data from the actual source
• Supported only in Standalone mode
ALPHA
20
New Plugins
• Source
  • COBOL Copybook
  • XML
  • Excel
  • FTP
• Transform and Science
  • Value Mapper
  • XML to JSON
  • XML Parser
  • Row Denormalizer
  • Normalizer
  • Order By*
  • Logistic Regression - Trainer and Scorer*
  • Tokenizer*
  • Gradient Descent Boosting Tree*
  • Random Forest*
• Sink
  • Solr
* Available post-release
21
Demo
22
23
Tracker - Data Usage Analytics
• Provides insight into what data is being accessed, how it’s being used, what’s popular in your cluster, the top Datasets being accessed, etc.
• Supports aggregated analytics based on time intervals
• Aggregations are based on Audit Log events
• Analytics are available at the individual dataset level and globally at the cluster level
• Dataset Tracker-Meter
• A measure of data quality based on profiling and social metrics
• Available for every Dataset in the cluster
• Scale from 1 to 100 measuring quality: 1 = bad Dataset, 100 = good Dataset
BETA
24
Tracker - Tag Annotation & Metadata Taxonomy
• Support for adding Business Tags to a Dataset (previously, tags could only be viewed)
• Support for adding User Properties to a Dataset
• Preferred Tags are standardized taxonomy of business tags
• Manage Preferred Tags
• Promote Regular Tags to Preferred Tags
• Upload Preferred Tags
• Annotate Dataset with Preferred Tags
• Auto-completion of Tags while annotating
25
Demo
26
Thank you all for your time
nmotgi
CDAP 3.5 RC Available
http://cask.co/downloads
info@cask.co