1© Cloudera, Inc. All rights reserved.
How to build multi-disciplinary analytics applications on a shared data platform
Mark Donsky | Director, Product Management
Nikki Rouda | Director, Product Marketing
Challenges in data management
Many data silos, each requiring its own proprietary tools and infrastructure
Different vendors, products, and services on-premises versus in cloud
A fragmented approach is difficult, expensive, and risky
• SQL analytic databases
• NoSQL and real-time databases
• Data engineering and ETL environments
• Data warehouses and data marts
Traditional applications
• One data type
• One analytic function
• Hard to integrate
Diagram: five separate applications, each with its own STORAGE, SECURITY, GOVERNANCE, WORKLOAD MGMT, INGEST & REPLICATION, and DATA CATALOG layers:
• Data Exploration
• SQL & BI Analytics
• Operational Real-Time DB
• ETL & Data Processing
• Custom Functions
Negative consequences for your business
• Increased operational costs: many distinct environments to buy and build
• Increased staff overhead: many distinct tools to learn and support
• Increased security risks: many distinct frameworks to enforce
• Decreased business insights: narrow data sets and analytics rigidity
• Decreased business agility: outdated and limiting for applications
• Decreased governance capability: no common visibility across stores
Key design goals for today’s data management teams
• Support multi-function analytics
• Minimize time to add workloads
• Support elastic workloads
• Enable self-service
• Provide a scalable model for sharing data
• Reduce cost
• Increase tenant isolation
• Secure the environment
Traditional on-premises deployments perform reasonably well … but not good enough going forward
One physical cluster provides a shared data experience to multiple workloads and tenants
Diagram labels: Shared Data Experience (Metadata, Security, Governance); Shared Storage (HDFS, Kudu)
• Strong multi-function support
• Strong shared data experience
• Strong information security model
• Moderate cost management
• Moderate tenant isolation
• Moderate workload elasticity
• Weak on self service
• Weak on speed of deployment
Traditional cloud deployments are strong where on-premises is weak, but at the expense of creating workload silos
This is the experience of cloud house offerings… but not good enough going forward
Diagram labels: Shared Storage; Cloud
• Moderate multi-function support
• Weak on shared data experience
• Weak information security model
• Moderate cost management
• Strong on tenant isolation
• Strong on workload elasticity
• Strong on self service
• Strong on speed of deployment
In the beginning…
Today: One platform. Multiple workloads
Workloads: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
• Store and process unlimited data fast and cost-effectively: “Programmatic data processing and machine learning”
• Explore, analyze, and understand all your data: “Fast, flexible, open source parallel database”
• Build data-driven applications to deliver real-time insights: “Online applications, lambda/kappa architectures”
What is a workload?
Data + Data Context + Compute
Data Context:
• HMS: Schema definitions
• Sentry: Security authorizations
• Navigator: Audit logs
• Navigator: Business glossary
• Navigator: Business metadata
• Navigator: Lineage
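The "Data + Data Context + Compute" definition can be pictured as a small data model. The sketch below is illustrative only: the class names (`DataContext`, `Workload`), the field names, and the example bucket URI are hypothetical and do not correspond to any Cloudera API.

```python
from dataclasses import dataclass, field


@dataclass
class DataContext:
    """Hypothetical container for the context items listed on the slide."""
    schemas: dict = field(default_factory=dict)            # HMS: schema definitions
    authorizations: dict = field(default_factory=dict)     # Sentry: security authorizations
    audit_logs: list = field(default_factory=list)         # Navigator: audit logs
    business_glossary: dict = field(default_factory=dict)  # Navigator: business glossary
    business_metadata: dict = field(default_factory=dict)  # Navigator: business metadata
    lineage: list = field(default_factory=list)            # Navigator: lineage


@dataclass
class Workload:
    """Workload = Data + Data Context + Compute."""
    data_location: str    # assumption: data is identified by a storage URI
    context: DataContext
    compute: str          # assumption: a plain label for the processing engine


# Example values are made up for illustration.
ctx = DataContext(
    schemas={"trucks": ["vin", "odometer"]},
    authorizations={"analyst": ["SELECT"]},
)
etl = Workload("s3://example-bucket/trucks", ctx, "spark")
```

Keeping the context as its own object, rather than as fields on the workload, mirrors the slide's point that context is a distinct third ingredient next to data and compute.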
What about multiple workloads?
Cluster
Hive/HMS
Sentry
Navigator
Spark
Keys
HDFS, Kudu, S3, Private Cloud Storage
Data context with multiple workloads
• Traditional Hadoop clusters contain compute, data, and data context
• Transient Hadoop clusters contain compute and data context, but externalize data
Storage: HDFS, Kudu, S3, Private Cloud Storage
Why is data context stored in each cluster, and not alongside the data?
The data context consistency problem
Compute and data are becoming further separated
• Compute is stateless: cloud-based or on-prem, either transient or long-running
• Data is stateful: cloud-based or on-prem in HDFS, Kudu, S3, ADLS, Isilon, etc.
What about data context?
• Schema Definitions (Hive Metastore)
• Permissions (Apache Sentry)
• Encryption Keys (KMS)
• Governance (Cloudera Navigator)
Data context should be stateful, but currently is stateless
• This creates synchronization and usability challenges for admins and end users alike
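To make the synchronization challenge concrete, here is a minimal sketch of what happens when each cluster holds its own copy of the data context. The dictionary layout, the key naming (`schema:…`, `grant:…`), and the function `context_drift` are assumptions made for illustration; they are not how Hive Metastore or Sentry store their state.

```python
def context_drift(clusters):
    """Return {key: {cluster: value}} for every context key whose value
    is missing or different in at least one cluster."""
    all_keys = set()
    for ctx in clusters.values():
        all_keys.update(ctx)

    drift = {}
    for key in sorted(all_keys):
        values = {name: ctx.get(key) for name, ctx in clusters.items()}
        # More than one distinct value means the copies have diverged.
        if len({repr(v) for v in values.values()}) > 1:
            drift[key] = values
    return drift


# Two clusters over the same data, each with a private copy of the context.
clusters = {
    "etl-cluster": {"schema:trucks": ["vin", "odometer"], "grant:analyst": "SELECT"},
    "bi-cluster": {"schema:trucks": ["vin"], "grant:analyst": "SELECT"},
}

drift = context_drift(clusters)
print(drift)
```

Here the schema for `trucks` was updated in one cluster only, so users of the other cluster see a different table definition for the same underlying data.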
Solution: Shared Data Experience
Externalize data context services as a shared service
Benefits:
• Common schemas, access permissions, classifications, and governance across all workloads
• Reduced cost of ownership: less hardware and software to manage
• Increased end-user productivity: data is presented consistently in every cluster
• Faster expansion: admins don’t have to recreate data context services with each new cluster
Diagram labels: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE; KEYS, HMS, SENTRY, NAVIGATOR; HDFS, Kudu, S3, Private Cloud Storage
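The idea of externalizing data context can be sketched in a few lines: clusters hold a reference to one shared context rather than a private copy. The classes `SharedDataContext` and `Cluster` below are hypothetical stand-ins, not SDX interfaces, and the in-memory object only stands in for what would be a networked service.

```python
class SharedDataContext:
    """Hypothetical stand-in for externalized schema and permission services."""

    def __init__(self):
        self.schemas = {}
        self.grants = {}


class Cluster:
    """A compute cluster that looks up context instead of owning it."""

    def __init__(self, name, context):
        self.name = name
        self.context = context  # a reference to the shared context, not a copy

    def columns(self, table):
        return self.context.schemas[table]


shared = SharedDataContext()
shared.schemas["trucks"] = ["vin", "odometer"]

etl = Cluster("etl", shared)
bi = Cluster("bi", shared)

# A schema change is made once and is visible to every cluster.
shared.schemas["trucks"].append("engine_hours")
print(etl.columns("trucks") == bi.columns("trucks"))
```

A new cluster only needs a pointer to the existing context, which is the "faster expansion" benefit listed above.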
The modern platform for machine learning and analytics optimized for the cloud
Diagram labels (Cloudera Enterprise):
• EXTENSIBLE SERVICES, CORE SERVICES
• DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
• DATA CATALOG, INGEST & REPLICATION, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT
• STORAGE SERVICES: S3, ADLS, HDFS, KUDU
Two deployment options
Cloudera SDX
Cloudera SDX: Customer-managed
• RDS-backed Hive Metastore
• RDS-backed Apache Sentry
• Customer-managed Cloudera Navigator
Ideal for:
• Director-launched workloads
• CM-managed workloads
Cloudera Altus SDX: Cloudera-managed
• Serverless Hive Metastore
• Serverless Apache Sentry
• Serverless Cloudera Navigator
Ideal for:
• Altus SDX workloads
• Hybrid workloads
Cloud deployments with SDX optimize for all design goals
Shared Data Experience (Metadata, Security, Governance)
One logical cluster provides a shared data experience to multiple workloads and tenants
SDX makes it possible to transfer on-premises design wins to cloud
Shared Object Storage
Cloud
Strong multi-function support
Strong shared data experience
Strong information security model
Strong on cost management
Strong on tenant isolation
Strong on workload elasticity
Strong on self service
Strong on speed of deployment
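The three scorecards in this deck (traditional on-premises, traditional cloud, and cloud with SDX) can be put side by side. The sketch below only transcribes the Strong/Moderate/Weak ratings from the slides; the dictionary layout, the model labels, and the helper `goals_below_strong` are assumptions made for illustration.

```python
GOALS = [
    "multi-function support",
    "shared data experience",
    "information security model",
    "cost management",
    "tenant isolation",
    "workload elasticity",
    "self service",
    "speed of deployment",
]

# Ratings copied from the slides, in the same order as GOALS.
RATINGS = {
    "Traditional on-premises": [
        "Strong", "Strong", "Strong",
        "Moderate", "Moderate", "Moderate",
        "Weak", "Weak",
    ],
    "Traditional cloud": [
        "Moderate", "Weak", "Weak", "Moderate",
        "Strong", "Strong", "Strong", "Strong",
    ],
    "Cloud with SDX": ["Strong"] * 8,
}


def goals_below_strong(model):
    """List the design goals a deployment model does not rate 'Strong' on."""
    return [goal for goal, rating in zip(GOALS, RATINGS[model]) if rating != "Strong"]


for model in RATINGS:
    print(f"{model}: {goals_below_strong(model)}")
```

Laid out this way, the on-premises and cloud lists are close to complements of each other, with cost management the only goal both rate as Moderate.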
Positive business outcomes
• Increased business insights: diverse data together with analytics flexibility
• Increased business agility: modern and nimble application innovation
• Increased governance capability: one common viewpoint and store
• Decreased operational costs: one environment for all needs
• Decreased staff overhead: one set of controls for everything
• Decreased security risks: comprehensive controls everywhere
Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime
• OnCommand Connection is collecting telematics and geolocation data across the fleet
• Reduced maintenance costs to $0.03 per mile from $0.12-$0.15 per mile
• Centralizing data from 13 systems with varying frequency and semantic definitions
• Real-time visibility of 300,000+ trucks in order to improve uptime and vehicle performance
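As a quick check on the maintenance-cost figures, the drop from $.12-$.15 to $.03 per mile can be expressed as a percentage. This is plain arithmetic on the numbers quoted on the slide; the function name `percent_reduction` is just a label chosen for this sketch.

```python
from decimal import Decimal


def percent_reduction(before, after):
    """Percentage drop from `before` to `after` (both in dollars per mile)."""
    return (before - after) / before * 100


after = Decimal("0.03")
low = percent_reduction(Decimal("0.12"), after)   # against the $.12 baseline
high = percent_reduction(Decimal("0.15"), after)  # against the $.15 baseline

print(f"{low:.0f}% to {high:.0f}% lower cost per mile")  # prints "75% to 80% lower cost per mile"
```

`Decimal` is used instead of floats so the cent values are represented exactly.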
Thank you
Mark Donsky (@markdonsky)
Nikki Rouda (@nrouda)
