Driving Business Innovation and Value with Apache Spark

© Cloudera, Inc. All rights reserved.
Wim Stoop
Senior PMM
@TheWimster
Sean Owen
Data Science Director
@sean_r_owen

Our relationship with data is changing

Boardroom thinking
DRIVE CUSTOMER INSIGHTS
IMPROVE PRODUCT & SERVICES EFFICIENCY
LOWER BUSINESS RISK

Common, key requirements
Data Engineering
Stream Processing
Data Science & Machine Learning

No ordinary processing
• Speed
• In memory vs disk
• Ease of use
• Develop in YOUR language
• Right tool for right job
• Iterative computations

Apache Spark
Fast and flexible general purpose data processing for Hadoop
Data Engineering
Stream Processing
Data Science & Machine Learning
Unified API and processing Engine for large scale data

Spark at Cloudera
• More customers running Spark than all other vendors combined
• Over 280 customers
• Spark clusters upwards of 1200 nodes
• Diverse use cases across multiple industries
• Search personalization
• Genomics research
• Insurance modeling
• Advertising optimization
• Predictive modeling of disease conditions

Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Diagram: platform layers. PROCESS, ANALYZE, SERVE (Batch, Stream, SQL, Search, Other); UNIFIED SERVICES (Resource Management, Security); STORE (Filesystem, Relational, NoSQL, Other); INTEGRATE (Structured, Unstructured); alongside: Operations, Data Management]

Why Spark at Cloudera?
The Most Apache Spark Experience
[Diagram: platform components by layer]
INTEGRATE: Structured (Sqoop), Unstructured (Kafka, Flume)
STORE: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase)
UNIFIED SERVICES: Resource Management (YARN), Security (Sentry, RecordService)
PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr), SDK (Kite)
Cloudera is the “stress free” choice for Spark
• Support: Proactive Support for Spark workloads
• Expertise: Most Spark users trained. Robust development community.
• Experience: First to ship and support. Most customers running Spark of any commercial Hadoop distribution.
Cloudera lives where your data lives
• Run Spark On-prem or in the Public Cloud
Out-of-the-box ready for end to end use cases
• Spark with supported, seamless integrations with other big data tools (Kafka, HBase, Kudu, etc.)
Cloudera makes Spark enterprise hardened
• Comprehensive Management and Alerting
• End to End Security and Governance
• Better Multi-tenancy operation for multiple workloads

The One Platform Initiative
Management
Leverage Hadoop-native resource management
Security
Full support for Hadoop security and beyond
Scale
Spark at petabyte scale
Streaming
Performance, simplification, and easy management of streaming workloads
Cloud
Elastic transient workloads

Spark from Cloudera
Source: Taneja Spark Survey, July 2016

Spark Use Cases

New in Spark 2.0

New Unified API: RDD -> Dataset + DataFrame
RDDs
• Object Oriented
• Functional Operators
• map, reduceByKey, cogroup, etc.
• Compile-time Type Safety
DataFrames
• Structured
• Compact binary representation
• Query Optimizer
• Sort/shuffle without deserialization
Datasets
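
The functional operators listed under RDDs can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: the helper `reduce_by_key` is a hypothetical stand-in that mimics the map and reduceByKey pattern on an in-memory list, whereas Spark would run the same two steps across partitions.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Group (key, value) pairs by key and fold each group's values with func.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


lines = ["spark makes data fast", "spark makes streaming easy"]

# "map" step: emit a (word, 1) pair for every word in every line
pairs = [(word, 1) for line in lines for word in line.split()]

# "reduceByKey" step: sum the counts for each distinct word
word_counts = reduce_by_key(pairs, lambda a, b: a + b)
print(word_counts)
```

The same word count expressed against a DataFrame or Dataset would be declarative (group by word, count), which is what lets a query optimizer plan the work.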

Machine Learning Persistence
Save and Load Models and Pipelines
[Diagram: example pipeline stages: Tokenize, Bag of words, TF-IDF, LDA, Scale & Normalize Features, Train Classifier]
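
The idea of saving and loading a pipeline can be sketched in plain Python. This is not Spark's persistence API; it only illustrates the principle that a pipeline is an ordered list of stages plus their parameters, so it can be written to disk and restored to produce identical results. The stage names, parameters, and the JSON file layout are all assumptions made for this example.

```python
import json
import os
import tempfile
from collections import Counter


# Hypothetical stage functions; names and parameters are illustrative only.
def tokenize(text, lowercase=True):
    return (text.lower() if lowercase else text).split()


def bag_of_words(tokens, min_length=1):
    return dict(Counter(t for t in tokens if len(t) >= min_length))


STAGES = {"tokenize": tokenize, "bag_of_words": bag_of_words}


def save_pipeline(pipeline, path):
    """Write the stage names and parameters to a JSON file."""
    with open(path, "w") as f:
        json.dump(pipeline, f)


def load_pipeline(path):
    """Read the stage names and parameters back from a JSON file."""
    with open(path) as f:
        return json.load(f)


def run_pipeline(pipeline, data):
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in pipeline:
        data = STAGES[stage["name"]](data, **stage["params"])
    return data


pipeline = [
    {"name": "tokenize", "params": {"lowercase": True}},
    {"name": "bag_of_words", "params": {"min_length": 3}},
]

path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
save_pipeline(pipeline, path)
restored = load_pipeline(path)

text = "Spark makes big data fast and Spark is easy"
before = run_pipeline(pipeline, text)
after = run_pipeline(restored, text)
print(before == after)
```

Persisting the configuration rather than the code is what makes it possible to build a pipeline in one session or language and reuse it in another.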

Structured Streaming
Spark Streaming 2.0

Structured Streaming
• Streams modeled as continuous DataFrames
• SQL-like syntax to author streaming processing
• Wide array of built-in aggregation and statistical functions
• Easier end-to-end exactly-once semantics
• Out-of-order data handling
• Increased performance
• Growing array of Streaming ML functionality
Spark Streaming 2.0
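
Out-of-order data handling comes down to grouping records by the time an event happened rather than the time it arrived. The following is a minimal plain-Python sketch of that idea, not the Structured Streaming API. The function `windowed_counts`, the 60-second window, and the 30-second lateness threshold are all assumptions chosen for illustration, and Spark's own handling of late data differs in detail.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window length (assumed value)
ALLOWED_LATENESS = 30  # how far behind the newest event a record may lag (assumed value)


def windowed_counts(events):
    """Count events per event-time window, dropping records that arrive too late.

    `events` is an iterable of (event_time_seconds, key) in arrival order.
    Hypothetical helper for illustration; it is not part of Spark.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        threshold = max_event_time - ALLOWED_LATENESS
        if event_time < threshold:
            dropped.append((event_time, key))  # older than the lateness threshold
            continue
        # Assign the record to the window that contains its event time
        window_start = event_time - event_time % WINDOW_SECONDS
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# The record at t=50 arrives after t=70 but is still counted in window 0;
# the record at t=20 arrives after t=130 and is dropped as too late.
events = [(5, "click"), (70, "click"), (50, "click"), (130, "click"), (20, "click")]
counts, dropped = windowed_counts(events)
print(counts)
print(dropped)
```

Bounding how late a record may be is the design choice that keeps state finite: without a threshold, every window would have to stay open indefinitely.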

Get the Spark 2.0 CDH Parcel
• Download beta parcel: http://www.cloudera.com/downloads/beta/spark2/2-0-0.html
• Read more at http://blog.cloudera.com/blog/2016/09/apache-spark-2-0-0-beta-now-available-for-cdh

Spark in the Cloud

Data Engineering and Data Science in the Cloud
Across industries, data engineering and data science are a natural fit for the cloud:
● Data growth: More data being created in the cloud
● Transient workloads: Development/test, exploration; batch ETL, model training and scoring
● Flexibility: Optimize infrastructure for the job; self-service for data engineers, data scientists
● Lower TCO: Do more with less

Requirements for Data Engineering and Science
Portability, flexibility, and an end-to-end enterprise platform
• Transience for flexibility, lower TCO and risk
• Unified platform, from ingest to insight and action
• Hybrid support for multiple environments
[Diagram labels: Object Store, STORE, COMPUTE]

Director Provisioning: Cluster Lifecycle Management
Spin up, grow & shrink, terminate CDH clusters that read/write to object store
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
Flexible Deployments
• Multi-cloud: AWS, Azure, GCP
• Fast cluster deployments
• Scaling of CDH clusters
• Spot instance support
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Cloudera Director

Data Engineering and Data Science
Two Common Workload Patterns
Batch Processing / ETL (also: Testing Environments)
Only pay for what you need, when you need it
▪ Transient clusters
▪ Single user
▪ Sized to demand
▪ Object storage centric
▪ Cloud-native deployment
Exploratory Data Science (also: Development Environments)
Explore and analyze all data, wherever it lives, on demand
▪ Transient or persistent
▪ Single or multi-user
▪ Elastic workload
▪ HDFS or object storage
▪ Lift-and-shift or cloud-native deployment

Where Cloudera Director Plays in Cluster Management
[Diagram components: Data Sources; Kafka/Flume; Spark Streaming; HBase or Impala/Kudu (beta); Kafka; Real-Time Serving Application; S3; Hive/Spark/HoS; Impala; Analytics]
Batch Data Transformations: can be transient, managed with Cloudera Director.
Permanent clusters: can be deployed by Cloudera Director and managed by Cloudera Manager.

Transient Use Case: ETL Pipeline Workflow in AWS
[Diagram: ETL pipeline workflow]
Stages: Data Generation; Query Creation; Query Scheduling; Ingest + query building; Query execution; BI, visualization, analysis
Components: Raw Data (IoT/Devices/Crawler, etc.); Trifacta/Paxata, etc.; Query Builder; Query Store; Github; Script/Scheduler; ETL Pipeline (Q1 Q2 … Qn-1 Qn)
CDH Dev Cluster (on-prem): Hive, Spark, MR2, HDFS
CDH Production Cluster (AWS): Hive, Spark, MR2, HDFS, S3, Impala
Tools: Hue, Spark, Sense, Hive, Tableau

Customer Use Cases

INSURANCE
Improve Products & Services Efficiency
» PRODUCT IMPROVEMENT
» CUSTOMIZED OFFERS
» RISK REDUCTION
• Comprehensive view of risk for 80 years of historical data across all 50 US states with EDH
• Faster data preparation and ETL using Cloudera with Spark
• Reduced the time to create pricing models by 75x, resulting in timely and customized offers to customers

360° View of Retail Customers / Behavior
• Many different data sources integrated (click streams, in-store POS, online ordering, and social media)
• Understanding of abandoned online shopping cart behavior
• Optimized operational investments by attributing revenue to the appropriate channel
• Increased customer insight informs supply chain plans
• Improved ability to explain and predict returns

Cloudera Spark EMEA Customers

Spark Adoption

Mind the gap
reported barriers to adoption due to
big data skills and training gaps

We’ve got you covered
Apache Spark Developer Training
Cloudera University’s three-day Spark course enables participants to build complete, unified big data applications.
Data Science at Scale with Spark and Hadoop
Spark and Hadoop are transforming how data scientists work by allowing interactive and iterative data analysis at scale.
Introduction to Machine Learning
The course provides an introduction to Machine Learning, including coverage of collaborative filtering, clustering, classification, algorithms, and data volume.

All Training, All Online, All the Time
http://www.cloudera.com/training/ondemand-training.html

Thank you
