1. 1© Cloudera, Inc. All rights reserved.
Driving Business Innovation and Value with Apache Spark
Wim Stoop
Senior PMM
@TheWimster
Sean Owen
Data Science Director
@sean_r_owen
3.
Boardroom thinking
• DRIVE CUSTOMER INSIGHTS
• IMPROVE PRODUCT & SERVICES EFFICIENCY
• LOWER BUSINESS RISK
4.
Common, key requirements
• Data Engineering
• Stream Processing
• Data Science & Machine Learning
5.
No ordinary processing
• Speed
• In memory vs disk
• Ease of use
• Develop in YOUR language
• Right tool for right job
• Iterative computations
6.
Apache Spark
Fast and flexible general purpose data processing for Hadoop
• Data Engineering
• Stream Processing
• Data Science & Machine Learning
Unified API and processing Engine for large scale data
7.
Spark at Cloudera
• More customers running Spark than all other vendors combined
• Over 280 customers
• Spark clusters upwards of 1200 nodes
• Diverse use cases across multiple industries
• Search personalization
• Genomics research
• Insurance modeling
• Advertising optimization
• Predictive modeling of disease conditions
8.
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Diagram: Cloudera Enterprise platform layers. OPERATIONS; DATA MANAGEMENT; INTEGRATE (STRUCTURED, UNSTRUCTURED); PROCESS, ANALYZE, SERVE (BATCH, STREAM, SQL, SEARCH, OTHER); UNIFIED SERVICES (RESOURCE MANAGEMENT, SECURITY); STORE (FILESYSTEM, RELATIONAL, NoSQL, OTHER)]
9.
Why Spark at Cloudera?
The Most Apache Spark Experience
[Diagram: platform components by layer]
• INTEGRATE: STRUCTURED (Sqoop); UNSTRUCTURED (Kafka, Flume)
• PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce); STREAM (Spark); SQL (Impala); SEARCH (Solr); SDK (Kite)
• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN); SECURITY (Sentry, RecordService)
• STORE: FILESYSTEM (HDFS); RELATIONAL (Kudu); NoSQL (HBase)
Cloudera is the “stress free” choice for Spark
• Support: Proactive Support for Spark workloads
• Expertise: Most Spark users trained. Robust development community.
• Experience: First to ship and support. Most customers running Spark of any commercial Hadoop distribution.
Cloudera lives where your data lives
• Run Spark On-prem or in the Public Cloud
Out-of-the-box ready for end to end use cases
• Spark with supported, seamless integrations with other big data tools (Kafka, HBase, Kudu, etc.)
Cloudera makes Spark enterprise hardened
• Comprehensive Management and Alerting
• End to End Security and Governance
• Better Multi-tenancy operation for multiple workloads
10.
The One Platform Initiative
• Management: Leverage Hadoop-native resource management
• Security: Full support for Hadoop security and beyond
• Scale: Spark at petabyte scale
• Streaming: Performance, simplification, and easy management of streaming workloads
• Cloud: Elastic transient workloads
11.
Spark from Cloudera
Source: Taneja Spark Survey, July 2016
12.
Spark Use Cases
Source: Taneja Spark Survey, July 2016
14.
New Unified API: RDD -> Dataset + DataFrame
RDDs
• Object Oriented
• Functional Operators
• map, reduceByKey, cogroup, etc.
• Compile-time Type Safety
DataFrames
• Structured
• Compact binary representation
• Query Optimizer
• Sort/shuffle without deserialization
Datasets
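The functional operators named on this slide (map, reduceByKey, cogroup) can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: `reduce_by_key` and `cogroup` are hypothetical helpers written here only to show what the corresponding RDD operators compute per key, and the sample data is invented.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Merge all values that share a key using func (semantics of reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def cogroup(left, right):
    """For every key in either input, collect (left values, right values)."""
    result = defaultdict(lambda: ([], []))
    for key, value in left:
        result[key][0].append(value)
    for key, value in right:
        result[key][1].append(value)
    return dict(result)


# Hypothetical sample data: page views and purchases keyed by user.
views = [("ann", 3), ("bob", 1), ("ann", 2)]
purchases = [("ann", 20.0), ("cy", 5.0)]

# map: transform every record independently (here, double each view count).
doubled = list(map(lambda kv: (kv[0], kv[1] * 2), views))

# reduceByKey: total views per user.
view_totals = reduce_by_key(views, lambda a, b: a + b)

# cogroup: line up both inputs by user, keeping users that appear in only one.
joined = cogroup(views, purchases)
```

In Spark the same operations run in parallel across partitions; the sketch only fixes the per-key result each operator is expected to produce.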
15.
Machine Learning Persistence
Save and Load Models and Pipelines
[Diagram: example pipeline stages: Bag of words; Tokenize; TF-IDF; LDA; Scale & Normalize Features; Train Classifier]
17.
Structured Streaming
• Streams modeled as continuous DataFrames
• SQL-like syntax for authoring stream processing
• Wide array of built-in aggregation and statistical functions
• Easier end-to-end exactly-once semantics
• Out-of-order data handling
• Increased performance
• Growing array of Streaming ML functionality
Spark Streaming 2.0
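Out-of-order data handling is easier to picture with a small example. The sketch below is plain Python, not the Structured Streaming API: it counts events per event-time window while records arrive out of order, and drops a record only once its window has fallen behind a lateness threshold. The window length, the threshold, the `count_by_window` helper, and the sample events are all assumptions chosen for illustration; the slide does not describe how Spark itself implements this.

```python
from collections import defaultdict

WINDOW_SECONDS = 10    # tumbling window length (assumed for illustration)
ALLOWED_LATENESS = 5   # how far behind the newest event a window may lag (assumed)


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_by_window(events):
    """Count events per event-time window.

    events: iterable of (event_time_seconds, payload) in ARRIVAL order,
    which may differ from event-time order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        # Too late: the record's whole window ended at or before the watermark.
        if window_start(event_time) + WINDOW_SECONDS <= watermark:
            dropped.append((event_time, payload))
            continue
        counts[window_start(event_time)] += 1
    return dict(counts), dropped


# Hypothetical arrivals: (8, "c") is late but accepted; (9, "f") arrives too late.
arrivals = [(1, "a"), (12, "b"), (8, "c"), (14, "d"), (27, "e"), (9, "f")]
counts, dropped = count_by_window(arrivals)
```

Grouping by the time an event occurred rather than the time it arrived is what lets a late record such as `(8, "c")` still land in the correct window.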
18.
Get the Spark 2.0 CDH Parcel
• Download beta parcel: http://www.cloudera.com/downloads/beta/spark2/2-0-0.html
• Read more at http://blog.cloudera.com/blog/2016/09/apache-spark-2-0-0-beta-now-available-for-cdh
20.
Data Engineering and Data Science in the Cloud
Across industries, data engineering and data science are a natural fit for the cloud:
● Data growth: More data being created in the cloud
● Transient workloads: Development/test, exploration; batch ETL, model training and scoring
● Flexibility: Optimize infrastructure for the job; self-service for data engineers, data scientists
● Lower TCO: Do more with less
21.
Requirements for Data Engineering and Science
Portability, flexibility, and an end-to-end enterprise platform
• Transience for flexibility, lower TCO and risk
• Unified platform, from ingest to insight and action
• Hybrid support for multiple environments
[Diagram labels: COMPUTE; STORE; Object Store]
22.
Director Provisioning: Cluster Lifecycle Management
Spin up, grow & shrink, terminate CDH clusters that read/write to object store
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
Flexible Deployments
• Multi-cloud: AWS, Azure, GCP
• Fast cluster deployments
• Scaling of CDH clusters
• Spot instance support
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Cloudera Director
23.
Data Engineering and Data Science
Two Common Workload Patterns
Batch Processing / ETL (also: Testing Environments)
Only pay for what you need, when you need it
▪ Transient clusters
▪ Single user
▪ Sized to demand
▪ Object storage centric
▪ Cloud-native deployment
Exploratory Data Science (also: Development Environments)
Explore and analyze all data, wherever it lives, on demand
▪ Transient or persistent
▪ Single or multi-user
▪ Elastic workload
▪ HDFS or object storage
▪ Lift-and-shift or cloud-native deployment
24.
Where Cloudera Director Plays in Cluster Management
[Diagram labels: Data Sources; Kafka/Flume; Spark Streaming; HBase or Impala/Kudu (beta); Real-Time Serving; Kafka; Application; S3; Hive/Spark/HoS; Impala; Analytics; Batch Data Transformations]
• Can be transient, managed with Cloudera Director.
• Permanent clusters. Can be deployed by Cloudera Director and managed by Cloudera Manager.
25.
Transient Use Case: ETL Pipeline Workflow in AWS
[Diagram: ETL Pipeline. Stage labels: Data Generation; Raw Data (IoT/Devices/Crawler, etc.); Ingest + query building; Query Creation; Query Builder; Query Store; Query Scheduling; Script/Scheduler; Query execution (Q1, Q2, …, Qn-1, Qn); BI, visualization, analysis. Component labels: CDH Dev Cluster (on-prem); CDH Production Cluster (AWS); Hive; Spark; MR2; HDFS; S3; Impala; Github; Trifacta/Paxata, etc.; Hue; Sense; Tableau]
27.
INSURANCE
Improve Products & Services Efficiency
» PRODUCT IMPROVEMENT
» CUSTOMIZED OFFERS
» RISK REDUCTION
• Comprehensive view of risk for 80 years of historical data across all 50 US states with EDH
• Faster data preparation and ETL using Cloudera with Spark
• Pricing models created 75x faster, resulting in timely and customized offers to customers
28.
360° View of Retail Customers / Behavior
• Many different data sources integrated (click streams, in-store POS, online ordering, and social media)
• Understanding of abandoned online shopping cart behavior
• Optimized operational investments by attributing revenue to the appropriate channel
• Increased customer insight informs supply chain plans
• Improved ability to explain and predict returns
30.
Spark Adoption
Source: Taneja Spark Survey, July 2016
31.
Mind the gap
reported barriers to adoption due to big data skills and training gaps
Source: Taneja Spark Survey, July 2016
32.
We’ve got you covered
Apache Spark Developer Training
Cloudera University’s three-day Spark course enables participants to build complete, unified big data applications.
Data Science at Scale with Spark and Hadoop
Spark and Hadoop are transforming how data scientists work by allowing interactive and iterative data analysis at scale.
Introduction to Machine Learning
The course provides an introduction to Machine Learning, including coverage of collaborative filtering, clustering, classification, algorithms, and data volume.
33.
All Training, All Online, All the Time
http://www.cloudera.com/training/ondemand-training.html
34.
Thank you
Wim Stoop
Senior PMM
@TheWimster
Sean Owen
Data Science Director
@sean_r_owen