1© Cloudera, Inc. All rights reserved.
Driving Business Innovation and Value with Apache Spark
Wim Stoop
Senior PMM
@TheWimster
Sean Owen
Data Science Director
@sean_r_owen
2© Cloudera, Inc. All rights reserved.
Our relationship with data is changing
3© Cloudera, Inc. All rights reserved.
Boardroom thinking
• DRIVE CUSTOMER INSIGHTS
• IMPROVE PRODUCT & SERVICES EFFICIENCY
• LOWER BUSINESS RISK
4© Cloudera, Inc. All rights reserved.
Common, key requirements
• Data Engineering
• Stream Processing
• Data Science & Machine Learning
5© Cloudera, Inc. All rights reserved.
No ordinary processing
• Speed
• In memory vs disk
• Ease of use
• Develop in YOUR language
• Right tool for right job
• Iterative computations
6© Cloudera, Inc. All rights reserved.
Apache Spark
Fast and flexible general-purpose data processing for Hadoop
• Data Engineering
• Stream Processing
• Data Science & Machine Learning
Unified API and processing engine for large-scale data
7© Cloudera, Inc. All rights reserved.
Spark at Cloudera
• More customers running Spark than all other vendors combined
• Over 280 customers
• Spark clusters upwards of 1200 nodes
• Diverse use cases across multiple industries
• Search personalization
• Genomics research
• Insurance modeling
• Advertising optimization
• Predictive modeling of disease conditions
8© Cloudera, Inc. All rights reserved.
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Diagram: Cloudera Enterprise platform layers]
• OPERATIONS and DATA MANAGEMENT
• INTEGRATE: STRUCTURED, UNSTRUCTURED
• PROCESS, ANALYZE, SERVE: BATCH, STREAM, SQL, SEARCH, OTHER
• UNIFIED SERVICES: RESOURCE MANAGEMENT, SECURITY
• STORE: FILESYSTEM, RELATIONAL, NoSQL, OTHER
9© Cloudera, Inc. All rights reserved.
Why Spark at Cloudera?
The Most Apache Spark Experience
[Diagram: platform components by layer]
• INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)
• PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), SDK (Kite)
• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService)
• STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase)
Cloudera is the “stress free” choice for Spark
• Support: Proactive support for Spark workloads
• Expertise: Most Spark users trained. Robust development community.
• Experience: First to ship and support. Most customers running Spark of any commercial Hadoop distribution.
Cloudera lives where your data lives
• Run Spark on-prem or in the public cloud
Out-of-the-box ready for end-to-end use cases
• Supported, seamless Spark integrations with other big-data tools (Kafka, HBase, Kudu, etc.)
Cloudera makes Spark enterprise hardened
• Comprehensive management and alerting
• End-to-end security and governance
• Better multi-tenancy operation for multiple workloads
10© Cloudera, Inc. All rights reserved.
The One Platform Initiative
• Management: Leverage Hadoop-native resource management
• Security: Full support for Hadoop security and beyond
• Scale: Spark at petabyte scale
• Streaming: Performance, simplification, and easy management of streaming workloads
• Cloud: Elastic transient workloads
11© Cloudera, Inc. All rights reserved.
Spark from Cloudera
Source: Taneja Spark Survey, July 2016
12© Cloudera, Inc. All rights reserved.
Spark Use Cases
Source: Taneja Spark Survey, July 2016
13© Cloudera, Inc. All rights reserved.
New in Spark 2.0
14© Cloudera, Inc. All rights reserved.
New Unified API: RDD -> Dataset + DataFrame
RDDs
• Object Oriented
• Functional Operators
• map, reduceByKey, cogroup, etc.
• Compile-time Type Safety
DataFrames
• Structured
• Compact binary representation
• Query Optimizer
• Sort/shuffle without deserialization
Datasets
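The contrast on this slide can be sketched in plain Python. This is not Spark code and does not use the Spark API: `reduce_by_key` and `group_sum` are hypothetical local helpers, and the `sales` records are invented for illustration. The first style passes opaque functions over objects, as with the RDD operators named above; the second describes the query with column names, which is the kind of structure a query optimizer can inspect.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records: (product, amount) pairs.
sales = [("book", 12.0), ("pen", 1.5), ("book", 8.0), ("pen", 2.5), ("lamp", 30.0)]


def reduce_by_key(pairs, combine):
    """Merge the values of each key with a binary function.

    A local, single-process imitation of the reduceByKey idea;
    the name is borrowed from the slide, the implementation is ours.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(combine, values) for key, values in grouped.items()}


# RDD style: arbitrary functions over objects. An engine cannot look
# inside the lambdas, so it has to run them as written.
rdd_style = reduce_by_key(
    map(lambda rec: (rec[0], rec[1]), sales),
    lambda a, b: a + b,
)


def group_sum(rows, key_col, value_col):
    """Sum value_col per key_col; the query is described by column names."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_col]] += row[value_col]
    return dict(totals)


# DataFrame style: rows with named columns plus a declarative aggregate.
rows = [{"product": p, "amount": a} for p, a in sales]
df_style = group_sum(rows, key_col="product", value_col="amount")

print(rdd_style)
print(df_style)
```

Both styles produce the same totals; the difference is how much of the computation is visible to the engine as data rather than as opaque code.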
15© Cloudera, Inc. All rights reserved.
Machine Learning Persistence
Save and Load Models and Pipelines
[Diagram labels: Bag of words; Tokenize; TF-IDF; LDA; Scale & Normalize Features; Train Classifier]
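To make "save and load models and pipelines" concrete, here is a minimal sketch in plain Python, assuming a toy two-stage pipeline. It does not use Spark's ML library: the `Tokenizer` and `BagOfWords` classes, the JSON layout, and the `save`/`load` helpers are all hypothetical stand-ins. The point it illustrates is that a fitted pipeline is restored with its learned state (here, a vocabulary) and gives the same output as before it was saved.

```python
import json
import os
import tempfile


class Tokenizer:
    """Stateless stage: lowercases and splits on whitespace."""

    def transform(self, text):
        return text.lower().split()

    def to_dict(self):
        return {"stage": "Tokenizer"}


class BagOfWords:
    """Fitted stage: counts tokens against a learned vocabulary."""

    def __init__(self, vocabulary=None):
        self.vocabulary = vocabulary or []

    def fit(self, token_lists):
        self.vocabulary = sorted({tok for toks in token_lists for tok in toks})
        return self

    def transform(self, tokens):
        return [tokens.count(word) for word in self.vocabulary]

    def to_dict(self):
        return {"stage": "BagOfWords", "vocabulary": self.vocabulary}


# Registry used to rebuild each stage from its saved description.
STAGES = {
    "Tokenizer": lambda d: Tokenizer(),
    "BagOfWords": lambda d: BagOfWords(d["vocabulary"]),
}


def run(stages, text):
    """Feed the input through every stage in order."""
    data = text
    for stage in stages:
        data = stage.transform(data)
    return data


def save(stages, path):
    with open(path, "w") as f:
        json.dump([s.to_dict() for s in stages], f)


def load(path):
    with open(path) as f:
        return [STAGES[d["stage"]](d) for d in json.load(f)]


# Fit on two invented documents, then persist and restore the pipeline.
docs = ["Spark makes streaming simple", "Spark makes models portable"]
tokenizer = Tokenizer()
bow = BagOfWords().fit([tokenizer.transform(d) for d in docs])
pipeline = [tokenizer, bow]

path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
save(pipeline, path)
restored = load(path)

before = run(pipeline, "spark makes spark simple")
after = run(restored, "spark makes spark simple")
print(before == after)
```

Storing each stage's parameters rather than pickling the objects keeps the saved file readable and independent of the process that wrote it, which is one plausible reason to persist pipelines this way.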
16© Cloudera, Inc. All rights reserved.
Structured Streaming
Spark Streaming 2.0
17© Cloudera, Inc. All rights reserved.
Structured Streaming
• Streams modeled as continuous DataFrames
• SQL-like syntax for authoring stream processing
• Wide array of built-in aggregation and statistical functions
• Easier end-to-end exactly-once semantics
• Out-of-order data handling
• Increased performance
• Growing array of streaming ML functionality
Spark Streaming 2.0
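The "out-of-order data handling" bullet can be illustrated with a small plain-Python sketch of event-time windows and a lateness threshold. This is an assumption-laden illustration, not the Structured Streaming API: `windowed_counts`, `WINDOW`, `MAX_LATENESS`, and the sample events are all hypothetical.

```python
from collections import defaultdict

WINDOW = 10        # window length in seconds (assumed value)
MAX_LATENESS = 5   # how far behind the newest event we still accept data (assumed value)


def windowed_counts(events):
    """Count events per event-time window, tolerating out-of-order arrival.

    events: iterable of (event_time, key) in ARRIVAL order.
    Returns (counts, dropped), where counts maps (window_start, key) -> count
    and dropped lists events that arrived after their window was closed.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = float("-inf")
    for event_time, key in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - MAX_LATENESS
        window_start = (event_time // WINDOW) * WINDOW
        # A window is closed once the watermark has passed its end.
        if window_start + WINDOW <= watermark:
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Event 8 arrives after event 12 but is still counted in window [0, 10);
# event 9 arrives after event 27 and is too late.
arrivals = [(1, "a"), (4, "a"), (12, "a"), (8, "a"), (27, "a"), (9, "a")]
counts, dropped = windowed_counts(arrivals)
print(counts)
print(dropped)
```

Grouping by the time carried in the event, rather than by arrival time, is what lets a late record land in the correct window; the lateness threshold bounds how long window state has to be kept.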
18© Cloudera, Inc. All rights reserved.
Get the Spark 2.0 CDH Parcel
• Download beta parcel: http://www.cloudera.com/downloads/beta/spark2/2-0-0.html
• Read more at http://blog.cloudera.com/blog/2016/09/apache-spark-2-0-0-beta-now-available-for-cdh
19© Cloudera, Inc. All rights reserved.
Spark in the Cloud
20© Cloudera, Inc. All rights reserved.
Data Engineering and Data Science in the Cloud
Across industries, data engineering and
data science are a natural fit for the cloud:
● Data growth: More data being created in the cloud
● Transient workloads: Development/test, exploration;
batch ETL, model training and scoring
● Flexibility: Optimize infrastructure for the job;
self-service for data engineers, data scientists
● Lower TCO: Do more with less
21© Cloudera, Inc. All rights reserved.
Requirements for Data Engineering and Science
Portability, flexibility, and an end-to-end enterprise platform
• Transience for flexibility, lower TCO and risk
• Unified platform, from ingest to insight and action
• Hybrid support for multiple environments
[Diagram labels: COMPUTE; STORE; Object Store]
22© Cloudera, Inc. All rights reserved.
Director Provisioning: Cluster Lifecycle Management
Spin up, grow & shrink, terminate CDH clusters that read/write to object store
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
Flexible Deployments
• Multi-cloud: AWS, Azure, GCP
• Fast cluster deployments
• Scaling of CDH clusters
• Spot instance support
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Cloudera Director
23© Cloudera, Inc. All rights reserved.
Data Engineering and Data Science
Two Common Workload Patterns
Batch Processing / ETL (also: Testing Environments)
Only pay for what you need, when you need it
▪ Transient clusters
▪ Single user
▪ Sized to demand
▪ Object storage centric
▪ Cloud-native deployment
Exploratory Data Science (also: Development Environments)
Explore and analyze all data, wherever it lives, on demand
▪ Transient or persistent
▪ Single or multi-user
▪ Elastic workload
▪ HDFS or object storage
▪ Lift-and-shift or cloud-native deployment
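The "only pay for what you need" argument for transient clusters comes down to simple arithmetic, sketched below. Every figure here is hypothetical and chosen only to keep the numbers round (node count, hourly rate, and job length are not from the deck or from any price list), and `cluster_cost` is an illustrative helper.

```python
def cluster_cost(nodes, hours_per_day, days, rate_per_node_hour):
    """Total cost of running `nodes` machines for the given hours each day."""
    return nodes * hours_per_day * days * rate_per_node_hour


# All figures below are hypothetical, picked only to make the arithmetic easy.
NODES = 20
RATE = 0.50   # currency units per node-hour (assumed)
DAYS = 30

persistent = cluster_cost(NODES, 24, DAYS, RATE)  # cluster left running all day
transient = cluster_cost(NODES, 3, DAYS, RATE)    # cluster up only for a 3-hour nightly ETL run

print(persistent, transient, persistent / transient)
```

Under these assumed numbers the always-on cluster costs eight times as much as the transient one. The saving depends on the data living in object storage, so that the cluster can be terminated without losing anything.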
24© Cloudera, Inc. All rights reserved.
Where Cloudera Director Plays in Cluster Management
[Diagram labels: Data Sources; Kafka/Flume; Spark Streaming; HBase or Impala/Kudu (beta); Real-Time Serving; Kafka; Application; S3; Hive/Spark/HoS; Impala; Analytics]
• Batch data transformations: can be transient, managed with Cloudera Director.
• Permanent clusters: can be deployed by Cloudera Director and managed by Cloudera Manager.
25© Cloudera, Inc. All rights reserved.
Transient Use Case: ETL Pipeline Workflow in AWS
[Diagram: ETL pipeline workflow]
• Pipeline stages: Ingest + query building; Query execution; BI, visualization, analysis
• Data Generation: Raw data from IoT/devices/crawler, etc.
• Query Creation: Query Builder, Query Store; labels include Trifacta/Paxata, etc., GitHub
• Query Scheduling: Script/Scheduler running queries Q1, Q2, … Qn-1, Qn
• CDH Dev Cluster (on-prem): Hive, Spark, MR2, HDFS
• CDH Production Cluster (AWS): Hive, Spark, MR2, Impala, HDFS, S3
• Tools shown: Hue, Spark, Sense, Hive, Tableau
26© Cloudera, Inc. All rights reserved.
Customer Use Cases
27© Cloudera, Inc. All rights reserved.
INSURANCE: Improve Products & Services Efficiency
• Comprehensive view of risk for 80 years of historical data across all 50 US states with EDH
• Faster data preparation and ETL using Cloudera with Spark
• Cut the time to create pricing models by 75x, resulting in timely and customized offers to customers
» PRODUCT IMPROVEMENT
» CUSTOMIZED OFFERS
» RISK REDUCTION
28© Cloudera, Inc. All rights reserved.
360° View of Retail Customers / Behavior
• Many different data sources integrated
(click streams, in-store POS, online
ordering, and social media)
• Understanding of abandoned online
shopping cart behavior
• Optimized operational investments by
attributing revenue to the appropriate
channel
• Increased customer insight informs
supply chain plans
• Improved ability to explain and predict
returns
29© Cloudera, Inc. All rights reserved.
Cloudera Spark EMEA Customers
30© Cloudera, Inc. All rights reserved.
Spark Adoption
Source: Taneja Spark Survey, July 2016
31© Cloudera, Inc. All rights reserved.
Mind the gap
reported barriers to adoption due to
big data skills and training gaps
Source: Taneja Spark Survey, July 2016
32© Cloudera, Inc. All rights reserved.
We’ve got you covered
• Apache Spark Developer Training: Cloudera University’s three-day Spark course enables participants to build complete, unified big data applications.
• Data Science at Scale with Spark and Hadoop: Spark and Hadoop are transforming how data scientists work by allowing interactive and iterative data analysis at scale.
• Introduction to Machine Learning: The course provides an introduction to Machine Learning, including coverage of collaborative filtering, clustering, classification, algorithms, and data volume.
33© Cloudera, Inc. All rights reserved.
All Training, All Online, All the Time
http://www.cloudera.com/training/ondemand-training.html
34© Cloudera, Inc. All rights reserved.
Thank you
Wim Stoop
Senior PMM
@TheWimster
Sean Owen
Data Science Director
@sean_r_owen
