1© Cloudera, Inc. All rights reserved.
Uniting Spark & Hadoop
The One Platform Initiative
Doug Cutting | Chief Architect | Cloudera
Anand Iyer | Senior Product Manager | Cloudera
Agenda
• Emergence of Spark
• Advantages of Spark
• Spark replacing MapReduce
• The “One Platform Initiative”
• Future of Data Processing in Hadoop
MapReduce: A great tool for its day
The original scalable, general-purpose processing engine of the Hadoop ecosystem
- Useful across diverse problem domains
- Fueled initial ecosystem explosion
MapReduce
Execution Engine
Hive Pig Mahout Solr Crunch
Enter Apache Spark
General-purpose computational framework that
substantially improves on MapReduce
Key Properties:
• Leverages distributed memory
• Full directed-graph expressions for
data-parallel computations
• Simpler developer experience
Yet Retains:
• Linear scalability
• Fault-tolerance
• Data locality-based computations
Apache Spark
Flexible, in-memory data processing for Hadoop
Easier
• Rich APIs for Scala, Java, and Python
• Interactive shell
More Powerful
• APIs for different types of workloads:
- Batch
- Streaming
- Machine Learning
- Graph
Faster
• In-memory processing and caching
Easy Development
High Productivity Language Support
• Native support for multiple
languages with identical APIs
• Scala, Java, Python
• Use of closures, iterations, and
other modern language
constructs to minimize code
• 2-5x less code
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("ERROR");
    }
}).count();
Easy Development
Use Interactively
• Interactive exploration of data
for data scientists
• No need to develop
“applications”
• Developers can prototype
applications on a live system
$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/
Using Scala version 2.10.4
(Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...
scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] =
MapPartitionsRDD[1] at textFile at <console>:21
scala> words.count
...
res0: Long = 235886
scala>
Spark Takes Advantage of Memory
Resilient Distributed Datasets (RDD)
• Memory caching layer that stores data in a distributed, fault-tolerant
cache
• Can fall back to disk when a data set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault tolerance through the concept of lineage
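The lineage idea in the last bullet can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's implementation: `ToyDataset`, `from_source`, and `evict` are hypothetical names invented for illustration, and "stable storage" is modelled as an in-process list. Each dataset remembers the recipe that produced it, so a lost cache can be rebuilt from its parent.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical class, not a Spark API)."""

    def __init__(self, compute: Callable[[], list], lineage: List[str]):
        self._compute = compute              # recipe that rebuilds the data
        self.lineage = lineage               # readable chain of transformations
        self._cache: Optional[list] = None   # in-memory copy, may be lost

    @classmethod
    def from_source(cls, records: list) -> "ToyDataset":
        # Assumption: stable storage is just a Python list here.
        return cls(lambda: list(records), ["source"])

    def map(self, fn: Callable) -> "ToyDataset":
        # Lazy: nothing is computed until collect() or count() is called.
        return ToyDataset(lambda: [fn(x) for x in self.collect()],
                          self.lineage + ["map"])

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: [x for x in self.collect() if pred(x)],
                          self.lineage + ["filter"])

    def cache(self) -> "ToyDataset":
        self._cache = self._compute()
        return self

    def evict(self) -> None:
        # Simulates losing the cached copy, for example after a node failure.
        self._cache = None

    def collect(self) -> list:
        if self._cache is not None:
            return list(self._cache)
        return self._compute()               # recompute from lineage

    def count(self) -> int:
        return len(self.collect())


logs = ToyDataset.from_source(["INFO ok", "ERROR disk", "ERROR net"])
errors = logs.filter(lambda s: "ERROR" in s).cache()
errors.evict()                               # cached copy is gone
print(errors.lineage, errors.count())        # rebuilt from the source via lineage
```

The design point is that only the recipe has to survive a failure, not the data itself, which is why no replication of the cached data is needed in this model.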
The Spark Ecosystem & Hadoop
Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
STORAGE
HDFS, HBase
RESOURCE MANAGEMENT
YARN
Spark, Impala, MR, Search, Others
Cloudera is Driving the Spark Movement
2013 2014 2015 2016
Identified Spark’s
early potential
Ships and
Supports
Spark with
CDH 4.4
Added Spark on
YARN integration
Announces initiative to
make Spark the standard
execution engine
Launches first
Spark training
Added security
integration
Cloudera engineers
publish O’Reilly Spark
book
Driving effort to
further performance,
usability, and
enterprise-readiness
Spark at Cloudera
• Cloudera was the first Hadoop vendor to ship and support Spark
• Spark is a fully integrated part of Cloudera’s platform
- Shared data, metadata, resource management, administration, security, and governance
- Complements specialized analytic tools for a comprehensive big data platform
• Cloudera is the first Hadoop vendor to offer Spark training
- Trained more customers than any other vendor
- Most popular training course
• Cloudera has 5x the engineering resources of the next competitor
- Most committers on staff and most changes contributed
- Well-trained staff across the globe with expertise implementing a broad range of Spark use cases
Cloudera’s Engineering Commitment to Spark
Spark Committers by Hadoop Distribution*
• Cloudera: 57%
• Intel: 29%
• Hortonworks: 14%
* IBM and MapR have 0 committers
Spark Patches by Hadoop Distribution
• Cloudera: 451
• Hortonworks: 10
• IBM: 29
• MapR: 1
• Intel: 466
Cloudera Customers
• More customers running Spark than all other vendors combined
- Over 150 customers
- Spark clusters as large as 800 nodes
• Diverse range of use cases across multiple industries
- Search personalization
- Genomics research
- Insurance modeling
- Advertising optimization
- Predictive modeling of disease conditions
Cloudera Customer Use Cases
Core Spark
Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data
Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets
ERP
• Optical Character Recognition and Bill Classification
Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics
Spark Streaming
Financial Services
• Online Fraud Detection
Health
• Incident Prediction for Sepsis
Retail
• Online Recommendation Systems
• Real-Time Inventory Management
Ad Tech
• Real-Time Ad Performance Analysis
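For readers unfamiliar with the Jaccard score mentioned in the Core Spark health use case: it measures the overlap of two sets as |A ∩ B| / |A ∪ B|. The sketch below is plain Python rather than Spark code. The function name, the sample diagnosis codes, and the choice to return 0.0 for two empty sets are illustrative assumptions, not details from the deck.

```python
from typing import Hashable, Iterable


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets score 0.0 instead of dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical diagnosis-code sets for two patients.
patient_1 = {"E11", "I10", "J45"}
patient_2 = {"I10", "J45", "N18"}
print(jaccard(patient_1, patient_2))  # 2 shared codes out of 4 distinct codes -> 0.5
```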
Spark will replace MapReduce
as the standard execution engine for Hadoop
Community Initiative: Spark Supersedes MapReduce
Cloudera is driving community development to port components to Spark:
Stage 1
• Crunch on Spark
• Search on Spark
Stage 2
• Hive on Spark (beta)
• Spark on HBase (beta)
Stage 3
• Pig on Spark (alpha)
• Sqoop on Spark
Uniting Spark and Hadoop
The One Platform Initiative Investment Areas
Management
Leverage Hadoop-native
resource management.
Security
Full support for Hadoop security
and beyond.
Scale
Enable 10k-node clusters.
Streaming
Support for 80% of common stream
processing workloads.
Management
Leverage Hadoop-native resource management
Spark-on-YARN
• Drove Spark-On-YARN Integration
• Improve Spark-On-YARN Integration for
better multi-tenancy, performance and
ease of use
Metrics
• Improved metrics for debugging and
monitoring
• Improve metrics for visibility into resource
utilization
• Revamp WebUI for better debugging and
monitoring (especially at high concurrency)
Automation
• Dynamic Resource Allocation based on
needs of job
• Smart auto-selection and tuning of job
parameters (when data volumes change)
Accessibility
• SparkSQL & Hive integration
improvements
• Easy Python dependency management
for PySpark
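As a concrete illustration of the Dynamic Resource Allocation item under Automation, the following minimal Python sketch assembles the `--conf` arguments that switch the feature on for a Spark job. The property names are standard Spark configuration keys. The helper functions and the executor bounds are hypothetical and chosen only for the example, and the sketch assumes the external shuffle service is available on the cluster's nodes.

```python
from typing import Dict, List


def dynamic_allocation_conf(min_executors: int, max_executors: int) -> Dict[str, str]:
    """Build Spark properties that enable dynamic allocation (illustrative helper)."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Assumption: the external shuffle service runs on each node, so
        # executors can be released without losing their shuffle files.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a property map into repeated --conf key=value arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Example bounds (2 and 20) are arbitrary placeholders.
submit_args = to_submit_args(dynamic_allocation_conf(2, 20))
print(" ".join(submit_args))
```

The resulting arguments would be passed to `spark-submit` alongside the application itself. Suitable bounds depend on the cluster and workload.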
Security
Full support for Hadoop security and beyond
Perimeter
• Kerberos Integration
Visibility
• Audit/Lineage via Cloudera Navigator
• Full Spark PCI compliance
Access
• HDFS Sync (Apache Sentry)
• Enable column- and view-level security
Data Protection
• Integration with Intel’s Advanced
Encryption libraries
Scale
Enable 10K-Node Clusters
Fault-Tolerance
• Revamp Scheduler handling of node failure
• Dynamic resource utilization &
prioritization
Performance
• Task scheduling based on HDFS data
locality & HDFS caching (reduce data
movement and enable more jobs)
• Integrate with HDFS Discardable
Distributed Memory (reduce memory
pressure)
• Scheduler improvements for performance
at scale
Stability
• Sort Based Shuffle Improvements for
improved stability at scale
• Stress test at scale with mixed multi-tenant
workloads
• Scale Spark History Server for 1000s of jobs
Streaming
Support for 80% of common stream processing workloads
Zero Data Loss
• Delivered full data resiliency
Management
• Streaming application management via
Cloudera Manager for zero downtime
Ingest
• Delivered Flume integration and drove
Kafka integration
Accessibility
• SQL interfaces and API extensions for
Streaming jobs
Performance
• Improved State Management to enable
maintaining a high volume of state
information
The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines
General Data Processing w/Spark
- Fast Batch Processing, Machine Learning, and Stream Processing
Analytic Database w/Impala
- Low-Latency, Massively Concurrent Queries
Full-Text Search w/Solr
- Querying textual data
On-Disk Processing w/MapReduce
- Jobs at extreme scale and extremely disk I/O intensive
Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance
Cloudera is Built for Production Success
Hadoop delivers:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera delivers:
• Leading Performance
• Enterprise Security
• Data Management
• Simple Administration
Security and Administration
Unlimited Storage
Process, Discover, Model, Serve
Deployment Flexibility
• On-Premises
• Appliances
• Engineered Systems
• Public Cloud
• Private Cloud
• Hybrid Cloud
A modern data platform plus what the enterprise requires.
Spark Resources
• Learn Spark
- O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
- Cloudera Developer Blog
- cloudera.com/spark
• Get Trained
- Cloudera Spark Training
• Try it Out
- Cloudera Live Spark Tutorial
The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce,
Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco
Thank You!

Spark One Platform Webinar
