Apache Spark
The Emerging Platform for Distributed Analytics
July 2014
Thomas W. Dinsmore
What is Apache Spark?
• Distributed in-memory analytics engine
• Runs in standalone clusters or Hadoop
• Fully compatible with Hadoop storage APIs
• Runs under YARN
• Top-level Apache project
• Supported in all major Hadoop distros
• Open source and vendor neutral
Spark Timeline
[Timeline figure, 2009 to 2014. Milestones marked on the axis:]
• Project begins
• Open sourced
• Spark Summit 2013
• Apache Incubator
• Apache Top-Level
• Vendor support: Cloudera, MapR, Hortonworks, SAP
News cascade starting late last year.
What problems does Spark solve?
Problem #1: MapReduce I/O sandbags runtime for advanced analytics.
• Advanced analytics often requires multiple passes through data
• Must persist results after each pass through data
[Diagram: Compute and Store steps, each tied to Hadoop Storage]
Spark Vision: Distributed in-memory platform
• Intermediate results stay in memory.
• 100X performance improvement for iterative algorithms.
[Diagram: chained Compute steps over a single Hadoop Storage layer]
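The contrast between the two slides above can be sketched in a few lines. This is a toy, single-machine illustration in plain Python, not Spark or MapReduce code: the function names, the JSON files, and the simple "estimate the mean" iteration are all assumptions made for the example. It only shows why an iterative algorithm pays for storage on every pass in one model and once in the other.

```python
import json
import os
import tempfile


def gradient_step(values, estimate, rate=0.5):
    """One pass over the data: move the estimate toward the mean."""
    gradient = sum(estimate - v for v in values) / len(values)
    return estimate - rate * gradient


def run_with_disk_round_trips(path, passes):
    """MapReduce-style: reload the input and persist the result on every pass."""
    state_path = path + ".state"
    with open(state_path, "w") as f:
        json.dump({"estimate": 0.0}, f)
    reads = 0
    for _ in range(passes):
        with open(path) as f:              # reload the data set from storage
            values = json.load(f)
        reads += 1
        with open(state_path) as f:        # reload the previous result
            estimate = json.load(f)["estimate"]
        with open(state_path, "w") as f:   # persist the new result
            json.dump({"estimate": gradient_step(values, estimate)}, f)
    with open(state_path) as f:
        final = json.load(f)["estimate"]
    os.remove(state_path)
    return final, reads


def run_in_memory(path, passes):
    """In-memory idea: read once, keep data and intermediate results in memory."""
    with open(path) as f:
        values = json.load(f)              # the only read from storage
    estimate = 0.0
    for _ in range(passes):
        estimate = gradient_step(values, estimate)
    return estimate, 1


def demo():
    """Run both variants on the same small data set."""
    values = [2.0, 4.0, 6.0, 8.0]
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(values, f)
        path = f.name
    try:
        return run_with_disk_round_trips(path, 20), run_in_memory(path, 20)
    finally:
        os.remove(path)
```

Both variants reach the same estimate; the first touches storage on all 20 passes, the second only once.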
Problem #2: Many “point” solutions for advanced analytics in Hadoop
• Machine Learning
• Queries
• Graph Analytics
• Streaming Analytics
Spark Vision: single integrated platform for advanced analytics in Hadoop.
• Simplified administration
• Integrated results
How important is Spark?
Mike Olson, Cloudera: “The leading candidate for ‘successor to MapReduce’ today is Apache Spark.”
M.C. Srivas, MapR: “We believe Spark on Hadoop is a game changer for any business.”
Ben Lorica, O’Reilly Media: “The number of companies that are using Spark in production has exploded over the last year.”
Apache Spark is the most active project in the Hadoop ecosystem.
[Chart: Commits, Past 12 Months (22%). Source: Cloudera]
Spark’s Key Capabilities
Spark 1.0 Machine Learning
• Linear Regression
• Logistic Regression
• Linear Support Vector Machine
• Regularization
• Decision Trees
• Naive Bayes
• Alternating Least Squares
• K-Means Plus-Plus
• Singular Value Decomposition
• Principal Components Analysis
• Stochastic Gradient Descent
• L-BFGS
Spark project expects to double supported techniques in 1.1 (August 2014).
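Two entries in the list above, linear regression and stochastic gradient descent, fit together naturally. The following is a minimal single-machine sketch in plain Python, not Spark's machine learning API: the function name `sgd_linear_regression`, the learning rate, the epoch count, and the sample data are assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(points, rate=0.1, epochs=500, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error.

    `points` is a list of (x, y) pairs. The parameters are updated after
    every single example, which is what makes the descent "stochastic".
    """
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    data = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)          # visit examples in a new order each epoch
        for x, y in data:
            error = (w * x + b) - y
            w -= rate * error * x  # gradient of 0.5 * error**2 w.r.t. w
            b -= rate * error      # gradient of 0.5 * error**2 w.r.t. b
    return w, b


# Noise-free sample data drawn from y = 2x + 1 (illustrative only).
sample = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w, b = sgd_linear_regression(sample)
```

On this noise-free sample the fitted `w` and `b` approach 2 and 1. A distributed implementation would spread the gradient work across partitions of the data, which this sketch does not attempt.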
Spark SQL
• Currently most active project
• Supports fast interactive queries
• Hive-compatible
• Works with Hive data
• Runs unmodified queries
• Roadmap to support more formats
• Will absorb Shark project
Spark Streaming
• Supports analysis of data streams in real time
• Unifies streaming and batch data
• Integrates with popular data sources:
• HDFS
• Flume
• Kafka
• Twitter
• Easy to use
• Fault tolerant
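One way to picture "unifies streaming and batch data" is to run the same batch logic over small slices of a stream. The sketch below is plain Python, not the Spark Streaming API: `count_words`, `micro_batches`, and `streaming_word_count` are hypothetical names, and cutting by a fixed number of lines is an assumption standing in for a time-based batch interval.

```python
from collections import Counter
from itertools import islice


def count_words(lines):
    """Batch logic: word counts for any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator of lines into small lists."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size=2):
    """Streaming logic: apply the batch function to each micro-batch and
    yield a snapshot of the running totals after every batch."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(count_words(batch))   # Counter.update adds counts
        yield Counter(totals)               # copy so snapshots stay fixed
```

Because `count_words` is reused unchanged, the final streaming snapshot equals the result of running the batch function over the whole input.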
Spark Graph Analytics
• Currently Alpha release
• Unifies graph-parallel and data-parallel computing under single API
• Performance parity with Giraph
• Replaces Spark Bagel (Pregel on Spark)
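For readers unfamiliar with the Pregel model mentioned above, here is a rough single-machine sketch of its vertex-centric style in plain Python. It is not GraphX or Bagel code: the function `pregel_min_label` and the choice of connected components as the example are assumptions made for illustration. Each vertex keeps a label, reads its incoming messages in a superstep, and sends messages to neighbors only when its label changes.

```python
def pregel_min_label(edges):
    """Label each vertex with the smallest vertex id in its connected component.

    `edges` is a list of undirected (a, b) pairs. Returns the final labels
    and the number of supersteps that were run.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = {v: v for v in neighbors}
    # Initial messages: every vertex sends its own id to its neighbors.
    inbox = {v: [labels[n] for n in neighbors[v]] for v in neighbors}
    supersteps = 0
    while any(inbox.values()):             # stop when no messages are in flight
        supersteps += 1
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)  # adopt the smaller label
                for n in neighbors[v]:     # tell neighbors about the change
                    outbox[n].append(labels[v])
        inbox = outbox
    return labels, supersteps
```

Labels only ever decrease, so the loop ends once every vertex in a component holds the component's smallest id.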
Spark Performance
Machine Learning
• 100X faster than MapReduce
Queries (Shark)
• Comparable to Impala
• 100X faster than Hive
Streaming
• 2X throughput of Storm
Graph (GraphX)
• Comparable to Giraph
• 10X faster than MapReduce
Spark Distributions
Every major Hadoop distribution, plus…
[Logos with captions: Connector; Big Data Appliance; Interface to HANA]
Programming Interfaces
• Supported APIs [logos]
• “Alpha” Release: “SparkR”
Spark project expects to release production-grade R interface in early 2015.
Spark Users
Certified on Spark
Who is Databricks?
• Commercial venture, launched 2013
• Founded by Spark principals
• Services and support business model
• Gatekeepers to Spark
• Just landed $33M in Series B from Andreessen Horowitz and New Enterprise Associates
• Just announced Spark Cloud product
Thank You
