Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in-memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark core provides in-memory computing capabilities and a programming model that allows users to write programs as transformations on distributed datasets.
Part 2 of a three part presentation showing how nutch and solr may be used to crawl the web, extract data and prepare it for loading into a data warehouse.
On July 6, 2021, MariaDB 10.6 became generally available (production ready). This presentation focuses on the most important aspects of it as well as the influence it has. Improvements to InnoDB, SYS Schema Adoption, and deprecated variables and engines are all part of this presentation.
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an... (DataStax)
Learn how to build an effective storage layer for a variety of workloads. With changing trends in system and storage hardware, understanding design trade-offs can be a challenge. This webinar will focus on cutting through the noise and diving into the choices that matter when designing for scale and performance.
Video: https://youtu.be/uEL8vyVSIis
Understanding BlueStore, Ceph's new storage backend - Tim Serong, SUSE (OpenStack)
Audience Level
Intermediate
Synopsis
Ceph – the most popular storage solution for OpenStack – stores all data as a collection of objects. This object store was originally implemented on top of a POSIX filesystem, an approach that turned out to have a number of problems, notably with performance and complexity.
BlueStore, a new storage backend for Ceph, was created to solve these issues; the Ceph Jewel release included an early prototype. The code and on-disk format were declared stable (but experimental) for Ceph Kraken, and now in the upcoming Ceph Luminous release, BlueStore will be the recommended default storage backend.
With a 2-3x performance boost, you’ll want to look at migrating your Ceph clusters to BlueStore. This talk goes into detail about what BlueStore does, the problems it solves, and what you need to do to use it.
Speaker Bio:
Tim works for SUSE, hacking on Ceph and related technologies. He has spoken often about distributed storage and high availability at conferences such as linux.conf.au. In his spare time he wrangles pigs, chickens, sheep and ducks, and was declared by one colleague “teammate most likely to survive the zombie apocalypse”.
Large Scale Crawling with Apache Nutch and Friends (lucenerevolution)
Presented by Julien Nioche, Director, DigitalPebble
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Adobe has packaged HBase in Docker containers and uses Marathon and Mesos to schedule them—allowing them to decouple the HBase RegionServer from the host, express resource requirements declaratively, and open the door for unassisted real-time deployments, elastic (up and down) real-time scalability, and more.
A brief, but action-packed introduction to DataStax Enterprise Search. In this deck, we'll get an overview of DSE Search's value proposition, see some example CQL search queries, and dive into the details of the indexing and query paths.
Evolution of MongoDB Replicaset and Its Best Practices (Mydbops)
There are several exciting and long-awaited features in MongoDB 4.0. This talk focuses on the prime features, the problems they solve, and best practices for deploying replica sets.
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
Introduction to HCatalog - its primary motivation, goals, the most important features (e.g. data discovery, notifications of data availability, WebHCat), currently supported file formats and projects.
Big Data and Machine Learning Workshop - Day 7 @ UTACM (Amir Sedighi)
Slides from day seven of the seven-day Big Data and Machine Learning workshop, which covered implementing a sample industrial machine-learning service and an introduction to installing and using TensorFlow.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 5 @ UTACM (Amir Sedighi)
Slides from day five of the seven-day Big Data and Machine Learning workshop, held with an emphasis on deep learning. The sixth session of the workshop will also be devoted to deep learning and its applications. The workshop is organized by the University of Tehran ACM chapter and held at the Faculty of Engineering.
Each session is two hours long.
Apache Kafka is an open-source message-broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Case Studies on Big-Data Processing and Streaming - Iranian Java User Group (Amir Sedighi)
During recent years, the data science has undergone a big shift towards big data processing. As a result, a change in our methodology seems to be inevitable. This change, however, does not necessarily translate to a loss in decades of investments in classical data processing technologies and data warehousing. Instead, it supports adapting to the new environment with regards to the mass production of business data, by adopting modern practices.
In this talk we review some frameworks and solutions to modern big data processing approaches, along with a few case studies that have been carried out in Iran.
A detailed presentation on the capabilities of in-memory analytics using Apache Spark: an overview of Spark, its programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce, followed by the Spark stack extensions: Shark, Spark Streaming, MLlib, and GraphX.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Unified Big Data Processing with Apache Spark (QCON 2014) (Databricks)
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operated on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple uses cases: streaming, but also interactive, iterative, machine learning, etc.
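The micro-batch model described above can be sketched in a few lines of plain Python (illustrative only, not Spark's API): chop a timestamped stream into fixed intervals and run the same batch function over each interval.

```python
# Plain-Python sketch of discretized streams: an unbounded stream is cut
# into small batches (e.g. every 500 ms) and each batch is processed with
# ordinary batch logic. Timestamps are in milliseconds.

events = [(100, "a"), (400, "b"), (600, "a"), (900, "c"), (1200, "a")]
interval_ms = 500

# group events into micro-batches by interval index
batches = {}
for ts, value in events:
    batches.setdefault(ts // interval_ms, []).append(value)

# the same per-batch function handles every interval
def count_batch(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts

per_batch_counts = {i: count_batch(vs) for i, vs in sorted(batches.items())}
print(per_batch_counts)
# {0: {'a': 1, 'b': 1}, 1: {'a': 1, 'c': 1}, 2: {'a': 1}}
```

Because each interval is just a small batch, the same counting logic could equally be applied interactively or over historical data, which is the unification the talk emphasizes.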
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark delivers its lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark" (IT Event)
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
Big Data and Machine Learning Workshop - Day 6 @ UTACM (Amir Sedighi)
Slides from day six of the seven-day Big Data and Machine Learning workshop, held with an emphasis on deep learning. The sixth session of the workshop was devoted to deep learning and its applications. The workshop is organized by the University of Tehran ACM chapter and held at the Faculty of Engineering.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 4 @ UTACM (Amir Sedighi)
Slides from day four of the seven-day Big Data and Machine Learning workshop, including an introduction to artificial neural networks and a simple sample implementation in Java. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 3 @ UTACM (Amir Sedighi)
Slides from the third day of the seven-day Big Data and Machine Learning workshop, which introduced open-source big-data processing solutions and stream-processing approaches. The core concepts were reviewed, and a small working example of using Hadoop was presented. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 2 @ UTACM (Amir Sedighi)
Slides from the second day of the seven-day Big Data and Machine Learning workshop, with an emphasis on unsupervised learning and a practical example of text clustering using term-weighting, Canopy, and k-means algorithms, held on 13 Mordad 1395 (August 2016) at the Faculty of Engineering, University of Tehran. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 1 @ UTACM (Amir Sedighi)
The first day of the seven-day Big Data and Machine Learning workshop, with an emphasis on supervised learning and a practical fraud-detection example, was held on 6 Mordad 1395 (July 2016) at the Faculty of Engineering, University of Tehran. These are the day-one slides. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
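For reference, the "monolithic" baseline the report compares against is standard power-iteration PageRank, in which every vertex is processed each iteration. A toy plain-Python sketch (hypothetical 4-vertex graph, assumed damping factor 0.85) with dead ends handled by spreading their rank uniformly:

```python
# Minimal monolithic PageRank power iteration on a toy graph (an assumed
# example, not the report's code). A dead-end vertex's rank is distributed
# uniformly over all vertices each iteration.

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # vertex -> out-neighbors
n = len(graph)
damping, iters = 0.85, 50

ranks = [1.0 / n] * n
for _ in range(iters):
    new = [(1.0 - damping) / n] * n          # teleport term
    for u, outs in graph.items():
        if outs:
            share = damping * ranks[u] / len(outs)
            for v in outs:
                new[v] += share
        else:                                 # dead end: spread uniformly
            for v in range(n):
                new[v] += damping * ranks[u] / n
    ranks = new

assert abs(sum(ranks) - 1.0) < 1e-9          # ranks remain a distribution
```

Levelwise PageRank instead runs this style of iteration per strongly connected component, one topological level at a time, which is why dead ends must be eliminated (or handled, as in the loop-based strategy above) beforehand.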
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
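For context, the CSR structure mentioned above can be built in a few lines; a toy plain-Python sketch (the report's actual C++/CUDA code is not shown here):

```python
# Toy Compressed Sparse Row (CSR) build: edge list -> (offsets, targets).
# targets[offsets[v] : offsets[v + 1]] holds vertex v's out-neighbors.

edges = [(0, 1), (0, 2), (1, 2), (3, 0)]  # (source, destination) pairs
n = 4                                      # number of vertices

degree = [0] * n
for u, _ in edges:
    degree[u] += 1

offsets = [0] * (n + 1)                    # prefix sum of out-degrees
for v in range(n):
    offsets[v + 1] = offsets[v] + degree[v]

targets = [0] * len(edges)
fill = offsets[:-1].copy()                 # next free slot per vertex
for u, v in edges:
    targets[fill[u]] = v
    fill[u] += 1

def neighbors(v):
    return targets[offsets[v]:offsets[v + 1]]

print(neighbors(0))  # [1, 2]
```

The two flat arrays give contiguous, cache-friendly neighbor scans, which is why CSR is the usual substrate for the map/reduce-style primitives benchmarked in these notes.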
An Introduction to Apache Spark
1. An Introduction to Apache Spark
By Amir Sedighi
Datis Pars Data Technology
Slides adopted from Databricks
(Paco Nathan and Aaron Davidson)
@amirsedighi
http://hexican.com
2. History
● Developed in 2009 at UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest big-data
projects, with more than 400 contributors from 50+
organizations such as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
3. What is Spark?
● Fast and general cluster computing system
interoperable with Hadoop datasets.
4. What are Spark's improvements?
● Improves efficiency through:
– In-memory computing primitives.
– General computation graphs.
● Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell (Scala/Python)
6. MapReduce
● MapReduce is great for single-pass batch jobs,
but many use cases need to run MapReduce
in a multi-pass manner...
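A toy illustration of the cost (plain Python, neither MapReduce nor Spark code): a multi-pass job that re-reads its input on every pass versus one that loads the dataset once and keeps it in memory.

```python
# Toy sketch: why multi-pass workloads hurt in MapReduce. Each pass
# would re-read its input from disk (HDFS); keeping the dataset in
# memory between passes is the optimization Spark's RDDs provide.

def load_dataset():
    # stands in for an expensive HDFS read
    return list(range(10))

passes = 3

# MapReduce-style: every pass pays the load cost again
results_mr = []
for _ in range(passes):
    data = load_dataset()                     # re-read per pass
    results_mr.append(sum(x * x for x in data))

# Spark-style: load once, cache in memory, run all passes on it
data = load_dataset()                         # read once ("cache")
results_spark = [sum(x * x for x in data) for _ in range(passes)]

assert results_mr == results_spark            # same answers, fewer reads
```

The answers are identical; the difference is that the Spark-style version calls `load_dataset()` once instead of once per pass.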
7. What improvements does Spark make over MapReduce?
● Spark improves on MapReduce by supporting
multi-pass analytics, interactive, and real-time
distributed computation on top of Hadoop.
Note:
– Spark is a successor to Hadoop MapReduce.
12. Spark Programming Model
● At a high level, every Spark application consists
of a driver program that runs the user's main
function.
● Encourages you to write programs in terms of
transformations on distributed datasets.
13. Spark Programming Model
● The main abstraction Spark provides is a
resilient distributed dataset (RDD).
– A collection of elements partitioned across the
cluster (in memory or on disk)
– Can be accessed and operated on in parallel (map,
filter, ...)
– Automatically rebuilt on failure
14. Spark Programming Model
● RDD Operations
– Transformations: Create a new dataset from an
existing one.
● Example: map()
– Actions: Return a value to the driver program after
running a computation on the dataset.
● Example: reduce()
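The split between lazy transformations and eager actions can be mimicked with Python's built-in lazy iterators; a sketch of the model, not Spark's API:

```python
# Plain-Python sketch of the RDD model: transformations are lazy and
# only describe a computation; an action forces evaluation and returns
# a value to the "driver".
from functools import reduce

data = range(1, 5)

# "transformations": build a lazy pipeline, nothing is computed yet
mapped = map(lambda x: x * 10, data)          # like rdd.map(...)
filtered = filter(lambda x: x > 10, mapped)   # like rdd.filter(...)

# "action": pulls data through the pipeline and returns a result
total = reduce(lambda a, b: a + b, filtered)  # like rdd.reduce(...)
print(total)  # 20 + 30 + 40 = 90
```

As in Spark, nothing runs until the action at the end; `map` and `filter` here return lazy iterator objects, just as RDD transformations return new (unevaluated) RDDs.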
16. Spark Programming Model
● Another abstraction is shared variables:
– Broadcast variables, which can be used to cache a
value in memory on all nodes.
– Accumulators, which workers can only add to and
the driver can read.
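A minimal plain-Python sketch of the two shared-variable kinds (the class names are illustrative, not Spark's API):

```python
# Sketch of shared-variable semantics. A broadcast variable is a
# read-only value shipped once to every worker; an accumulator is
# write-only for workers and readable by the driver.

class Broadcast:
    def __init__(self, value):
        self.value = value           # read-only by convention

class Accumulator:
    def __init__(self, value=0):
        self._value = value
    def add(self, amount):           # workers may only add
        self._value += amount
    @property
    def value(self):                 # the driver reads the total
        return self._value

lookup = Broadcast({"a": 1, "b": 2})  # shipped once to all nodes
errors = Accumulator(0)

for record in ["a", "b", "x", "b"]:   # stands in for distributed tasks
    if record not in lookup.value:
        errors.add(1)                 # count bad records across tasks

print(errors.value)  # 1
```

In real Spark the driver creates these via `sc.broadcast(...)` and an accumulator API, and the add-only restriction is what makes accumulator updates safe to apply from many tasks in parallel.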
20. Ease of Use
● Spark offers over 80 high-level operators that
make it easy to build parallel apps.
● Scala and Python shells to use it interactively.
23. Apache Spark Core
● Spark Core is the general engine for the Spark
platform.
– In-memory computing capabilities deliver speed
– General execution model supports wide delivery of
use cases
– Ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)
35. Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
36. Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
transformation: modify data in one DStream to create
another (new) DStream
37. Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
sliding window operation: parameters are the window
length and the sliding interval
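In plain Python the windowed count above can be mimicked like this (a toy sketch, not Spark's API; a window of 3 batches sliding by 1 stands in for `window(Minutes(1), Seconds(1))`):

```python
# Sketch of a windowed count over micro-batches: each batch is one
# interval's worth of hashtags; the window covers the last 3 batches
# and slides forward by 1 batch at a time.

batches = [["spark"], ["spark", "kafka"], ["kafka"], ["spark"]]
window_len = 3

def count_by_value(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

windowed = []
for i in range(len(batches)):
    window = batches[max(0, i - window_len + 1): i + 1]
    flat = [tag for batch in window for tag in batch]
    windowed.append(count_by_value(flat))

print(windowed[-1])  # counts over the last 3 batches
```

Each output is recomputed from the batches currently inside the window, which is exactly what `hashTags.window(...).countByValue()` expresses in the Scala snippet above.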
40. MLlib
● MLlib is Spark's scalable machine learning
library.
● MLlib works on any Hadoop data source, such
as HDFS, HBase, or local files.
41. MLlib
● Algorithms:
– linear SVM and logistic regression
– classification and regression tree
– k-means clustering
– recommendation via alternating least squares
– singular value decomposition
– linear regression with L1- and L2-regularization
– multinomial naive Bayes
– basic statistics
– feature transformations
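As a flavor of what one of these algorithms does, here is a toy one-dimensional k-means in plain Python (not MLlib's API, which runs the same idea distributed over RDDs):

```python
# One-dimensional k-means, a toy version of the clustering MLlib
# provides at scale. Alternates assign/update steps: assign each point
# to its nearest center, then move each center to its cluster's mean.

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [1.0, 10.0]                 # initial guesses, k = 2

for _ in range(10):
    # assignment step: nearest center for each point
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # update step: move each center to its cluster's mean
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # [1.5, 10.5]
```

The two well-separated groups converge immediately here; MLlib's version distributes the assignment step as a map over partitions and the update step as a reduce.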
46. Spark Runs Everywhere
● Spark runs on Hadoop, Mesos, standalone, or
in the cloud.
● Spark accesses diverse data sources including
HDFS, Cassandra, HBase, S3.
47. Resources
● http://spark.apache.org
● Intro to Apache Spark by Paco Nathan
● Building a Unified Data Pipeline in Spark by Aaron Davidson.
● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
● Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup