1.
Intro to Apache Spark
Marius Soutier
Freelance Software Engineer
@mariussoutier
Clustered In-Memory Computation
2.
Motivation
• Classical data architectures break down
• RDBMSs can’t handle large amounts of data well
• Most RDBMSs can’t handle multiple input formats
• Most NoSQL stores don’t offer analytics
Problem: Running computations on BigData®
3.
The 3 Vs of Big Data
Volume: 100s of GB, TB, PB
Variety: structured, semi-structured, unstructured
Velocity: sensors, real-time (“Fast Data”)
4.
Hadoop (1)
• The de-facto standard for running computations on large amounts of
heterogeneous data is Hadoop
• Hadoop consists of
• HDFS, a distributed, fault-tolerant file system
• Map/Reduce, a model for parallelizable computations pioneered by Google (sketched below)
• Hadoop is typically run on a (large) cluster of non-virtualized
commodity hardware
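To make the Map/Reduce model concrete, here is a minimal sketch of its two phases using plain Scala collections; the real Hadoop API is Java-based and considerably more verbose:

val lines = Seq("to be or not to be", "to do is to be")

// Map phase: emit a (word, 1) pair for every word
val mapped = lines.flatMap(_.split("\\s+").map(word => word -> 1))

// Shuffle: group all pairs by their key (the word)
val grouped = mapped.groupBy(_._1)

// Reduce phase: sum the values for each key
val counts = grouped.map { case (word, pairs) => word -> pairs.map(_._2).sum }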
5.
Hadoop (2)
• However, Map/Reduce jobs are batch jobs with high latency
• Not suitable for interactive queries, real-time analytics,
or Machine Learning
• Pure Map/Reduce is hard to develop and maintain
6.
Enter Spark
Spark
is a framework for
clustered
in-memory
data processing
7.
• Developed at UC Berkeley, released in
2010
• Apache top-level project since February
2014; current version is 1.2.1 / 1.3.0
• USP: Uses cluster-wide available memory
to speed up computations
• Very active community
Apache Spark (1)
8.
• Written in Scala (& Akka),
APIs for Java and Python
• Programming model is a collection pipeline*
instead of Map/Reduce (see the sketch below)
• Supports batch, streaming, and interactive processing,
or all combined, using a unified API
Apache Spark (2)
* http://martinfowler.com/articles/collection-pipeline/
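For comparison, a collection pipeline on an ordinary Scala collection; Spark’s RDD API deliberately mirrors this style, as the word-count example on slide 12 shows:

// Each step transforms the whole collection and feeds the next step
val totalEven = (1 to 100)
  .filter(_ % 2 == 0) // keep even numbers
  .map(_ * 2)         // double them
  .sum                // aggregate to a single value

// The same shape in Spark, only distributed across the cluster:
// sc.parallelize(1 to 100).filter(_ % 2 == 0).map(_ * 2).sum()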
10.
Spark is a framework for clustered in-memory
data processing
Spark is a platform for data-driven products.
11.
• Base abstraction: the Resilient Distributed Dataset (RDD)
• Essentially a distributed collection of objects
• Can be cached in memory or on disk (see the sketch below)
RDD
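A minimal sketch of creating and caching an RDD; the data and the chosen storage levels are illustrative, while StorageLevel itself is part of the standard Spark API:

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext()

// Distribute a local collection across the cluster as an RDD
val inMemory = sc.parallelize(1 to 1000000).cache() // memory only (default)

// The storage level can also be chosen explicitly, e.g. to spill to
// disk when memory is tight (a level can only be assigned once per RDD)
val spillable = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)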
12.
RDD Word Count
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext()
// Read the input file as an RDD of lines
val input: RDD[String] = sc.textFile("/tmp/word.txt")
// Split into lowercase words, pair each word with a count of 1
val words: RDD[(String, Long)] = input
  .flatMap(line => line.toLowerCase.split("\\s+"))
  .map(word => word -> 1L)
  .cache()
// Sum the counts per word, sort alphabetically by word
val wordCountsRdd: RDD[(String, Long)] = words
  .reduceByKey(_ + _)
  .sortByKey()
// Materialize the result on the driver
val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()
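Note that the transformations above (flatMap, map, reduceByKey, sortByKey) are lazy; nothing runs on the cluster until the collect() action is called. collect() also pulls the entire result into the driver’s memory, so for large results you would rather use take(n) or an action like saveAsTextFile to write the RDD back to HDFS.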
13.
Cluster
[Diagram: the Driver, holding the SparkContext, registers with the Master; each Worker node runs an Executor, and the Executors run the Tasks sent to them]
• The Spark app (driver) builds a DAG from RDD operations
• The DAG is split into tasks that are executed by the workers (see the sketch below)
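A hedged sketch of how the driver attaches to a cluster; the master URL and app name are placeholders, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

// "spark://master-host:7077" is a placeholder standalone master URL;
// "local[*]" would instead run everything in-process for testing
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://master-host:7077")

val sc = new SparkContext(conf)
// RDD operations on this context are turned into a DAG, split into
// tasks, and shipped to the executors on the workers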
14.
Example Architecture
[Diagram: input flows into HDFS and a message queue; Spark Streaming consumes the queue to feed a real-time dashboard, Spark batch jobs produce analytics and reports from HDFS, and SparkSQL answers interactive SQL queries]
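A minimal sketch of the streaming path in this architecture, assuming a socket source as a stand-in for the message queue; host, port, and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
// 10-second micro-batches; the interval is an assumption
val ssc = new StreamingContext(conf, Seconds(10))

// Placeholder source; a real pipeline would read from Kafka or similar
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines
  .flatMap(_.toLowerCase.split("\\s+"))
  .map(word => word -> 1L)
  .reduceByKey(_ + _)

// Print each batch; a dashboard would instead push results to a store
counts.print()

ssc.start()
ssc.awaitTermination()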