Spark is a framework for clustered in-memory data processing. It was developed at UC Berkeley and is now an Apache top-level project. Spark caches data in cluster-wide memory to speed up computations over large datasets, which is especially effective for iterative workloads that reuse the same data. The core abstraction in Spark is the resilient distributed dataset (RDD), a fault-tolerant, partitioned collection of objects distributed across a cluster. Spark also provides APIs for batch processing, streaming, SQL, machine learning, and graph processing.
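To make the RDD abstraction concrete, the sketch below is a toy, single-process stand-in for the RDD programming model, not Spark itself: real RDDs are partitioned across machines, but the key pattern is the same, in that transformations such as map and filter are lazy and only record lineage, while actions such as collect and reduce trigger actual evaluation. The class and method names here mirror the RDD API for illustration only.

```python
from functools import reduce as _reduce

class ToyRDD:
    """Toy, in-memory stand-in for Spark's RDD model (illustration only)."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # deferred transformations (the "lineage")

    # Transformations are lazy: they record the operation, nothing runs yet.
    def map(self, fn):
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    # Actions force evaluation of the recorded lineage.
    def collect(self):
        out = self._data
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:  # "filter"
                out = [x for x in out if fn(x)]
        return out

    def reduce(self, fn):
        return _reduce(fn, self.collect())

# Build a lineage of lazy transformations, then run actions.
rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())                     # [0, 4, 16, 36, 64]
print(rdd.reduce(lambda a, b: a + b))    # 120
```

In actual PySpark the equivalent pipeline would start from a SparkContext, e.g. `sc.parallelize(range(10)).map(...).filter(...).collect()`, with the work distributed over the cluster's executors and lost partitions recomputed from lineage on failure.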