1. Intro to Apache Spark
Clustered In-Memory Computation
Marius Soutier
Freelance Software Engineer
@mariussoutier
2. Motivation
• Classical data architectures break down
• RDBMS can’t handle large amounts of data well
• Most RDBMS can’t handle multiple input formats
• Most NoSQL stores don’t offer analytics
Problem: Running computations on BigData®
3. The 3 Vs of Big Data
• Volume: 100s of GB, TB, PB
• Variety: structured, unstructured, semi-structured
• Velocity: sensors, realtime, “Fast Data”
4. Hadoop (1)
• The de-facto standard for running computations on large amounts of varied data is Hadoop
• Hadoop consists of
• HDFS: a distributed, fault-tolerant file system
• Map/Reduce: a model for parallelizable computations, pioneered by Google (sketched below)
• Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware
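To make the two phases concrete, here is a minimal word-count sketch using plain Scala collections rather than the Hadoop API: the map phase emits (word, 1) pairs, the framework groups them by key, and the reduce phase aggregates each group.

// Conceptual Map/Reduce word count with plain Scala collections
// (an illustration of the model, not the Hadoop API)
val lines = Seq("to be or not to be")
// Map phase: emit a (word, 1) pair for every word
val mapped = lines.flatMap(_.split("\\s+")).map(word => word -> 1)
// Shuffle groups pairs by key; reduce phase sums the counts per word
val counts = mapped
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => word -> pairs.map(_._2).sum }
// counts == Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)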
5. Hadoop (2)
• However, Map/Reduce jobs are batch jobs with high latency
• Not suitable for interactive queries, real-time analytics, or Machine Learning
• Pure Map/Reduce is hard to develop and maintain
7. Apache Spark (1)
• Developed at UC Berkeley, released in 2010
• Apache Top-Level Project since February 2014, current version is 1.2.1 / 1.3.0
• USP: uses cluster-wide available memory to speed up computations
• Very active community
8. Apache Spark (2)
• Written in Scala (& Akka), APIs for Java and Python
• Programming model is a collection pipeline* instead of Map/Reduce
• Supports batch, streaming, and interactive processing, or all combined using a unified API (see the streaming sketch below)
* http://martinfowler.com/articles/collection-pipeline/
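As a minimal sketch of that unified model, the word count from the RDD slide further below carries over almost verbatim to the Spark Streaming 1.x API; the TCP source on localhost:9999 is a hypothetical stand-in for any line-oriented stream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches
// Hypothetical line-oriented TCP source
val lines = ssc.socketTextStream("localhost", 9999)
// The same collection-pipeline steps as in the batch word count
val counts = lines
  .flatMap(_.toLowerCase.split("\\s+"))
  .map(word => word -> 1L)
  .reduceByKey(_ + _)
counts.print() // print each micro-batch's counts
ssc.start()
ssc.awaitTermination()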
10. Spark is a framework for clustered in-memory data processing.
Spark is a platform for data-driven products.
11. RDD
• Base abstraction: the Resilient Distributed Dataset (RDD)
• Essentially a distributed collection of objects
• Can be cached in memory or on disk (see the sketch below)
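A minimal sketch of the two caching options; the input path is the hypothetical one reused on the next slide, and each storage level goes on its own RDD because Spark does not allow changing a level once assigned.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext()
val lines = sc.textFile("/tmp/word.txt") // hypothetical input path
// cache() keeps partitions in memory; shorthand for persist(StorageLevel.MEMORY_ONLY)
val inMemory = lines.map(_.toLowerCase).cache()
// persist() accepts an explicit storage level, e.g. spill to disk when memory is tight
val memoryAndDisk = lines.map(_.trim).persist(StorageLevel.MEMORY_AND_DISK)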
12. RDD Word Count
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext()
// Read the input file as an RDD of lines
val input: RDD[String] = sc.textFile("/tmp/word.txt")
// Split lines into lowercase words, pair each word with a count of 1
val words: RDD[(String, Long)] = input
  .flatMap(line => line.toLowerCase.split("\\s+"))
  .map(word => word -> 1L)
  .cache()
// Sum the counts per word, then sort alphabetically by word
val wordCountsRdd: RDD[(String, Long)] = words
  .reduceByKey(_ + _)
  .sortByKey()
// collect() materializes the result on the driver
val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()
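Since collect() ships the entire result to the driver, take(n) or saving back to storage is the safer pattern for large outputs; the output path below is hypothetical.

wordCountsRdd.take(10).foreach(println) // inspect a small sample on the driver
wordCountsRdd.saveAsTextFile("/tmp/word-counts") // hypothetical output path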