Talk given at ClojureD conference, Berlin
Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API.
In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). And you will learn how Spark's concepts resemble those well known from Clojure, like persistent data structures and functional programming.
Finally, we will provide some Do’s and Don’ts for you to kick off your Spark program based upon our experience.
About Paulus Esterhazy and Christian Betz
A Lisp hacker for several years, and a Java guy for some more, Chris turned to Clojure for production code in 2011. He has since been Project Lead, Software Architect, and VP Tech, with interests in AI and data visualization.
Now, working on the heart of data-driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API wrapper 'chrisbetz/sparkling' to fully exploit the power of his compute cluster.
Paulus Esterhazy
Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development.
He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
Big Data Processing using Apache Spark and Clojure
1. Big Data Processing using Apache Spark and Clojure
Dr. Paulus Esterhazy and Dr. Christian Betz
January 2015
https://github.com/pesterhazy/, @pesterhazy
https://github.com/chrisbetz/, @chris_betz
10. WTF is Spark?
Patrick Wendell
Databricks
Spark Performance
Common Patterns and Pitfalls for
Implementing Algorithms in Spark
Hossein Falaki
@mhfalaki
hossein@databricks.com
Advanced Spark
Reynold Xin, July 2, 2014 @ Spark Summit Training
Disclaimer: We reuse stuff
11. Apache Spark - an Overview
"Apache Spark™ is a fast and general engine for large-scale data processing."
Value proposition?
Spark keeps stuff in memory where possible, so intermediate results do not need I/O.
Spark allows a quicker development cycle, with proper unit tests (see later).
Spark lets you define your own data sources (JDBC in our case).
Spark lets you work with any data structures (though some are better suited than others).
12. Two Questions
“I like Clojure, why might I be interested in Spark?”
“Granted that Spark is useful, why program it in Clojure?”
13. Two Questions
“I like Clojure, why might I be interested in Spark?”
“Granted that Spark is useful, why program it in Clojure?”
That's you!
14. How Big Data is processed today
large amounts of data to process
Hadoop is the de facto standard
Hadoop = MapReduce + HDFS
15. However, Hadoop has some limitations
Pain point: performance
Writing to disk after each map-/reduce step
That's especially bad for chains of map-/reduce steps and iterative algorithms
(machine learning, PageRank)
Identified Bottleneck: HDD I/O
17. Resilient Distributed Datasets (RDDs)
Datasets: Collection of elements
Distributed: Elements can live on any node in the cluster.
Resilient: Parts could get lost (or partially lost); it doesn't matter, Spark will recompute them.
18. Different types of RDDs, all the same interface
Scientific Answer: RDD is an Interface!
1. Set of partitions ("splits" in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition
Items 1–3 capture the "lineage"; 4 and 5 allow for optimized execution.
19. Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(part) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none
20. Example: Filtered RDD
partitions = same as parent RDD
dependencies = "one-to-one" on parent
compute(part) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none
21. How are RDDs handled?
You create an RDD from a data source, e.g. an HDFS file, a Cassandra DB query, or a JDBC query.
You transform RDDs (with map, filter, ...), which gives you new RDDs.
You perform an action on one RDD to get the results from that RDD into your "driver" (like first, take, collect, count, ...).
22. Basic Building Blocks: RDDs
Resilient Distributed Datasets
Spark follows a functional approach:
You define collections (RDDs) and functions on collections
Sources for RDDs:
• Local collections parallelized
• HDFS files
• Your own (e.g. JDBC-RDD)
Transformations (only a selection)
• map
• filter
Actions (only a selection)
• reduce (fn)
• count
23.–27. Basic Building Blocks: RDDs (build-up of the diagram)
• Sources define the basic RDDs you're working on: JdbcRDD (Query), HDFS-File (Path).
• Transformations create new RDDs: map, filter, join.
• Actions spit a result to the Driver: reduce.
• You provide your own functions in here!
31. Only go on when your tests are green!
(deftest test-line-count*
  (let [conf (test-conf)]
    (spark/with-context sc conf
      (testing "no lines return 0"
        (is (= 0 (line-count* (spark/parallelize sc [])))))
      (testing "a single line returns 1"
        (is (= 1 (line-count* (spark/parallelize sc ["this is a single line"])))))
      (testing "multiple lines count correctly"
        (is (= 10 (line-count* (spark/parallelize sc (repeat 10 "this is a single line")))))))))
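The line-count* under test is not shown on the slide; a minimal version could be as simple as this (assuming sparkling's count wrapper; your real job will do more):

```clojure
(ns example.line-count
  (:require [sparkling.core :as spark]))

;; Count the elements (lines) of an RDD. spark/count is an action,
;; so it triggers computation and returns a number to the driver.
(defn line-count* [lines-rdd]
  (spark/count lines-rdd))
```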
32.–38. What's an RDD? What's in it?
Take e.g. a JdbcRDD (we all know relational databases...):
campaign_id from to active
1 123 2014-01-01 2014-01-31 true
2 234 2014-01-06 2014-01-14 true
3 345 2014-02-01 2014-03-31 false
4 456 2014-02-10 2014-03-09 true
That's your table.
[ {:campaign-id 123 :active true}
  {:campaign-id 234 :active true}
  {:campaign-id 345 :active false}
  {:campaign-id 456 :active true}]
RDDs are lists of objects.
[ #t[123 {:campaign-id 123 :active true}]
  #t[234 {:campaign-id 234 :active true}]]
[ #t[345 {:campaign-id 345 :active false}]
  #t[456 {:campaign-id 456 :active true}]]
PairRDDs handle key-value pairs, may have partitioners assigned, and keys are not necessarily unique!
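To get from "lists of objects" to such a PairRDD in sparkling, you key each element explicitly (a sketch; map-to-pair and tuple are the sparkling functions we use for this, and #t[...] is just the printed form of a Tuple2):

```clojure
(ns example.pairs
  (:require [sparkling.core :as spark]))

;; Turn an RDD of campaign maps into a PairRDD keyed by :campaign-id.
;; Keys need not be unique, just like on the slide.
(defn by-campaign-id [campaigns-rdd]
  (spark/map-to-pair
    (fn [c] (spark/tuple (:campaign-id c) c))
    campaigns-rdd))
```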
41. In Practice 5: details RDD
• Lazy evaluation is explicitly forced
• Transformation vs actions
• Serialization of Clojure functions
42. In Practice 6: data sources and destinations
• Writing to HDFS
• Reading from HDFS
• HDFS is versatile: text files, S3, Cassandra
• Parallelizing regular Clojure collections
43. In Practice 7: top errors
(defn top-errors [lines]
  (->> lines
       (map parse-line)                       ; parse-line: your log-line parser
       (filter (fn [entry] (not= "200" (:status entry))))
       (map (fn [entry] [(:uri entry) 1]))
       (reduce (fn [acc [k v]]
                 (update-in acc [k] (fnil + 0) v))
               {})
       (sort-by val >)
       (take 10)))
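The same top-errors logic, run on an RDD instead of a local seq, might look like this with sparkling (a sketch; parse-line is the same assumed helper as above, and we collect before sorting because the result is small):

```clojure
(ns example.top-errors
  (:require [sparkling.core :as spark]))

;; assumes parse-line from the slide above is available
(defn top-errors-rdd [lines-rdd]
  (->> lines-rdd
       (spark/map parse-line)
       (spark/filter (fn [entry] (not= "200" (:status entry))))
       (spark/map-to-pair (fn [entry] (spark/tuple (:uri entry) 1)))
       (spark/reduce-by-key +)
       (spark/collect)               ; small result: safe to pull to the driver
       (sort-by #(._2 %) >)          ; Tuple2's second slot holds the count
       (take 10)))
```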
47. Spark and Functional Programming
• Spark is inspired by FP
• Not surprising – Scala is a functional programming language
• RDDs are immutable values
• Resilience: caches can be discarded
• DAG of transformations
• Philosophically close to Clojure
48. Processing RDDs
So your application
• defines (source) RDDs,
• transforms them (which creates new RDDs with dependencies on the source RDDs)
• and runs actions on them to get results back to the driver.
This defines a Directed Acyclic Graph (DAG) of operators.
Spark compiles this DAG of operators into a set of stages, where the boundary between two stages is a shuffle phase.
Each stage contains tasks, working on one partition each.
49.–51. Processing RDDs: Example
sc.textFile("/some-hdfs-data")               // RDD[String]
  .map(line => line.split("\t"))             // RDD[Array[String]]
  .map(parts => (parts(0), parts(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                     // RDD[(String, Int)]
  .collect()                                 // Array[(String, Int)]
Execution Graph: Spark splits this pipeline into two stages at the shuffle introduced by reduceByKey.
Stage 1: read HDFS split, apply both maps, partial reduce, write shuffle data.
Stage 2: read shuffle data, final reduce, send result to driver.
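The Scala pipeline above translates to sparkling almost one-to-one (a sketch; text-file, map-to-pair, and reduce-by-key are sparkling's wrappers for the corresponding Spark operations):

```clojure
(ns example.pipeline
  (:require [clojure.string :as str]
            [sparkling.core :as spark]))

;; Same DAG as the Scala example: textFile -> map -> map -> reduceByKey -> collect.
;; The shuffle for reduce-by-key is the stage boundary here, too.
(defn sum-by-key [sc]
  (->> (spark/text-file sc "/some-hdfs-data")
       (spark/map-to-pair
         (fn [line]
           (let [[k v] (str/split line #"\t")]
             (spark/tuple k (Long/parseLong v)))))
       (spark/reduce-by-key +)
       (spark/collect)))
```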
52. Dynamic Types for Data Processing
• Clojure's strength: developer-friendly wrapper for a complex interior
• Static types everywhere
• Imperfect data
• For this use case, static typing can get in the way
• Jobs naturally represented as transformations of Clojure data structures
53. Data Exploration
• Working in real time with big datasets
• Great for data mining
• Clojure's powerful REPL
• Gorilla REPL for live plotting?
54. Summary: Why Spark(ling)
Data sharing: Hadoop is made for a single map-reduce pass; it needs to write intermediate results to HDFS.
Interactive data exploration: Spark keeps data in memory, opening the possibility of interactively working with TBs of data.
Hadoop (and Hive and Pig) lacks an easy way to implement unit tests, so writing your own code is error-prone and the development cycle is slooooow.
56. Running your spark code
Run locally: e.g. inside tests. Use "local" or "local[*]" as the Spark master.
Run on a cluster: either directly addressing Spark or (our case) running on top of YARN.
Both open a web interface on http://host:4040/.
Using the REPL: Open a SparkContext, define RDDs and store them in vars, perform transformations on these. Develop stuff in the REPL, then transfer your REPL stuff into tests.
Run inside of tests: Open a local SparkContext, feed mock data, run jobs. Therefore: design for testability!
Submit a Spark job using "spark-submit" with proper arguments (see upload.sh, run.sh).
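A test-conf like the one used in the tests above can be sketched as follows (an assumption on our part; real sparkling apps usually also configure the Kryo serializer, see the sparkling docs):

```clojure
(ns example.test-support
  (:require [sparkling.conf :as conf]))

;; A SparkConf for local test runs: all local cores, no cluster needed.
(defn test-conf []
  (-> (conf/spark-conf)
      (conf/master "local[*]")
      (conf/app-name "test")))
```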
57. Best Practices / Dos and Don'ts
Shuffling is very expensive, so try to avoid it:
• Never, ever let go of your Partitioner: this has huuuuuuge performance impact. Use map-values instead of map, keep the partitioner when re-keying for a join, etc.
• This equals: Keep your execution plan slim.
There are some tricks for this, all boiling down to proper design of your data models.
Use broadcasting where necessary.
You need to monitor memory usage: the inability to keep stuff in memory causes spills to disk (e.g. while shuffling), and that will kill you. Tune total memory and/or the cache/shuffle ratios.
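The map-values rule can be illustrated like this (a sketch with assumed names; map over a PairRDD may change keys, so Spark drops the partitioner, while map-values provably keeps keys and hence the partitioner):

```clojure
(ns example.partitioners
  (:require [sparkling.core :as spark]))

;; BAD: rebuilding tuples via map-to-pair makes Spark forget the
;; partitioner; the next join will shuffle again.
(defn activate-all-bad [campaigns-by-id]
  (spark/map-to-pair
    (fn [t] (spark/tuple (._1 t) (assoc (._2 t) :active true)))
    campaigns-by-id))

;; GOOD: map-values leaves keys (and thus the partitioner) intact.
(defn activate-all [campaigns-by-id]
  (spark/map-values
    (fn [c] (assoc c :active true))
    campaigns-by-id))
```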
59. Example: Matrix Multiplication
• Repeatedly multiply sparse matrix and vector
Links (url, neighbors) and Ranks (url, rank), iteration 1, iteration 2, iteration 3, ...
The same file is read over and over.
60. Spark can do much better
• Using cache(), keep neighbors in memory
• Do not write intermediate results to disk
Links (url, neighbors) join Ranks (url, rank) in each iteration, but the same RDD is grouped over and over.
61. Spark can do much better still
• Do not partition neighbors every time: with partitionBy, Links and Ranks stay on the same node.
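In sparkling, the cache() trick from slide 60 is a single call (a sketch; spark/cache wraps JavaRDD.cache()):

```clojure
(ns example.iteration
  (:require [sparkling.core :as spark]))

;; Keep the links RDD in memory across iterations instead of
;; re-reading and re-partitioning it from HDFS every time.
(defn prepare-links [links-rdd]
  (spark/cache links-rdd))
```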
63. Why did I start gorillalabs/sparkling?
First, there was clj-spark from The Climate Corporation. Very basic, not maintained anymore.
Then I found out about flambo from Yieldbot. Looked promising at first: fresh release, maybe used in production at Yieldbot. Small jobs were developed fast with Spark.
I ran into sooooo many problems (running on Spark Standalone, moving to YARN, fighting with low memory). Nothing to do with flambo, but with understanding the nuts and bolts of Spark, YARN, and other elements of my infrastructure. Ok, some with serializing my Clojure data structures.
Scaling up the amount of data led me directly into hell. My system was way slower than our existing solution. Was Spark the wrong way? I was completely like this guy: http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html: "Spark should be better than MapReduce (if only it worked)"
After some thinking, I found out what happened: flambo promised to keep me in Clojure-land. Therefore, it uses a map operation to convert Scala Tuple2s to Clojure vectors and back again where necessary. But map loses your Partitioner information. Remember my point? So flambo broke Einstein's "as simple as possible, but no simpler".
I fixed the library and incorporated a different take on serializing functions (without reflection). That's when I released gorillalabs/sparkling.
I needed to tweak the data model to have the same partitioner all over the place, or use hand-crafted data structures and broadcasts for those not fitting my model. I ended up with code generating an index structure from an RDD, sorted tree-sets for date-ranged data, and so forth. And everything is fully unit-tested, because that's the only way to go.
Now my system outperforms a much bigger MySQL-based system on a local master, and scales almost linearly with cores on a cluster. HURRAY!
64. Having nrepl / GorillaREPL is so nice!
Having an nrepl open on my cluster is so nice, since I can inspect stuff in my computation. Ever wondered what that intermediate RDD contains? Just (spark/take rdd 10) it.
Using GorillaREPL, it’s like a visual workbench for big data analysis. See for
yourself: http://bit.ly/1C7sSK4
67. References
http://lintool.github.io/SparkTutorial/ (where you can find the slides used in this presentation)
https://speakerdeck.com/ecepoi/apache-spark-at-viadeo
https://speakerdeck.com/ecepoi/viadeos-segmentation-platform-with-spark-on-mesos
https://speakerdeck.com/rxin/advanced-spark-at-spark-summit-2014
68. Sources
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.
(Both available as PDFs)