Big Data Processing
using Apache Spark and
Clojure
Dr. Paulus Esterhazy and Dr. Christian Betz
January 2015
https://github.com/pesterhazy/, @pesterhazy
https://github.com/chrisbetz/, @chris_betz
Who uses Clojure?
Who's getting paid to
use Clojure?
Who uses Big Data?
Who uses Hadoop?
Who uses Spark?
About us
Paulus
red pinapple media GmbH
Chris
WTF is Spark?
Disclaimer: We reuse stuff from the following decks:
• "Spark Performance", Patrick Wendell, Databricks
• "Common Patterns and Pitfalls for Implementing Algorithms in Spark", Hossein Falaki (@mhfalaki, hossein@databricks.com)
• "Advanced Spark", Reynold Xin, July 2, 2014 @ Spark Summit Training
Apache Spark - an Overview
"Apache Spark™ is a fast and general engine for large-scale data processing."
Value proposition?
Spark keeps stuff in memory where possible, so intermediate results do not need I/O.
Spark allows a quicker development cycle, with proper unit tests (see later).
Spark lets you define your own data sources (JDBC in our case).
Spark lets you work with any data structures (though some work better than others).
Two Questions
“I like Clojure, why might I be interested in Spark?”
“Granted that Spark is useful, why program it in Clojure?”
That's you!
How Big Data is processed today
large amounts of data to process
Hadoop is the de-facto standard
Hadoop = MapReduce + HDFS
However, Hadoop has some limitations
Pain point: performance
Writing to disk after each map/reduce step
That's especially bad for chains of map/reduce steps and for iterative algorithms
(machine learning, PageRank)
Identified bottleneck: HDD I/O
Spark's Answer
Major innovation: data sharing between processing steps
In-memory processing
Resilient Distributed Datasets (RDDs)
Datasets: Collection of elements
Distributed: Could be on any node in the cluster.
Resilient: Could get lost (or partially lost), doesn't matter, Spark will
recompute it.
Different types of RDDs, all the same interface
Scientific Answer: RDD is an Interface!
1. Set of partitions ("splits" in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition
(1-3 capture the "lineage"; 4-5 enable optimized execution)
Example: HadoopRDD
partitions = one per HDFS block

dependencies = none

compute(part) = read corresponding block

preferredLocations(part) = HDFS block location

partitioner = none
Example: Filtered RDD
partitions = same as parent RDD

dependencies = “one-to-one” on parent

compute(part) = compute parent and filter it

preferredLocations(part) = none (ask parent)

partitioner = none
How are RDDs handled?
You create an RDD from a data source, e.g. an HDFS file, a Cassandra DB
query, or a JDBC query.
You transform RDDs (with map, filter, ...), which gives you new RDDs.
You perform an action on one RDD (like first, take, collect, count, ...) to get
the results from that RDD into your "driver".
Basic Building Blocks: RDDs
Resilient Distributed Datasets
Spark follows a functional approach: you define collections (RDDs) and functions on collections.
Sources for RDDs:
• Local collections, parallelized
• HDFS files
• Your own (e.g. JDBC-RDD)
Transformations (only a selection)
• map
• filter
Actions (only a selection)
• reduce (fn)
• count
Example pipeline: a JdbcRDD (from a query) and an HDFS file (from a path) are the source RDDs you're working on. Transformations like map, filter and join create new RDDs, and you provide your own functions in there. Finally, an action like reduce spits a result back to the driver.
RDDs in Practice
Example code: https://github.com/gorillalabs/ClojureD
In Practice 1: line count
(defn line-count [lines]
  (->> lines
       count))

(defn process [f]
  (with-open [rdr (clojure.java.io/reader "in.log")]
    (let [result (f (line-seq rdr))]
      (if (seq? result)
        (doall result)
        result))))

(process line-count)
In Practice 2: line count cont'd
(defn line-count* [lines]
  (->> lines
       s/count))

(defn new-spark-context []
  (let [c (-> (s-conf/spark-conf)
              (s-conf/master "local[*]")
              (s-conf/app-name "sparkling")
              (s-conf/set "spark.akka.timeout" "300")
              (s-conf/set conf)   ;; `conf`: presumably a map of additional settings defined elsewhere
              (s-conf/set-executor-env {"spark.executor.memory" "4G",
                                        "spark.files.overwrite" "true"}))]
    (s/spark-context c)))

(defonce sc (delay (new-spark-context)))

(defn process* [f]
  (let [lines-rdd (s/text-file @sc "in.log")]
    (f lines-rdd)))

(Compare with the plain Clojure version on the previous slide: line-count and process stay unchanged.)
Only go on when your tests are green!
(deftest test-line-count*
  (let [conf (test-conf)]
    (spark/with-context sc conf
      (testing "no lines return 0"
        (is (= 0 (line-count* (spark/parallelize sc [])))))

      (testing "a single line returns 1"
        (is (= 1 (line-count* (spark/parallelize sc ["this is a single line"])))))

      (testing "multiple lines count correctly"
        (is (= 10 (line-count* (spark/parallelize sc (repeat 10 "this is a single line")))))))))
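The test-conf helper isn't shown on this slide; a plausible local-mode definition (an assumption, not the authors' code) might look like this:

;; Hypothetical helper: a local-mode config so the tests above can spin up
;; a throwaway SparkContext.
(defn test-conf []
  (-> (s-conf/spark-conf)
      (s-conf/master "local")
      (s-conf/app-name "sparkling-test")))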
What's an RDD? What's in it?
Take e.g. a JdbcRDD (we all know relational databases...). That's your table:

     campaign_id  from        to          active
  1  123          2014-01-01  2014-01-31  true
  2  234          2014-01-06  2014-01-14  true
  3  345          2014-02-01  2014-03-31  false
  4  456          2014-02-10  2014-03-09  true

RDDs are lists of objects:

[ {:campaign-id 123 :active true}
  {:campaign-id 234 :active true}
  {:campaign-id 345 :active false}
  {:campaign-id 456 :active true}]

PairRDDs handle key-value pairs, may have partitioners assigned, and keys are not necessarily unique! Here split across two partitions:

[ #t[123 {:campaign-id 123 :active true}]
  #t[234 {:campaign-id 234 :active true}]]

[ #t[345 {:campaign-id 345 :active false}]
  #t[456 {:campaign-id 456 :active true}]]
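To get from the plain RDD of campaign maps to a PairRDD keyed by campaign id, you re-key the records yourself. A sketch, reusing the map-to-pair and tuple helpers from the later slides; campaigns-rdd stands in for the JdbcRDD above and is mocked from a local collection so the snippet is self-contained:

;; Mocked source RDD of campaign maps.
(def campaigns-rdd
  (s/parallelize @sc [{:campaign-id 123 :active true}
                      {:campaign-id 234 :active true}
                      {:campaign-id 345 :active false}
                      {:campaign-id 456 :active true}]))

;; Key the maps by :campaign-id, yielding a PairRDD of #t[id campaign-map].
(def campaigns-by-id
  (-> campaigns-rdd
      (s/map-to-pair (fn [campaign]
                       (s/tuple (:campaign-id campaign) campaign)))))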
In Practice 3: status codes
(defn parse-line [line]
  (some->> line
           (re-matches common-log-regex)   ;; common-log-regex and transform-log-entry are defined elsewhere in the example code
           rest
           (zipmap [:ip :timestamp :request :status
                    :length :referer :ua :duration])
           transform-log-entry))

(defn group-by-status-code [lines]
  (->> lines
       (map parse-line)
       (map (fn [entry] [(:status entry) 1]))
       (reduce (fn [a [k v]]
                 (update-in a [k] #((fnil + 0) % v))) {})
       (map identity)))
In Practice 4: status codes cont'd
(parse-line and the plain-Clojure group-by-status-code are unchanged from the previous slide.)

(defn group-by-status-code* [lines]
  (-> lines
      (s/map parse-line)
      (s/map-to-pair (fn [entry]
                       (s/tuple (:status entry) 1)))
      (s/reduce-by-key +)
      (s/map (sd/key-value-fn vector))
      (s/collect)))
In Practice 5: details RDD
• Lazy evaluation is explicitly forced
• Transformations vs. actions
• Serialization of Clojure functions
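A small sketch of the first two points, reusing parse-line and the sc context from the earlier slides: transformations only describe RDDs, actions force the work.

;; Nothing runs here: transformations are lazy and just build up the DAG.
(def parsed (-> (s/text-file @sc "in.log")
                (s/map parse-line)))

;; Work happens only when an action asks for a result in the driver.
(s/count parsed)    ;; runs the job
(s/take parsed 3)   ;; runs it again, unless the RDD has been cached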
In Practice 6: data sources and destinations
• Writing to HDFS
• Reading from HDFS
• HDFS is versatile: text files, S3, Cassandra
• Parallelizing regular Clojure collections
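A sketch of the reading/writing side, assuming sparkling's text-file and parallelize behave as on the earlier slides; the write goes through the underlying Java API, since the slides don't show a wrapper for it:

;; Reading: any Hadoop-supported path works (local files, hdfs://, s3n://, ...).
(def lines (s/text-file @sc "hdfs:///logs/in.log"))

;; Writing: via the underlying JavaRDD.saveAsTextFile call.
(.saveAsTextFile lines "hdfs:///logs/out")

;; Regular Clojure collections become RDDs via parallelize.
(def nums (s/parallelize @sc (range 1000)))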
In Practice 7: top errors
(defn top-errors [lines]
  (->> lines
       (map parse-line)
       (filter (fn [entry] (not= "200" (:status entry))))
       (map (fn [entry] [(:uri entry) 1]))
       (reduce (fn [a [k v]]
                 (update-in a [k] #((fnil + 0) % v))) {})
       (sort-by val >)
       (take 10)))
In Practice 8: top errors cont'd
(defn top-errors* [lines]
  (-> lines
      (s/map parse-line)
      (s/filter (fn [entry] (not= "200" (:status entry))))
      s/cache
      (s/map-to-pair (fn [entry] (s/tuple (:uri entry) 1)))
      (s/reduce-by-key +)
      ;; flip
      (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a))))
      (s/sort-by-key false) ;; descending order
      ;; flip
      (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a))))
      (s/map (sd/key-value-fn vector))
      (s/take 10)))
In Practice 9: caching
• enables data sharing across actions
• avoids repeated (de)serialization of reused data
• performance degrades gracefully: evicted partitions are simply recomputed
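A sketch of what caching buys you here: without the cache call, both actions below would re-read and re-parse the whole file.

(def entries (-> (s/text-file @sc "in.log")
                 (s/map parse-line)
                 s/cache))   ;; keep the parsed entries in memory

(s/count entries)            ;; first action computes and caches

(-> entries
    (s/filter (fn [e] (= "200" (:status e))))
    s/count)                 ;; second action reuses the cached data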
Why Use Clojure to
Write Spark Jobs?
Spark and Functional Programming
• Spark is inspired by FP
• Not surprising – Scala is a functional programming language
• RDDs are immutable values
• Resilience: caches can be discarded
• DAG of transformations
• Philosophically close to Clojure
Processing RDDs
So your application
• defines (source) RDDs,
• transforms them (which creates new RDDs with dependencies on the source RDDs)
• and runs actions on them to get results back to the driver.
This defines a Directed Acyclic Graph (DAG) of operators.
Spark compiles this DAG of operators into a set of stages, where the boundary between two stages is a shuffle phase.
Each stage contains tasks, working on one partition each.
Example
sc.textFile("/some-hdfs-data")            // RDD[String]
  .map(line => line.split("\t"))          // RDD[Array[String]]
  .map(parts =>
    (parts(0), parts(1).toInt))           // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                  // RDD[(String, Int)]
  .collect()                              // Array[(String, Int)]
(pipeline: textFile → map → map → reduceByKey → collect)
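Roughly the same pipeline in Clojure with sparkling might look like this (a sketch; it reuses the s/ and sd/ helpers from the earlier slides and leaves out the explicit 3-partition argument to reduceByKey):

(require '[clojure.string :as str])

(-> (s/text-file @sc "/some-hdfs-data")
    (s/map (fn [line] (str/split line #"\t")))                  ;; RDD of [key value-string] vectors
    (s/map-to-pair (fn [[k v]] (s/tuple k (Long/parseLong v))))
    (s/reduce-by-key +)
    (s/map (sd/key-value-fn vector))
    s/collect)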
Execution Graph
The textFile → map → map → reduceByKey → collect pipeline is split into two stages at the shuffle boundary:
Stage 1: read HDFS split, apply both maps, partial reduce, write shuffle data
Stage 2: read shuffle data, final reduce, send result to driver
Dynamic Types for Data Processing
• Clojure's strength: developer-friendly wrapper for a complex interior
• Spark's native APIs are statically typed everywhere
• Real-world data is imperfect
• For this use case, static typing can get in the way
• Jobs are naturally represented as transformations of Clojure data structures
Data Exploration
• Working in real time with big datasets
• Great for data mining
• Clojure's powerful REPL
• Gorilla REPL for live plotting?
Summary: Why Spark(ling)
Data sharing: Hadoop is built for a single map-reduce pass; it needs to write
intermediate results to HDFS.
Interactive data exploration: Spark keeps data in memory, opening up the
possibility of interactively working with TBs of data.
Hadoop (and Hive and Pig) lacks an easy way to implement unit tests, so
writing your own code is error-prone and the development cycle is slooooow.
Practical tips
Running your Spark code
Run locally: e.g. inside tests. Use "local" or "local[*]" as Spark master.
Run on a cluster: either directly against Spark standalone or (our case) on top of YARN.
Both open a web interface on http://host:4040/.
Using the REPL: Open a SparkContext, define RDDs and store them in vars, perform transformations on these. Develop stuff in
the REPL, then transfer it into tests.
Run inside of tests: Open a local SparkContext, feed mock data, run jobs. Therefore: design for testability!
Submit a Spark job using "spark-submit" with proper arguments (see upload.sh, run.sh).
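A sketch of that REPL workflow: keep the context and intermediate RDDs in vars and inspect them with cheap actions before turning the steps into tests (reuses @sc and parse-line from the earlier slides).

(def entries (-> (s/text-file @sc "in.log")
                 (s/map parse-line)))

(s/take entries 10)   ;; peek at a sample instead of collecting everything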
Best Practices / Dos and Don'ts
Shuffling is very expensive, so try to avoid it:
• Never, ever, let go of your Partitioner: this has a huuuuuuge performance impact. Use map-values instead of map, keep the
partitioner when re-keying for joins, etc.
• Put differently: keep your execution plan slim.
There are some tricks for this, all boiling down to proper design of your data models.
Use broadcasting where necessary.
You need to monitor memory usage, as the inability to keep stuff in memory will cause spills to disk (e.g. while shuffling).
This will kill you. Tune total memory and/or cache/shuffle ratios.
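To illustrate the partitioner point (a sketch; s/map-values and its argument order are assumptions here, the slides themselves only show s/map and s/map-to-pair):

;; Assume a PairRDD keyed by campaign id, hash-partitioned by reduce-by-key.
(def stats-by-campaign
  (-> (s/parallelize @sc [[123 10] [234 7] [123 5]])
      (s/map-to-pair (fn [[id clicks]] (s/tuple id clicks)))
      (s/reduce-by-key +)))

;; BAD: rebuilding the tuples with map-to-pair makes Spark drop the
;; partitioner, so the next join or reduce-by-key shuffles again.
(-> stats-by-campaign
    (s/map-to-pair (sd/key-value-fn (fn [id clicks] (s/tuple id (inc clicks))))))

;; BETTER: only touch the values; keys and partitioner stay intact
;; (assuming sparkling wraps JavaPairRDD.mapValues as map-values).
(-> stats-by-campaign
    (s/map-values inc))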
Example: Matrix Multiplication
• Repeatedly multiply a sparse matrix and a vector (the PageRank iteration)
• Links (url, neighbors) and Ranks (url, rank)
• With plain MapReduce, the same Links file is read over and over, in iteration 1, iteration 2, iteration 3, ...
Spark can do much better
• Using cache(), keep the neighbors in memory
• Do not write intermediate results to disk
• But a naive join of Links and Ranks still groups the same RDD over and over, every iteration
Spark can do much better
• Do not partition the neighbors every time: partitionBy Links once, so the matching Links and Ranks partitions end up on the same node for every join
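A sketch of that idea with sparkling plus a bit of Java interop (partitionBy and join are called on the underlying JavaPairRDD here; the slides don't show sparkling wrappers for them):

(import '[org.apache.spark HashPartitioner])

;; Links: partitioned once, cached, reused across iterations.
(def links
  (-> (s/parallelize @sc [["a" ["b" "c"]] ["b" ["a"]] ["c" ["a" "b"]]])
      (s/map-to-pair (fn [[url neighbours]] (s/tuple url neighbours)))
      (.partitionBy (HashPartitioner. 4))
      s/cache))

;; Ranks: rebuilt every iteration and joined against the stable links RDD.
(def ranks
  (-> (s/parallelize @sc [["a" 1.0] ["b" 1.0] ["c" 1.0]])
      (s/map-to-pair (fn [[url rank]] (s/tuple url rank)))))

;; Because links already carries a partitioner, only ranks gets shuffled here.
(s/take (.join links ranks) 3)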
Some anecdotes
Why did I start gorillalabs/sparkling?
First, there was clj-spark from The Climate Corporation. Very basic, not maintained anymore.
Then, I found out about flambo from yieldbot. Looked promising at first: fresh release, maybe used in production at yieldbot.
Small jobs were developed fast with Spark.
I ran into sooooo many problems (running on Spark Standalone, moving to YARN, fighting with low memory). Nothing to do with flambo, but with understanding the nuts and
bolts of Spark, YARN and other elements of my infrastructure. Ok, some with serializing my Clojure data structures.
Scaling up the amount of data led me directly into hell. My system was way slower than our existing solution. Was Spark the wrong way? I was completely like this guy:
http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html: "Spark should be better than MapReduce (if only it worked)"
After some thinking, I found out what happened: flambo promised to keep me in Clojure-land. Therefore, it uses a map operation to convert Scala Tuple2 to Clojure vector and
back again where necessary. But map loses your Partitioner information. Remember my point? So, flambo broke Einstein's "as simple as possible, but no simpler".
I fixed the library and incorporated a different take on serializing functions (without reflection). That's when I released gorillalabs/sparkling.
I needed to tweak the data model to have the same partitioner all over the place, or use hand-crafted data structures and broadcasts for those not fitting my model. I ended up
with code generating an index structure from an RDD, sorted tree-sets for date-ranged data, and so forth. And everything is fully unit-tested, because that's the only way to go.
Now, my system outperforms a much bigger MySQL-based system on a local master and scales almost linearly with the number of cores on a cluster. HURRAY!
Having nrepl / GorillaREPL is so nice!
Having an nrepl open on my cluster is so nice, since I can inspect stuff in my
computation. Ever wondered what that intermediate RDD contains? Just
(spark/take rdd 10) it.
Using GorillaREPL, it’s like a visual workbench for big data analysis. See for
yourself: http://bit.ly/1C7sSK4
References
Online
Sparkling: https://github.com/gorillalabs/sparkling
Flambo: https://github.com/yieldbot/flambo
flambo-example: https://github.com/pesterhazy/flambo-example
References
http://lintool.github.io/SparkTutorial/ (where you can find the slides used in this presentation)
https://speakerdeck.com/ecepoi/apache-spark-at-viadeo
https://speakerdeck.com/ecepoi/viadeos-segmentation-platform-with-spark-on-mesos
https://speakerdeck.com/rxin/advanced-spark-at-spark-summit-2014
Sources
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.
(Both available as PDFs)
Questions?

More Related Content

What's hot

Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Holden Karau
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Fast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGFast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGDuyhai Doan
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Holden Karau
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探台灣資料科學年會
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceDuyhai Doan
 
Dive into Catalyst
Dive into CatalystDive into Catalyst
Dive into CatalystCheng Lian
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 

What's hot (20)

Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Fast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGFast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ ING
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Collections forceawakens
Collections forceawakensCollections forceawakens
Collections forceawakens
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Dive into Catalyst
Dive into CatalystDive into Catalyst
Dive into Catalyst
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 

Viewers also liked

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are DatabasesMartin Odersky
 
Funktionales Programmieren mit Clojure
Funktionales Programmieren mit ClojureFunktionales Programmieren mit Clojure
Funktionales Programmieren mit ClojureDr. Christian Betz
 
UX w trudnych warunkach
UX w trudnych warunkachUX w trudnych warunkach
UX w trudnych warunkachAnna Liszewska
 
Finding and Closing Business from the Social Web
Finding and Closing Business from the Social Web Finding and Closing Business from the Social Web
Finding and Closing Business from the Social Web Heinz Marketing Inc
 
Daily Newsletter: 10th January, 2011
Daily Newsletter: 10th January, 2011Daily Newsletter: 10th January, 2011
Daily Newsletter: 10th January, 2011Fullerton Securities
 
Digital Marketing: Advice & Tips
Digital Marketing: Advice & TipsDigital Marketing: Advice & Tips
Digital Marketing: Advice & TipsPaul Di Gangi
 
XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)
XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)
XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)Christophe Lachnitt
 
London Best Places to Work Roadshow | ARM
London Best Places to Work Roadshow | ARMLondon Best Places to Work Roadshow | ARM
London Best Places to Work Roadshow | ARMGlassdoor
 
Challenges in stereoscopic movie making and cinema
Challenges in stereoscopic movie making and cinemaChallenges in stereoscopic movie making and cinema
Challenges in stereoscopic movie making and cinemadanielbuechele
 
Universidad nacional de cajamarca para combinar
Universidad nacional de cajamarca   para combinarUniversidad nacional de cajamarca   para combinar
Universidad nacional de cajamarca para combinarKelin Mariñas Cabrera
 
#MayoInOz Opening Keynote
#MayoInOz Opening Keynote#MayoInOz Opening Keynote
#MayoInOz Opening KeynoteLee Aase
 
Leveraging Social Media Skills
Leveraging Social Media Skills Leveraging Social Media Skills
Leveraging Social Media Skills GovLoop
 
Grudging monkeys and microservices
Grudging monkeys and microservicesGrudging monkeys and microservices
Grudging monkeys and microservicesCarlo Sciolla
 
Social Media Strategies for Events - Hanzehogeschool Groningen 290312
Social Media Strategies for Events - Hanzehogeschool Groningen 290312Social Media Strategies for Events - Hanzehogeschool Groningen 290312
Social Media Strategies for Events - Hanzehogeschool Groningen 290312EventsAcademy
 

Viewers also liked (16)

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are Databases
 
Funktionales Programmieren mit Clojure
Funktionales Programmieren mit ClojureFunktionales Programmieren mit Clojure
Funktionales Programmieren mit Clojure
 
UX w trudnych warunkach
UX w trudnych warunkachUX w trudnych warunkach
UX w trudnych warunkach
 
Finding and Closing Business from the Social Web
Finding and Closing Business from the Social Web Finding and Closing Business from the Social Web
Finding and Closing Business from the Social Web
 
prof. in eng. proj. mngt., const. mngt.
prof. in eng. proj. mngt., const. mngt.prof. in eng. proj. mngt., const. mngt.
prof. in eng. proj. mngt., const. mngt.
 
Daily Newsletter: 10th January, 2011
Daily Newsletter: 10th January, 2011Daily Newsletter: 10th January, 2011
Daily Newsletter: 10th January, 2011
 
Digital Marketing: Advice & Tips
Digital Marketing: Advice & TipsDigital Marketing: Advice & Tips
Digital Marketing: Advice & Tips
 
XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)
XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)
XopheLachnitt - Surinformation et maîtrise de l'information 4/4 (2013)
 
London Best Places to Work Roadshow | ARM
London Best Places to Work Roadshow | ARMLondon Best Places to Work Roadshow | ARM
London Best Places to Work Roadshow | ARM
 
Challenges in stereoscopic movie making and cinema
Challenges in stereoscopic movie making and cinemaChallenges in stereoscopic movie making and cinema
Challenges in stereoscopic movie making and cinema
 
Recorte Web - AAM - MediaIN
Recorte Web - AAM - MediaINRecorte Web - AAM - MediaIN
Recorte Web - AAM - MediaIN
 
Universidad nacional de cajamarca para combinar
Universidad nacional de cajamarca   para combinarUniversidad nacional de cajamarca   para combinar
Universidad nacional de cajamarca para combinar
 
#MayoInOz Opening Keynote
#MayoInOz Opening Keynote#MayoInOz Opening Keynote
#MayoInOz Opening Keynote
 
Leveraging Social Media Skills
Leveraging Social Media Skills Leveraging Social Media Skills
Leveraging Social Media Skills
 
Grudging monkeys and microservices
Grudging monkeys and microservicesGrudging monkeys and microservices
Grudging monkeys and microservices
 
Social Media Strategies for Events - Hanzehogeschool Groningen 290312
Social Media Strategies for Events - Hanzehogeschool Groningen 290312Social Media Strategies for Events - Hanzehogeschool Groningen 290312
Social Media Strategies for Events - Hanzehogeschool Groningen 290312
 

Similar to Big Data Processing using Apache Spark and Clojure

Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into SparkAshish kumar
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache SparkMarcoYuriFujiiMelo
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 

Similar to Big Data Processing using Apache Spark and Clojure (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark learning
Spark learningSpark learning
Spark learning
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into Spark
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Spark
SparkSpark
Spark
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Big Data Processing using Apache Spark and Clojure

  • 1. Big Data Processing using Apache Spark and Clojure Dr. Paulus Esterhazy and Dr. Christian Betz January 2015 https://github.com/pesterhazy/, @pesterhazy https://github.com/chrisbetz/, @chris_betz
  • 3. Who's getting paid to use Clojure?
  • 10. WTF is Spark? Patrick Wendell Databricks Spark Performance Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Advanced Spark Reynold Xin, July 2, 2014 @ Spark Summit Training Disclaimer: We reuse stuff
  • 11. Apache Spark - an Overview "Apache Spark™ is a fast and general engine for large-scale data processing." Value proposition? Spark keeps stuff in memory where possible, so intermediate results do not need I/O. Spark allows quicker development cycle with proper unit tests (see later) Spark allows to define your own data sources (JDBC in our case). Spark allows you to work with any data structures (so some are better than others).
  • 12. Two Questions “I like Clojure, why might I be interested in Spark?” “Granted that Spark is useful, why program it in Clojure?”
  • 13. Two Questions “I like Clojure, why might I be interested in Spark?” “Granted that Spark is useful, why program it in Clojure?” That's you!
  • 14. How Big Data is processed today large amounts of data to process Hadoop is the de-facto standard Hadoop = MapReduce + HDFS
  • 15. However, Hadoop has some limitations Pain point: performance Writing to disk after each map-/reduce step That's esp. bad for chains of map-/reduce steps and iterative algorithms (machine learning, PageRank) Identified Bottleneck: HDD I/O
  • 16. Spark's Answer Major innovation: data sharing between processing steps In-memory processing
  • 17. Resilient Distributed Datasets (RDDs) Datasets: Collection of elements Distributed: Could be an on any node in the cluster. Resilient: Could get lost (or partially lost), doesn't matter. Spark will recompute.
  • 18. Different types of RDDs, all the same interface Scientific Answer: RDD is an Interface! 1.  Set of partitions (“splits” in Hadoop) 2.  List of dependencies on parent RDDs 3.  Function to compute a partition" (as an Iterator) given its parent(s) 4.  (Optional) partitioner (hash, range) 5.  (Optional) preferred location(s)" for each partition “lineage” optimized execution
  • 19. Different types of RDDs, all the same interface Scientific Answer: RDD is an Interface! 1.  Set of partitions (“splits” in Hadoop) 2.  List of dependencies on parent RDDs 3.  Function to compute a partition" (as an Iterator) given its parent(s) 4.  (Optional) partitioner (hash, range) 5.  (Optional) preferred location(s)" for each partition “lineage” optimized execution Example: HadoopRDD partitions = one per HDFS block dependencies = none compute(part) = read corresponding block preferredLocations(part) = HDFS block location partitioner = none
  • 20. Different types of RDDs, all the same interface Scientific Answer: RDD is an Interface! 1.  Set of partitions (“splits” in Hadoop) 2.  List of dependencies on parent RDDs 3.  Function to compute a partition" (as an Iterator) given its parent(s) 4.  (Optional) partitioner (hash, range) 5.  (Optional) preferred location(s)" for each partition “lineage” optimized execution Example: HadoopRDD partitions = one per HDFS block dependencies = none compute(part) = read corresponding block preferredLocations(part) = HDFS block location partitioner = none Example: Filtered RDD partitions = same as parent RDD dependencies = “one-to-one” on parent compute(part) = compute parent and filter it preferredLocations(part) = none (ask parent) partitioner = none
  • 21. How are RDDs handled? You create an RDD from a data source, e.g. an HDFS file, a Cassandra DB query, or from a JDBC-Query. You transform RDDs (with map, filter, ...), which gives you new RDDs You perform an action on one RDD to get the results from that RDD into your "driver". (like first, take, collect, count, ...)
 • 22. Basic Building Blocks: RDDs (Resilient Distributed Datasets). Spark follows a functional approach: you define collections (RDDs) and functions on collections. Sources for RDDs: • Local collections, parallelized • HDFS files • Your own (e.g. a JDBC-RDD). Transformations (only a selection): • map • filter. Actions (only a selection): • reduce (fn) • count
 • 23. Basic Building Blocks: RDDs (as on slide 22). Diagram: JdbcRDD (Query), HDFS-File (Path). Sources define the basic RDDs you're working on.
 • 24. Basic Building Blocks: RDDs (as on slide 22). Diagram: JdbcRDD (Query), HDFS-File (Path) → map → filter. Sources define the basic RDDs you're working on; transformations create new RDDs.
 • 25. Basic Building Blocks: RDDs (as on slide 22). Diagram: JdbcRDD (Query), HDFS-File (Path) → map → filter → join. Sources define the basic RDDs you're working on; transformations create new RDDs.
 • 26. Basic Building Blocks: RDDs (as on slide 22). Diagram: JdbcRDD (Query), HDFS-File (Path) → map → filter → join → filter. You provide your own functions in here! Sources define the basic RDDs you're working on; transformations create new RDDs.
 • 27. Basic Building Blocks: RDDs (as on slide 22). Diagram: JdbcRDD (Query), HDFS-File (Path) → map → filter → join → filter → reduce. You provide your own functions in here! Sources define the basic RDDs you're working on; transformations create new RDDs; actions spit a result back to the Driver.
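To make the source → transformation → action flow above concrete, here is a minimal, hedged sketch using the sparkling wrapper (the s/ and s-conf/ aliases that appear on the following slides); the sample lines, the local-sc name and the "rdd-basics" app name are made up for illustration, and the RDD-first argument order simply mirrors the code shown later in this deck.

(require '[sparkling.core :as s]
         '[sparkling.conf :as s-conf])

;; source: a local SparkContext and a parallelized local collection
(def local-sc
  (s/spark-context (-> (s-conf/spark-conf)
                       (s-conf/master "local[*]")
                       (s-conf/app-name "rdd-basics"))))

(def lines (s/parallelize local-sc ["GET /index.html 200"
                                    "GET /missing 404"
                                    "GET /about.html 200"]))

;; transformation: only describes a new RDD, nothing is computed yet
(def ok-lines
  (-> lines
      (s/filter (fn [line] (.endsWith ^String line " 200")))))

;; action: pulls the result back into the driver
(s/count ok-lines)
;;=> 2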
  • 28. RDDs in Practice Example code: https://github.com/gorillalabs/ClojureD
 • 29. In Practice 1: line count
(defn line-count [lines]
  (->> lines
       count))

(defn process [f]
  (with-open [rdr (clojure.java.io/reader "in.log")]
    (let [result (f (line-seq rdr))]
      (if (seq? result)
        (doall result)
        result))))

(process line-count)
 • 30. In Practice 2: line count cont'd
(defn line-count* [lines]
  (->> lines
       s/count))

(defn new-spark-context []
  (let [c (-> (s-conf/spark-conf)
              (s-conf/master "local[*]")
              (s-conf/app-name "sparkling")
              (s-conf/set "spark.akka.timeout" "300")
              (s-conf/set conf)
              (s-conf/set-executor-env {"spark.executor.memory" "4G"
                                        "spark.files.overwrite" "true"}))]
    (s/spark-context c)))

(defonce sc (delay (new-spark-context)))

(defn process* [f]
  (let [lines-rdd (s/text-file @sc "in.log")]
    (f lines-rdd)))

(defn line-count [lines]
  (->> lines
       count))

(defn process [f]
  (with-open [rdr (clojure.java.io/reader "in.log")]
    (let [result (f (line-seq rdr))]
      (if (seq? result)
        (doall result)
        result))))

(process line-count)
 • 31. Only go on when your tests are green!
(deftest test-line-count*
  (let [conf (test-conf)]
    (spark/with-context sc conf
      (testing "no lines return 0"
        (is (= 0 (line-count* (spark/parallelize sc [])))))

      (testing "a single line returns 1"
        (is (= 1 (line-count* (spark/parallelize sc ["this is a single line"])))))

      (testing "multiple lines count correctly"
        (is (= 10 (line-count* (spark/parallelize sc (repeat 10 "this is a single line")))))))))
 • 32. What's an RDD? What's in it? Take e.g. a JdbcRDD (we all know relational databases...):
 • 33. What's an RDD? What's in it? Take e.g. a JdbcRDD (we all know relational databases...). Table with columns campaign_id, from, to, active and rows (123, 2014-01-01, 2014-01-31, true), (234, 2014-01-06, 2014-01-14, true), (345, 2014-02-01, 2014-03-31, false), (456, 2014-02-10, 2014-03-09, true).
 • 34. What's an RDD? What's in it? (table as on slide 33) That's your table.
 • 35. What's an RDD? What's in it? (table as on slide 33) That's your table. In Clojure: [{:campaign-id 123 :active true} {:campaign-id 234 :active true} {:campaign-id 345 :active false} {:campaign-id 456 :active true}]
 • 36. What's an RDD? What's in it? (as on slide 35) RDDs are lists of objects.
 • 37. What's an RDD? What's in it? (as on slide 35) RDDs are lists of objects, shown here as two partitions of tuples: [#t[123 {:campaign-id 123 :active true}] #t[234 {:campaign-id 234 :active true}]] and [#t[345 {:campaign-id 345 :active false}] #t[456 {:campaign-id 456 :active true}]]
 • 38. What's an RDD? What's in it? (as on slide 37) PairRDDs handle key-value pairs, may have partitioners assigned, keys not necessarily unique!
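As a sketch of the step from "list of maps" to PairRDD, the campaign rows above could be keyed by their id like this (assuming the SparkContext sc from slide 30 and the s/ alias used throughout these slides; the exact shape is illustrative):

(require '[sparkling.core :as s])

(def campaigns [{:campaign-id 123 :active true}
                {:campaign-id 234 :active true}
                {:campaign-id 345 :active false}
                {:campaign-id 456 :active true}])

;; plain RDD: a distributed list of Clojure maps
(def campaigns-rdd (s/parallelize @sc campaigns))

;; PairRDD: each element becomes a (key, value) tuple, keyed by campaign id
(def campaigns-by-id
  (-> campaigns-rdd
      (s/map-to-pair (fn [c] (s/tuple (:campaign-id c) c)))))

;; key-based operations (join, reduce-by-key, ...) now work on the campaign id
(s/collect campaigns-by-id)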
 • 39. In Practice 3: status codes
(defn parse-line [line]
  (some->> line
           (re-matches common-log-regex)
           rest
           (zipmap [:ip :timestamp :request :status :length :referer :ua :duration])
           transform-log-entry))

(defn group-by-status-code [lines]
  (->> lines
       (map parse-line)
       (map (fn [entry] [(:status entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (map identity)))
 • 40. In Practice 4: status codes cont'd
(defn parse-line [line]
  (some->> line
           (re-matches common-log-regex)
           rest
           (zipmap [:ip :timestamp :request :status :length :referer :ua :duration])
           transform-log-entry))

(defn group-by-status-code [lines]
  (->> lines
       (map parse-line)
       (map (fn [entry] [(:status entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (map identity)))

(defn group-by-status-code* [lines]
  (-> lines
      (s/map parse-line)
      (s/map-to-pair (fn [entry] (s/tuple (:status entry) 1)))
      (s/reduce-by-key +)
      (s/map (sd/key-value-fn vector))
      (s/collect)))
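For illustration only: wiring the Spark variant into process* from slide 30 could look like this; the returned counts are made-up example values.

(process* group-by-status-code*)
;;=> e.g. [["200" 9731] ["404" 112] ["500" 7]]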
 • 41. In Practice 5: RDD details • Lazy evaluation must be explicitly forced • Transformations vs. actions • Serialization of Clojure functions
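A small sketch of the laziness point, reusing parse-line and the sc from slide 30: the transformations only build up the RDD lineage, and it is the action at the end that actually runs the job.

;; transformations: lazy, they only describe new RDDs
(def not-found
  (-> (s/text-file @sc "in.log")
      (s/map parse-line)
      (s/filter (fn [entry] (= "404" (:status entry))))))

;; action: forces evaluation of the whole pipeline
(s/count not-found)

;; note: the anonymous function above is shipped to the executors,
;; which is why serialization of Clojure functions matters here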
 • 42. In Practice 6: data sources and destinations • Writing to HDFS • Reading from HDFS • The Hadoop I/O layer is versatile: text files, S3, Cassandra • Parallelizing regular Clojure collections
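A hedged sketch of those sources and destinations, again with the slides' s/ alias and the sc from slide 30; the paths are placeholders, and the write goes through Spark's saveAsTextFile via Java interop because the wrapper function for it is not shown in these slides.

;; reading: one RDD element per line of the (possibly HDFS-hosted) file
(def log-lines (s/text-file @sc "hdfs:///logs/access.log"))

;; parallelizing: turn a regular Clojure collection into an RDD
(def numbers (s/parallelize @sc (range 1000)))

;; writing: JavaRDD.saveAsTextFile, called via interop
(.saveAsTextFile log-lines "hdfs:///logs/access-copy")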
 • 43. In Practice 7: top errors
(defn top-errors [lines]
  (->> lines
       (map parse-line)
       (filter (fn [entry] (not= "200" (:status entry))))
       (map (fn [entry] [(:uri entry) 1]))
       (reduce (fn [a [k v]] (update-in a [k] #((fnil + 0) % v))) {})
       (sort-by val >)
       (take 10)))
 • 44. In Practice 8: top errors cont'd
(defn top-errors* [lines]
  (-> lines
      (s/map parse-line)
      (s/filter (fn [entry] (not= "200" (:status entry))))
      s/cache
      (s/map-to-pair (fn [entry] (s/tuple (:uri entry) 1)))
      (s/reduce-by-key +)
      ;; flip
      (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a))))
      (s/sort-by-key false) ;; descending order
      ;; flip
      (s/map-to-pair (sd/key-value-fn (fn [a b] (s/tuple b a))))
      (s/map (sd/key-value-fn vector))
      (s/take 10)))
 • 45. In Practice 9: caching • enables data sharing between steps • avoids repeated data (de)serialization • performance degrades gracefully when memory runs short
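A sketch of the caching idea (same conventions as above): cache the parsed RDD once, then several actions can reuse it without re-reading and re-parsing the input.

(def parsed
  (-> (s/text-file @sc "in.log")
      (s/map parse-line)
      (s/cache)))          ;; keep the parsed entries in memory

;; both actions hit the cached data instead of re-parsing the file;
;; if memory runs short, partitions are dropped and recomputed from the lineage
(s/count parsed)
(s/take parsed 5)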
  • 46. Why Use Clojure to Write Spark Jobs?
  • 47. Spark and Functional Programming • Spark is inspired by FP • Not surprising – Scala is a functional programming language • RDDs are immutable values • Resilience: caches can be discarded • DAG of transformations • Philosophically close to Clojure
 • 48. Processing RDDs. So your application • defines (source) RDDs, • transforms them (which creates new RDDs with dependencies on the source RDDs), • and runs actions on them to get results back to the driver. This defines a Directed Acyclic Graph (DAG) of operators. Spark compiles this DAG of operators into a set of stages, where the boundary between two stages is a shuffle phase. Each stage contains tasks, working on one partition each.
 • 49. Processing RDDs (as on slide 48). Example (Scala):
sc.textFile("/some-hdfs-data")               // RDD[String]
  .map(line => line.split("\t"))             // RDD[Array[String]]
  .map(parts => (parts(0), parts(1).toInt))  // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)                     // RDD[(String, Int)]
  .collect()                                 // Array[(String, Int)]
 • 50. Processing RDDs (as on slide 49). Execution graph: textFile → map → map → reduceByKey → collect, split at the shuffle boundary into Stage 1 (textFile, map, map) and Stage 2 (reduceByKey, collect).
 • 51. Processing RDDs (as on slide 50). Stage 1: read the HDFS split, apply both maps, partial reduce, write shuffle data. Stage 2: read shuffle data, final reduce, send result to driver.
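For comparison, the same pipeline written with the slides' s/ alias might look roughly like this (a sketch: the tab-separated input and the parsing are assumptions taken from the Scala example above, and the explicit partition count of the reduceByKey step is omitted).

(require '[clojure.string :as str])

(-> (s/text-file @sc "/some-hdfs-data")
    (s/map-to-pair (fn [line]
                     (let [[k v] (str/split line #"\t")]
                       (s/tuple k (Long/parseLong v)))))
    (s/reduce-by-key +)
    (s/collect))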
 • 52. Dynamic Types for Data Processing • Clojure's strength: a developer-friendly wrapper for a complex interior • Spark/Scala: static types everywhere • Real-world data is imperfect • For this use case, static typing can get in the way • Jobs are naturally represented as transformations of Clojure data structures
  • 53. Data Exploration • Working in real time with big datasets • Great for data mining • Clojure's powerful REPL • Gorilla REPL for live plotting?
 • 54. Summary: Why Spark(ling)? Data sharing: Hadoop is built for a single map-reduce pass and needs to write intermediate results to HDFS. Interactive data exploration: Spark keeps data in memory, opening the possibility of interactively working with TBs of data. Hadoop (and Hive and Pig) lacks an (easy) way to implement unit tests, so writing your own code is error-prone and the development cycle is painfully slow.
 • 56. Running your Spark code
Run locally: e.g. inside tests. Use "local" or "local[*]" as the Spark master.
Run on a cluster: either addressing Spark directly or (our case) running on top of YARN.
Both open a web interface on http://host:4040/.
Using the REPL: open a SparkContext, define RDDs and store them in vars, perform transformations on these. Develop stuff in the REPL, then transfer your REPL experiments into tests.
Run inside of tests: open a local SparkContext, feed mock data, run jobs. Therefore: design for testability!
Submit a Spark job using "spark-submit" with proper arguments (see upload.sh, run.sh).
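The test on slide 31 calls (test-conf); a minimal local configuration for it could be a sketch like the following (the app name is arbitrary, and the function simply mirrors new-spark-context from slide 30 without the cluster-specific settings).

(defn test-conf []
  (-> (s-conf/spark-conf)
      (s-conf/master "local[*]")          ;; run inside the test JVM, on all local cores
      (s-conf/app-name "sparkling-test")))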
 • 57. Best Practices / Dos and Don'ts
Shuffling is very expensive, so try to avoid it:
• Never, ever, let go of your Partitioner; this has a huge performance impact. Use map-values instead of map, keep the partitioner when re-keying for a join, etc.
• This equals: keep your execution plan slim. There are some tricks for this, all boiling down to proper design of your data models.
Use broadcasting where necessary.
You need to monitor memory usage, as the inability to store stuff in memory will cause spills to disk (e.g. while shuffling). This will kill you. Tune total memory and/or cache/shuffle ratios.
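To illustrate the partitioner advice, the sketch below contrasts a value-only transformation with a re-keying map, reusing campaigns-by-id from the PairRDD sketch above; s/map-values is assumed to wrap Spark's mapValues, and if your wrapper version lacks it, JavaPairRDD.mapValues can be called via interop instead.

(require '[sparkling.core :as s]
         '[sparkling.destructuring :as sd])

;; good: only the values change, so the existing partitioner is kept
(def active-flags
  (-> campaigns-by-id
      (s/map-values :active)))

;; risky: rebuilding the tuples via map-to-pair makes Spark forget the
;; partitioner, so the next join or reduce-by-key has to shuffle again
(def active-flags-shuffled
  (-> campaigns-by-id
      (s/map-to-pair (sd/key-value-fn (fn [k v] (s/tuple k (:active v)))))))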
 • 59. Example: Matrix Multiplication • Repeatedly multiply a sparse matrix and a vector. Diagram: Links (url, neighbors) and Ranks (url, rank) feed iteration 1, iteration 2, iteration 3, ...; the same file is read over and over.
 • 60. Example: Matrix Multiplication (as on slide 59). Spark can do much better: • Using cache(), keep the neighbors in memory • Do not write intermediate results to disk. Diagram: Links (url, neighbors) joined with Ranks (url, rank) in every iteration; the same RDD is grouped over and over.
 • 61. Example: Matrix Multiplication (as on slide 60). Spark can do even better: • Do not partition the neighbors every time. Diagram: with partitionBy, Links and Ranks end up on the same node, so the joins need no extra shuffle.
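A rough sketch of that PageRank-style loop, done via Java interop to stay close to Spark's own API (links-rdd, ranks-rdd, compute-contribs, the 8 partitions and the 10 iterations are all placeholders): partition the links once, cache them, and reuse the co-partitioned RDD in every join.

(import '[org.apache.spark HashPartitioner])

(def partitioner (HashPartitioner. 8))

;; partition the (url, neighbors) pairs once and keep them in memory
(def links
  (-> links-rdd
      (.partitionBy partitioner)
      (.cache)))

(loop [ranks (.partitionBy ranks-rdd partitioner)   ;; (url, rank) pairs, co-partitioned
       i 0]
  (if (= i 10)
    ranks
    ;; joining two RDDs that share a partitioner avoids re-shuffling the links
    (recur (compute-contribs (.join links ranks))
           (inc i))))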
 • 63. Why did I start gorillalabs/sparkling?
First, there was clj-spark from The Climate Corporation. Very basic, not maintained anymore. Then I found out about flambo from yieldbot. It looked promising at first: a fresh release, maybe used in production at yieldbot. Small jobs were developed fast with Spark.
I ran into so many problems (running on Spark Standalone, moving to YARN, fighting with low memory). Nothing to do with flambo, but with understanding the nuts and bolts of Spark, YARN and other elements of my infrastructure. Ok, some with serializing my Clojure data structures. Scaling up the amount of data led me directly into hell: my system was way slower than our existing solution. Was Spark the wrong way? I was completely like this guy: http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html: „Spark should be better than MapReduce (if only it worked)“
After some thinking, I found out what happened: flambo promised to keep me in Clojure-land. Therefore, it uses a map operation to convert Scala Tuple2 to Clojure vector and back again where necessary. But map loses your Partitioner information. Remember my point? So flambo broke Einstein's „as simple as possible, but no simpler“.
I fixed the library and incorporated a different take on serializing functions (without reflection). That's where I released gorillalabs/sparkling. I needed to tweak the data model to have the same partitioner all over the place, or use hand-crafted data structures and broadcasts for those not fitting my model. I ended up with code generating an index structure from an RDD, sorted tree sets for date-ranged data, and so forth. And everything is fully unit-tested, because that's the only way to go.
Now my system outperforms a much bigger MySQL-based system on a local master and scales almost linearly with the number of cores on a cluster. HURRAY!
 • 64. Having nREPL / GorillaREPL is so nice! Having an nREPL open on my cluster is so nice, since I can inspect stuff in my computation. Ever wondered what that intermediate RDD contains? Just (spark/take rdd 10) it. Using GorillaREPL, it's like a visual workbench for big data analysis. See for yourself: http://bit.ly/1C7sSK4
  • 67. References http://lintool.github.io/SparkTutorial/ (where you can find the slides used in this presentation) https://speakerdeck.com/ecepoi/apache-spark-at-viadeo https://speakerdeck.com/ecepoi/viadeos-segmentation-platform-with-spark-on-mesos https://speakerdeck.com/rxin/advanced-spark-at-spark-summit-2014
 • 68. Sources
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.
(Both available as PDFs)