A tour of PySpark Streaming in Apache Spark, with an example calculating CPU usage using the Docker stats API. Two buzzwordy technologies for the price of one.
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containers. 2015-02-24, Washington DC Apache Spark Interactive
2. Rapid Prototyping in
PySpark Streaming
The Thermodynamics of Docker Containers
Rich Seymour @rseymour
Washington DC Area Apache Spark Interactive Meetup
10. Docker
Open Sourced by dotCloud March 2013
Switched from LXC to libcontainer March 2014
Written in Go
Allows us to contain dependencies with:
cgroups, namespaces, capabilities, netlink, netfilter, etc.
Currently Linux only, but supported on Amazon, Google,
Microsoft, and Red Hat cloud offerings
Gives us a registry and a method for pulling binary diffs
15. “a fast and general
engine for large-scale
data processing”
16. Spark
Apache project born out of UC Berkeley’s Algorithms,
Machines, and People Lab (AMPLab)
Java / Scala / Python APIs for computing on resilient
distributed datasets across a cluster of multicore machines.
17. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
18. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Immutable: can’t be changed over time.
If you want to preserve a change, create a new
RDD on the left of the equals sign.
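A minimal sketch of that idea (mine, not from the deck), assuming a local SparkContext:

from pyspark import SparkContext

sc = SparkContext("local", "immutability-demo")
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)  # a brand new RDD; `nums` is untouched
print(nums.collect())     # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]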
19. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Partitioned: often split up by key with a partitioner,
if your RDD is made up of key/value pairs:
my_rdd = [(1, "Apple"), (2, "IBM")]
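A hedged sketch of what that looks like in practice; `sc` is assumed to be an existing SparkContext, and partitionBy uses the default hash partitioner on the key:

my_rdd = sc.parallelize([(1, "Apple"), (2, "IBM")])
partitioned = my_rdd.partitionBy(2)  # split across 2 partitions by key
print(partitioned.glom().collect())  # one list of pairs per partition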
25. Pipes circa 1964 – Doug McIlroy
25
Summary--what's most important.
To put my strongest concerns into a nutshell:
1. We should have some ways of coupling programs like
garden hose--screw in another segment when it becomes
necessary to massage data in another way.
This is the way of IO also.
2. Our loader should be able to do link-loading and
controlled establishment.
3. Our library filing scheme should allow for rather
general indexing, responsibility, generations, data path
switching.
4. It should be possible to get private system components
(all routines are system components) for buggering around
with.
M. D. McIlroy
October 11, 1964
Interesting side notes: http://www.cs.dartmouth.edu/~doug/sieve/
26. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
27.
Wolfsburg – Inside the Volkswagen Plant photo by Roger
https://www.flickr.com/photos/24736216@N07/5869083813/
Such parallel efficiency!
31. PySpark is in some ways
just helpers for Functional
Programming in Python
32. Functional Programming in Python
Please check out this article by Mary Rose Cook (@maryrosecook) in
which she writes:
“Functional code is characterized by one thing:
the absence of side effects”
https://codewords.hackerschool.com/issues/one/an-introduction-to-functional-programming
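A tiny illustration of that idea (mine, not Mary Rose Cook’s): the pure version depends only on its input, which is why Spark can safely run it on any partition, in any order:

totals = []

def impure_double(x):
    totals.append(x * 2)  # side effect: mutates shared state
    return x * 2

def pure_double(x):
    return x * 2          # no side effects: output depends only on x

doubled = list(map(pure_double, [1, 2, 3]))  # [2, 4, 6]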
36. cpu shares
Each container gets 1024 by default. Unless you specify a
different value, everything is equal. As soon as you do, the
scheduler steps in.
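A minimal sketch of that setup, assuming a local docker CLI; the container names and busy-loop workload here are illustrative, but --cpu-shares is the real flag. With 512 vs. 1024 shares, the second container should get roughly twice the CPU once both are saturated:

import subprocess

for name, shares in [("iso_small", "512"), ("iso_large", "1024")]:
    subprocess.check_call([
        "docker", "run", "-d", "--name", name,
        "--cpu-shares", shares,
        "busybox", "sh", "-c", "while true; do :; done",
    ])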
37. Thermodynamics...
The idea was to run Docker containers that each try to use all of the
CPU, then limit them with cgroups and see how well that works.
"Tolman & Einstein" by Los Angeles Times. Original uploader was Tillman at en.wikipedia - Transferred from en.wikipedia; transfer was stated to be made by User:chazchaz101.(Original text :
Los Angeles Times photographic archive, UCLA Library [1]). Licensed under Public Domain via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Tolman_%26_Einstein.jpg#mediaviewer/File:Tolman_%26_Einstein.jpg
45.
def calculate_cpu_percent(prev_cpu, prev_sys, stats):
    cpu_percent = 0.0
    cpu_delta = float(stats['cpu_stats']['cpu_usage']['total_usage']) - prev_cpu
    system_delta = float(stats['cpu_stats']['system_cpu_usage']) - prev_sys
    if system_delta > 0.0 and cpu_delta > 0.0:
        cpu_percent = (cpu_delta / system_delta) * \
            float(len(stats['cpu_stats']['cpu_usage']['percpu_usage'])) * 100.0
    return cpu_percent
Really easy to do in a for loop (e.g.
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2640-L2665
and
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2758-L2771).
But not straightforward over a DStream of RDDs.
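For contrast, a sketch of the for-loop version, assuming `stats_stream` yields decoded stats dicts (e.g. json.loads over lines from the stats API). Carrying prev_cpu and prev_sys between iterations is exactly the state that doesn't translate directly to a DStream:

prev_cpu = prev_sys = 0.0
for stats in stats_stream:
    print(calculate_cpu_percent(prev_cpu, prev_sys, stats))
    prev_cpu = float(stats['cpu_stats']['cpu_usage']['total_usage'])
    prev_sys = float(stats['cpu_stats']['system_cpu_usage'])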
54.
keyed_up = (stats
            .map(safe_load)
            .filter(lambda x: x is not None)
            .flatMap(key_up)
            .filter(lambda x: x is not None)
            .groupByKeyAndWindow(20, 5))
(('isosystem_large_1', 'total_usage'), 54016836033.0)
(('isosystem_large_1', 'total_usage'), 54016936033.0)
(('isosystem_large_1', 'total_usage'), 54017036033.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870000000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870200000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870400000.0)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
So now groupByKeyAndWindow groups each 20-second window of data by
key, then slides the window forward by 5 seconds to keep a moving delta.
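The deck doesn't show safe_load, key_up, or the step that turns each window into a delta; a minimal reconstruction consistent with the tuples above might be (the container name is assumed to already be in the dict, since the raw stats API doesn't include it):

import json

def safe_load(line):
    # Parse one line of stats JSON; bad lines become None and get filtered.
    try:
        return json.loads(line)
    except ValueError:
        return None

def key_up(stats):
    # Fan one stats dict out into ((container, metric), value) pairs.
    try:
        name = stats['name']
        cpu = stats['cpu_stats']
        return [
            ((name, 'total_usage'), float(cpu['cpu_usage']['total_usage'])),
            ((name, 'system_cpu_usage'), float(cpu['system_cpu_usage'])),
            ((name, 'tot_cpus'), len(cpu['cpu_usage']['percpu_usage'])),
        ]
    except (KeyError, TypeError):
        return []

# For monotonically increasing counters, max - min over each window is
# last-minus-first, i.e. the moving delta we need downstream.
deltas = keyed_up.mapValues(lambda vals: max(vals) - min(vals))
deltas.pprint()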