A tour of PySpark Streaming in Apache Spark, with an example calculating CPU usage using the Docker stats API. Two buzzwordy technologies for the price of one.
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containers. 2015-02-24, Washington DC Apache Spark Interactive
2. Rapid Prototyping in
PySpark Streaming
The Thermodynamics of Docker Containers
Rich Seymour @rseymour
Washington DC Area Apache Spark Interactive Meetup
10. Docker
Open Sourced by dotCloud March 2013
Switched from LXC to libcontainer March 2014
Written in Go
Allows us to contain dependencies with:
cgroups, namespaces, capabilities, netlink, netfilter, etc.
Currently Linux only, but supported on Amazon, Google,
Microsoft, and Red Hat cloud offerings
Gives us a registry and a method for pulling binary diffs
15. “a fast and general
engine for large-scale
data processing”
16. Spark
Apache project born out of UC Berkeley’s Algorithms,
Machines, and People Lab (AMPLab)
Java / Scala / Python APIs for computing on resilient
distributed datasets across a cluster of multicore machines.
17. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
18. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Immutable: can’t be changed over time.
If you want to preserve a change, create a new
RDD on the left of the equals sign.
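A minimal sketch of that idea (mine, not from the deck), assuming a local SparkContext:

from pyspark import SparkContext

sc = SparkContext("local", "immutability-demo")
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)  # a brand new RDD; `nums` is untouched
print(nums.collect())     # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]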
19. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Partitioned: often split up by key with a partitioner,
if your RDD is made up of key/value pairs:
my_rdd = [(1, "Apple"), (2, "IBM")]
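A hedged sketch of what that looks like in practice; `sc` is assumed to be an existing SparkContext, and partitionBy uses the default hash partitioner on the key:

my_rdd = sc.parallelize([(1, "Apple"), (2, "IBM")])
partitioned = my_rdd.partitionBy(2)  # split across 2 partitions by key
print(partitioned.glom().collect())  # one list of pairs per partition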
25. Pipes circa 1964 – Doug McIlroy
25
Summary--what's most important.
To put my strongest concerns into a nutshell:
1. We should have some ways of coupling programs like
garden hose--screw in another segment when it becomes
necessary to massage data in another way.
This is the way of IO also.
2. Our loader should be able to do link-loading and
controlled establishment.
3. Our library filing scheme should allow for rather
general indexing, responsibility, generations, data path
switching.
4. It should be possible to get private system components
(all routines are system components) for buggering around
with.
M. D. McIlroy
October 11, 1964
Interesting side notes: http://www.cs.dartmouth.edu/~doug/sieve/
26. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
27.
Wolfsburg – Inside the Volkswagen Plant photo by Roger
https://www.flickr.com/photos/24736216@N07/5869083813/
Such parallel efficiency!
31. PySpark is in some ways
just helpers for Functional
Programming in Python
32. Functional Programming in Python
Please check out this article by Mary Rose Cook (@maryrosecook) in
which she writes:
“Functional code is characterized by one thing:
the absence of side effects”
https://codewords.hackerschool.com/issues/one/an-introduction-to-functional-programming
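A tiny illustration of that idea (mine, not Mary Rose Cook’s): the pure version depends only on its input, which is why Spark can safely run it on any partition, in any order:

totals = []

def impure_double(x):
    totals.append(x * 2)  # side effect: mutates shared state
    return x * 2

def pure_double(x):
    return x * 2          # no side effects: output depends only on x

doubled = list(map(pure_double, [1, 2, 3]))  # [2, 4, 6]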
36. cpu shares
Each container gets 1024 by default. Unless you specify a
different value, everything is equal. As soon as you do, the
scheduler steps in.
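A minimal sketch of that setup, assuming a local docker CLI; the container names and busy-loop workload here are illustrative, but --cpu-shares is the real flag. With 512 vs. 1024 shares, the second container should get roughly twice the CPU once both are saturated:

import subprocess

for name, shares in [("iso_small", "512"), ("iso_large", "1024")]:
    subprocess.check_call([
        "docker", "run", "-d", "--name", name,
        "--cpu-shares", shares,
        "busybox", "sh", "-c", "while true; do :; done",
    ])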
37. Thermodynamics...
The idea was to run Docker containers that each try to use all of the
CPU, then limit them with cgroups and see how well that works.
"Tolman & Einstein" by Los Angeles Times. Original uploader was Tillman at en.wikipedia - Transferred from en.wikipedia; transfer was stated to be made by User:chazchaz101.(Original text :
Los Angeles Times photographic archive, UCLA Library [1]). Licensed under Public Domain via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Tolman_%26_Einstein.jpg#mediaviewer/File:Tolman_%26_Einstein.jpg
45.
def calculate_cpu_percent(prev_cpu, prev_sys, stats):
    cpu_percent = 0.0
    cpu_delta = float(stats['cpu_stats']['cpu_usage']['total_usage']) - prev_cpu
    system_delta = float(stats['cpu_stats']['system_cpu_usage']) - prev_sys
    if system_delta > 0.0 and cpu_delta > 0.0:
        cpu_percent = (cpu_delta / system_delta) * \
            float(len(stats['cpu_stats']['cpu_usage']['percpu_usage'])) * 100.0
    return cpu_percent
Really easy to do in a for loop (e.g.
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2640-L2665
and
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2758-L2771).
But not straightforward over a DStream of RDDs.
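For contrast, a sketch of the for-loop version, assuming `stats_stream` yields decoded stats dicts (e.g. json.loads over lines from the stats API). Carrying prev_cpu and prev_sys between iterations is exactly the state that doesn't translate directly to a DStream:

prev_cpu = prev_sys = 0.0
for stats in stats_stream:
    print(calculate_cpu_percent(prev_cpu, prev_sys, stats))
    prev_cpu = float(stats['cpu_stats']['cpu_usage']['total_usage'])
    prev_sys = float(stats['cpu_stats']['system_cpu_usage'])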
54.
keyed_up = (stats
            .map(safe_load)
            .filter(lambda x: x is not None)
            .flatMap(key_up)
            .filter(lambda x: x is not None)
            .groupByKeyAndWindow(20, 5))
(('isosystem_large_1', 'total_usage'), 54016836033.0)
(('isosystem_large_1', 'total_usage'), 54016936033.0)
(('isosystem_large_1', 'total_usage'), 54017036033.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870000000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870200000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870400000.0)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
So now groupByKeyAndWindow groups each 20-second window of data by
key, then slides the window forward by 5 seconds to keep a moving delta.
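The deck doesn't show safe_load, key_up, or the step that turns each window into a delta; a minimal reconstruction consistent with the tuples above might be (the container name is assumed to already be in the dict, since the raw stats API doesn't include it):

import json

def safe_load(line):
    # Parse one line of stats JSON; bad lines become None and get filtered.
    try:
        return json.loads(line)
    except ValueError:
        return None

def key_up(stats):
    # Fan one stats dict out into ((container, metric), value) pairs.
    try:
        name = stats['name']
        cpu = stats['cpu_stats']
        return [
            ((name, 'total_usage'), float(cpu['cpu_usage']['total_usage'])),
            ((name, 'system_cpu_usage'), float(cpu['system_cpu_usage'])),
            ((name, 'tot_cpus'), len(cpu['cpu_usage']['percpu_usage'])),
        ]
    except (KeyError, TypeError):
        return []

# For monotonically increasing counters, max - min over each window is
# last-minus-first, i.e. the moving delta we need downstream.
deltas = keyed_up.mapValues(lambda vals: max(vals) - min(vals))
deltas.pprint()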