Rapid Prototyping in
PySpark Streaming
The Thermodynamics of Docker Containers
Rich Seymour @rseymour
Washington DC Area Apache Spark Interactive Meetup
Or trying the
bleeding edge
without
bleeding out
Why?
The Buzzwords
• Docker
• Spark
• Thermodynamics (not really a buzzword)
The Buzzversions
• Docker (1.5.0): we have stats
• Spark (1.2.1): PySpark Streaming!
• Thermodynamics (beta)
Docker
At a high level
(docker) containers
contain your
dependencies
Fire up multiple
services in seconds on
your laptop
Docker
• Open sourced by dotCloud in March 2013
• Switched from LXC to libcontainer in March 2014
• Written in Go
• Allows us to contain dependencies with:
  cgroups, namespaces, capabilities, netlink, netfilter, etc.
• Currently Linux only, but supported on Amazon, Google, Microsoft, and Red Hat cloud offerings
• Gives us a registry and a method for pulling binary diffs
Control Groups (cgroups)
• Started in 2007
• Basis for Docker’s control of resources
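For a concrete feel: cgroup knobs are just files. A small sketch, assuming the cgroup v1 layout Docker typically created at the time (paths vary by distro and Docker version):

# Sketch: read a container's CPU shares straight from the cgroup
# filesystem. Assumes the /sys/fs/cgroup/cpu/docker/<id>/ hierarchy;
# adjust the root for your distro.
import os

def cpu_shares(container_id, root='/sys/fs/cgroup/cpu/docker'):
    with open(os.path.join(root, container_id, 'cpu.shares')) as f:
        return int(f.read().strip())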
Cute Logo
Cute Logo
Spark
“a fast and general
engine for large scale
data processing”
Spark
An Apache project born out of UC Berkeley’s Algorithms, Machines, and People Lab (AMPLab).
Java / Scala / Python APIs for computing on resilient distributed datasets across a cluster of multicore machines.
Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Immutable: can’t be changed over time.
If you want to preserve a change, create a new
RDD on the left of the equals sign.
Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Partitioned: split up, often by key with a partitioner, if your RDD is made up of key-value pairs:
my_rdd = [(1, "Apple"), (2, "IBM")]
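To make that concrete, a minimal sketch assuming a SparkContext sc (e.g. from the pyspark shell); partitionBy hashes by key here, though range partitioning is another option:

# Sketch: hash-partition a small key-value RDD across 2 partitions.
my_rdd = sc.parallelize([(1, "Apple"), (2, "IBM")])
by_key = my_rdd.partitionBy(2)     # hash partitioner by default
print(by_key.glom().collect())     # peek at each partition's contents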
aggregate
aggregateByKey
cache
cartesian
checkpoint
coalesce
cogroup
collect
collectAsMap
combineByKey
context
count
countApprox
countApproxDistinct
countByKey
countByValue
distinct
filter
first
flatMap
flatMapValues
fold
foldByKey
foreach
foreachPartition
fullOuterJoin
getCheckpointFile
getNumPartitions
getStorageLevel
glom
groupBy
groupByKey
groupWith
histogram
id
intersection
isCheckpointed
join
keyBy
keys
leftOuterJoin
lookup
map
mapPartitions
mapPartitionsWithIndex
mapPartitionsWithSplit
mapValues
max
mean
meanApprox
min
name
partitionBy
persist
pipe
randomSplit
reduce
reduceByKey
reduceByKeyLocally
repartition
repartitionAndSortWithinPartitions
rightOuterJoin
sample
sampleByKey
sampleStdev
sampleVariance
saveAsHadoopDataset
saveAsHadoopFile
saveAsNewAPIHadoopDataset
saveAsNewAPIHadoopFile
saveAsPickleFile
saveAsSequenceFile
saveAsTextFile
setName
sortBy
sortByKey
stats
stdev
subtract
subtractByKey
sum
sumApprox
take
takeOrdered
takeSample
toDebugString
top
union
unpersist
values
variance
zip
zipWithIndex
zipWithUniqueId
94 RDD methods… (we’ll revisit these)
94 methods?
Transformations
Actions
(the same 94 methods again, this time split between transformations and actions)
94 RDD methods… and a pipe is one
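Since pipe gets the call-out, a quick sketch of what it does, again assuming a SparkContext sc:

# Sketch: RDD.pipe feeds each partition's elements to an external
# command over stdin, McIlroy-style, and returns its stdout as an RDD.
nums = sc.parallelize(['3', '1', '2'])
print(nums.pipe('sort').collect())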
Pipes circa 1964 – Doug McIlroy
Summary--what's most important.
To put my strongest concerns into a nutshell:
1. We should have some ways of coupling programs like
garden hose--screw in another segment when it becomes
necessary to massage data in another way.
This is the way of IO also.
2. Our loader should be able to do link-loading and
controlled establishment.
3. Our library filing scheme should allow for rather
general indexing, responsibility, generations, data path
switching.
4. It should be possible to get private system components
(all routines are system components) for buggering around
with.
M. D. McIlroy
October 11, 1964
Interesting side notes: http://www.cs.dartmouth.edu/~doug/sieve/
Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection
of elements that can be operated on in parallel.”
Wolfsburg – Inside the Volkswagen Plant photo by Roger
https://www.flickr.com/photos/24736216@N07/5869083813/
Such parallel efficiency!
PySpark Streaming
Creates Discretized Streams (DStreams) composed of
RDDs which can be processed in microbatches.
In Python
With a lot of caveats
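For orientation, a minimal DStream job against the Spark 1.2 API; the host, port, and app name are placeholders:

# Sketch: count lines arriving on a socket, once per 1-second microbatch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='dstream-sketch')
ssc = StreamingContext(sc, 1)                    # 1-second batches
lines = ssc.socketTextStream('localhost', 9999)  # placeholder source
lines.count().pprint()                           # count per batch, print it
ssc.start()
ssc.awaitTermination()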
cache
checkpoint
cogroup
combineByKey
context
count
countByValue
countByValueAndWindow
countByWindow
filter
flatMap
flatMapValues
foreachRDD
fullOuterJoin
glom
groupByKey
groupByKeyAndWindow
join
leftOuterJoin
map
mapPartitions
mapPartitionsWithIndex
mapValues
partitionBy
persist
pprint
reduce
reduceByKey
reduceByKeyAndWindow
reduceByWindow
repartition
rightOuterJoin
saveAsTextFiles
slice
transform
transformWith
union
updateStateByKey
window
39 DStream methods
PySpark is in some ways just helpers for functional programming in Python.
Functional Programming in Python
Please check out this article by Mary Rose Cook (@maryrosecook) in
which she writes:
“Functional code is characterized by one thing:
the absence of side effects”
https://codewords.hackerschool.com/issues/one/an-introduction-to-functional-programming
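A tiny plain-Python illustration of that point (names invented for the example):

counts = {}

def tally_impure(word):
    # Side effect: mutates shared state; the result depends on history.
    counts[word] = counts.get(word, 0) + 1

def tally_pure(counts, word):
    # Pure: same inputs always give the same output, nothing is mutated.
    new = dict(counts)
    new[word] = new.get(word, 0) + 1
    return new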
Today we’ll look at
And how they relate (kinda) to thermodynamics
CPU limits in cgroups: cpu shares
Each container gets 1024 CPU shares by default. Unless you specify a different value, every container is weighted equally; as soon as you do, the scheduler steps in.
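As a hypothetical way to set that up from Python (the image, container names, and busy loop are assumptions; --cpu-shares is Docker 1.5's weighting flag):

# Sketch: launch two CPU-burning containers with a 2:1 share ratio,
# so the scheduler weights them 2:1 only under contention.
import subprocess

for name, shares in [('isosystem_large_1', 2048), ('isosystem_medium_1', 1024)]:
    subprocess.check_call([
        'docker', 'run', '-d', '--name', name,
        '--cpu-shares', str(shares),
        'busybox', 'sh', '-c', 'while true; do :; done',
    ])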
Thermodynamics...
The idea was to run Docker containers that try to use all of the CPU, then limit them using cgroups and see how well that works.
"Tolman & Einstein" by Los Angeles Times. Original uploader was Tillman at en.wikipedia - Transferred from en.wikipedia; transfer was stated to be made by User:chazchaz101.(Original text :
Los Angeles Times photographic archive, UCLA Library). Licensed under Public Domain via Wikimedia Commons:
https://commons.wikimedia.org/wiki/File:Tolman_%26_Einstein.jpg
STOP
(Py)Spark Streaming is the wrong tool for this job.
“You must go on, I can’t go on, I’ll go on.”
Or less dramatically
Whatever! We can learn how and why PySpark works in this
situation, even though it isn’t ideal.
So, why?
"cpu_stats": {
"cpu_usage": {
"percpu_usage": [
13407699471,
40464379579,
44303391682,
15849983951
],
"total_usage": 114025454683,
"usage_in_kernelmode": 80000000,
"usage_in_usermode": 113940000000
},
"system_cpu_usage": 52519779330000000,
"throttling_data": {}
},
(all of these usage counters are in nanoseconds)
def calculate_cpu_percent(prev_cpu, prev_sys, stats):
    cpu_percent = 0.0
    cpu_delta = float(stats['cpu_stats']['cpu_usage']['total_usage']) - prev_cpu
    system_delta = float(stats['cpu_stats']['system_cpu_usage']) - prev_sys
    if system_delta > 0.0 and cpu_delta > 0.0:
        cpu_percent = (cpu_delta / system_delta) * \
            float(len(stats['cpu_stats']['cpu_usage']['percpu_usage'])) * 100.0
    return cpu_percent
Really easy to do in a for loop (e.g.
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2640-L2665 and
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2758-L2771).
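A hedged Python sketch of that for-loop shape, assuming stats arrive as one {"id": ..., "stats": ...} JSON object per line (as in the example later in the deck) and reusing calculate_cpu_percent from above:

# Sketch: the sequential version, roughly the shape of easy_perc.py.
import json
import sys

prev = {}  # container id -> (previous total_usage, previous system_cpu_usage)

for line in sys.stdin:
    msg = json.loads(line)
    cid, stats = msg['id'], msg['stats']
    prev_cpu, prev_sys = prev.get(cid, (0.0, 0.0))
    print(cid, calculate_cpu_percent(prev_cpu, prev_sys, stats))
    prev[cid] = (float(stats['cpu_stats']['cpu_usage']['total_usage']),
                 float(stats['cpu_stats']['system_cpu_usage']))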
But not straightforward over a DStream of RDDs.
RDDs generally have no guaranteed order
System CPU: 101 nanoseconds
Container X: 12 nanoseconds
Container Y: 16 nanoseconds
System CPU: 106 nanoseconds
Container X: 11 nanoseconds
Container Y: 16 nanoseconds
System CPU: 110 nanoseconds
Container X: 17 nanoseconds
Container Y: 19 nanoseconds
System CPU: 101 nanoseconds
Container X: 12 nanoseconds
Container Y: 16 nanoseconds
System CPU: 106 nanoseconds
Container X: 10 nanoseconds
Container Y: 16 nanoseconds
System CPU: 110 nanoseconds
Container X: 17 nanoseconds
Container Y: 19 nanoseconds
Out of order!!! Why not just a FIFO queue?
./stats.sh | ./easy_perc.py
❯ ./stats.sh | head -200 | ./easy_perc.py
isosystem_medium_2 0.00350599250802
isosystem_medium_2 74.9828826
isosystem_medium_1 0.00353414664914
isosystem_medium_2 93.8992185
isosystem_medium_1 91.7493855
isosystem_medium_1 99.5302075188
isosystem_medium_2 99.1333246115
isosystem_medium_2 99.6940311
isosystem_medium_1 99.6248363
isosystem_large_1 0.00706705183238
isosystem_medium_2 99.2483985
isosystem_medium_1 99.8950363
isosystem_large_1 199.3501746
isosystem_medium_1 99.2755894264
isosystem_large_1 198.486462344
isosystem_medium_2 99.6565749626
isosystem_large_1 199.3187556
isosystem_medium_1 99.1366565
isosystem_medium_2 100.2461571
isosystem_large_1 199.1434093
Not perfect, but once stable it’s nice and easy.
Let’s see how I did it in PySpark Streaming.
keyed_up = stats.map(safe_load).filter(lambda x: x != None).flatMap(key_up) \
    .filter(lambda x: x != None).groupByKeyAndWindow(20, 5)
mins = keyed_up.mapValues(min)
maxes = keyed_up.mapValues(max)
diffed = maxes.join(mins).mapValues(lambda x: x[0] - x[1])
system_diff = diffed.filter(lambda x: x[0][1] == 'system_cpu_usage')
total_diff = diffed.filter(lambda x: x[0][1] == 'total_usage')
tot_cpus = maxes.filter(lambda x: x[0][1] == 'tot_cpus')
math_me = system_diff.map(rm_subkey).join(total_diff.map(rm_subkey))
percs = math_me.mapValues(lambda x: x[1]/x[0] * 100.0 if x[0] > 0.01 else 0) \
    .join(tot_cpus.map(rm_subkey)) \
    .mapValues(lambda x: x[0]*x[1])
percs.filter(lambda x: x != None).pprint()
keyed_up = stats \
    .map(safe_load) \
    .filter(lambda x: x != None) \
    .flatMap(key_up) \
    .filter(lambda x: x != None) \
    .groupByKeyAndWindow(20, 5)
{
    "id": "isosystem_large_1",
    "stats": {
        "read": "2015-02-24T13:28:03.510603276-05:00",
        "network": {...},
        "cpu_stats": {
            "cpu_usage": {
                "total_usage": 54016836033,
                "percpu_usage": [
                    14829724030,
                    8132644889,
                    17463950886,
                    13590516228
                ],
                "usage_in_kernelmode": 50000000,
                "usage_in_usermode": 53970000000
            },
            "system_cpu_usage": 52812200870000000,
            "throttling_data": {}
        },
        "memory_stats": {...},
        "blkio_stats": {}
    }
}
Turn the JSON into 3 key-value (K, V) pairs, i.e.:
K: ('isosystem_large_1', 'total_usage')
V: 54016836033.0
(('isosystem_large_1', 'total_usage'), 54016836033.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870000000.0)
(('isosystem_large_1', 'tot_cpus'), 4)
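The deck never shows safe_load and key_up; one plausible shape for them, consistent with the three tuples above (an assumption, not the author’s actual code):

import json

def safe_load(line):
    # Tolerate malformed lines; the pipeline filters the Nones out.
    try:
        return json.loads(line)
    except ValueError:
        return None

def key_up(msg):
    # Explode one stats message into ((container, metric), value) pairs.
    try:
        cpu = msg['stats']['cpu_stats']
        return [((msg['id'], 'total_usage'), float(cpu['cpu_usage']['total_usage'])),
                ((msg['id'], 'system_cpu_usage'), float(cpu['system_cpu_usage'])),
                ((msg['id'], 'tot_cpus'), len(cpu['cpu_usage']['percpu_usage']))]
    except (KeyError, TypeError):
        return []   # emit nothing for messages missing CPU stats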
keyed_up = stats \
    .map(safe_load) \
    .filter(lambda x: x != None) \
    .flatMap(key_up) \
    .filter(lambda x: x != None) \
    .groupByKeyAndWindow(20, 5)
(('isosystem_large_1', 'total_usage'), 54016836033.0)
(('isosystem_large_1', 'total_usage'), 54016936033.0)
(('isosystem_large_1', 'total_usage'), 54017036033.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870000000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870200000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870400000.0)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
So groupByKeyAndWindow groups each key’s values over a 20-second window, sliding forward every 5 seconds to keep a moving delta.
mins = keyed_up.mapValues(min)
maxes = keyed_up.mapValues(max)

(('isosystem_large_1', 'total_usage'), 10000.0)
(('isosystem_large_1', 'total_usage'), 20000.0)
(('isosystem_large_1', 'total_usage'), 40000.0)

mins  -> (('isosystem_large_1', 'total_usage'), 10000.0)
maxes -> (('isosystem_large_1', 'total_usage'), 40000.0)
diffed = maxes.join(mins) \
    .mapValues(lambda x: x[0] - x[1])

maxes  -> (('isosystem_large_1', 'total_usage'), 40000.0)
mins   -> (('isosystem_large_1', 'total_usage'), 10000.0)
diffed -> (('isosystem_large_1', 'total_usage'), 30000.0)

Join, then map the values to their difference.
system_diff = diffed.filter(lambda x: x[0][1] == 'system_cpu_usage')
total_diff = diffed.filter(lambda x: x[0][1] == 'total_usage')
tot_cpus = maxes.filter(lambda x: x[0][1] == 'tot_cpus')
Give me 3 streams where I filter by the ‘subkey’
math_me = system_diff.map(rm_subkey) \
    .join(total_diff.map(rm_subkey))

def rm_subkey(x):
    return (x[0][0], x[1])

In other words, take (('isosystem_large_1', 'total_usage'), 30000.0)
and make ('isosystem_large_1', 30000.0);
likewise turn (('isosystem_large_1', 'system_cpu_usage'), 90000.0)
into ('isosystem_large_1', 90000.0);
then join the two:
-> ('isosystem_large_1', (90000.0, 30000.0))  # (system_cpu_usage delta, total_usage delta)
percs = math_me.mapValues(lambda x: x[1]/x[0] * 100.0 if x[0] > 0.01 else 0) \
    .join(tot_cpus.map(rm_subkey)) \
    .mapValues(lambda x: x[0]*x[1])

math_me  -> ('isosystem_large_1', (90000.0, 30000.0))
33.3 <- (30000.0 / 90000.0) * 100.0
tot_cpus -> ('isosystem_large_1', 4)
percs    -> ('isosystem_large_1', 133.3)

Join, then multiply the per-core percentage by the number of CPUs.
percs.filter(lambda x: x != None).pprint()
Finally an Action!
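For context, a sketch of the scaffolding such a job assumes; host, port, and app name are placeholders. pprint is the output operation, and nothing runs until ssc.start():

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='docker-thermo')
ssc = StreamingContext(sc, 5)                    # 5s batches line up with the (20, 5) window
stats = ssc.socketTextStream('localhost', 9999)  # wherever the JSON lines arrive
# ... build keyed_up, diffed, math_me, percs as above ...
# percs.filter(lambda x: x != None).pprint()
ssc.start()
ssc.awaitTermination()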
The future!
[SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python
[SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Coming in Spark 1.3.0!
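For example, the Spark 1.3 Kafka API from SPARK-5154 looks roughly like this; the ZooKeeper quorum, consumer group, and topic map are placeholders, and ssc is a StreamingContext as before:

from pyspark.streaming.kafka import KafkaUtils

kafka_stream = KafkaUtils.createStream(
    ssc, 'localhost:2181', 'docker-thermo', {'docker-stats': 1})
stats = kafka_stream.map(lambda kv: kv[1])   # messages arrive as (key, value) pairs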
Things I’d like to see
Could we abstract out the best parts of PySpark to work
as a pure Python library?
Even just for local use?
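Purely as a daydream, a sketch of what the core of such a library might feel like: lazy chaining over plain Python iterables (illustrative only, not an existing library):

class Chain(object):
    """A lazily evaluated, RDD-flavored wrapper around any iterable."""
    def __init__(self, it):
        self.it = it
    def map(self, f):
        return Chain(f(x) for x in self.it)
    def filter(self, f):
        return Chain(x for x in self.it if f(x))
    def collect(self):
        return list(self.it)

print(Chain(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect())
# -> [0, 4, 16, 36, 64]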
Thank You
@rseymour
Editor's Notes

  1. YARN and Mesos also support cgroups.
  2. Range or hash partitioning.
  3. There are lots of cool ways to run things in parallel from the command line; GNU parallel is one.
  4. Tolman & Einstein.
  5. http://localhost:8888/notebooks/DStream%20CPU%20Percentage.ipynb