Stream Processing
w/Python
How to avoid summoning Cthulhu*
w/your friend**
@holdenkarau
*Not a guarantee. If you summon the dark lord you may void your support contracts.
** I hope
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● YouTube (mostly Spark Videos): http://youtube.com/holdenkarau
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
What is going to be covered:
● Why you should care about stream processing in Python
● Mistakes we made trying to support Python in Spark
● How we can hack stream processing in Python today (hint: it's not great)
● Why we should avoid summoning Cthulhu
● What Apache Spark is, because I spend waaay too much time working on it
(and it does stream processing in Python).
● Options to avoid summoning Cthulhu
Torsten Reuschling
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Likely know some Python _or_ need to work with Python
● Likely know some Apache Kafka
● May have heard about Apache Spark before
Lori Erickson
What is Spark?
● General purpose distributed system
○ With a really nice API including Python* :)
● Apache project (one of the most
active)
● Much faster than Hadoop Map/Reduce
● Has a streaming engine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
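If you haven't seen the RDD API before, a minimal (non-streaming) sketch of the flavor, assuming a local SparkContext (or the sc you get for free in the pyspark shell):

from pyspark import SparkContext

sc = SparkContext(appName="tiny-rdd-example")  # or use the shell's sc

# Reuse the talk's "business logic": map strings to their lengths,
# distributed across the cluster as an RDD.
lines = sc.parallelize(["cthulhu", "kafka", "python"])
lengths = lines.map(lambda line: str(len(line)))
print(lengths.collect())  # ['7', '5', '6']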
Why should you care?
● Lots of great libraries in Python
● Lots of amazing Data Scientists in Python
● “Productionizing” software (aka rewriting it) tends to not
be a lot of fun
● Possible solution: support Python in production better
Why you should listen to me about this...
● I’ve spent a lot of time trying to get Apache Spark (a
JVM system) to work better for Python users
● Lots of progress (some of which we will talk about
today)
● Lots of mistakes too, some of which would be really
easy to repeat in Kafka (in fact I have a branch that does
just that!**)
● Cthulhu is watching you and will eat your soul
A detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
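A minimal sketch of what "pickling + magic" means here; PySpark actually uses cloudpickle plus a worker protocol, so this is only the idea, not the real code:

import pickle

def transform(value):
    return str(len(value))

# Driver side: the function and a chunk of data both become bytes.
payload = pickle.dumps((transform, ["cthulhu", "kafka"]))

# Worker side: deserialize, apply, and serialize the results back.
func, records = pickle.loads(payload)
results = pickle.dumps([func(record) for record in records])
print(pickle.loads(results))  # ['7', '5']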
So what does that look like?
(Diagram: the driver talks to the JVM over Py4J; each Spark worker, Worker 1 through Worker K, pipes data to a Python worker process.)
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
○ If we reuse them though that isn’t the end of the world*
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
When we say 0 sense:
[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
Continued for ~400 lines
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
We can do that w/Kafka Streams...
● Why bother learning from our mistakes?
● Or more seriously, the mistakes weren’t that bad...
Our “special” business logic
def transform(input):
    """
    Transforms the supplied input.
    """
    return str(len(input))
Pargon
Let’s pretend all the world is a string:
override def transform(value: String): String = {
  // WARNING: This may summon Cthulhu
  dataOut.writeInt(value.getBytes.size)
  dataOut.write(value.getBytes)
  dataOut.flush()
  val resultSize = dataIn.readInt()
  val result = new Array[Byte](resultSize)
  dataIn.readFully(result)
  // Assume UTF8, what could go wrong? :p
  new String(result)
}
Let’s pretend all the world is a string:
def main(socket):
    # "socket" here is a file-like wrapper around the connection;
    # _read_int/_write_int are length-prefix framing helpers.
    while True:
        input_length = _read_int(socket)
        data = socket.read(input_length)
        result = transform(data)
        result_bytes = result.encode()
        _write_int(len(result_bytes), socket)
        socket.write(result_bytes)
        socket.flush()
What about things other than strings?
Use another system
● Like Spark! (oh wait)
Write it in a format Python can understand:
● Pickling (from Java)
● JSON (tiny sketch below)
● XML
Purely Python solutions
● +s: saves us from double serialization; -s: code duplication, bug
duplication, now you have two systems!
Ooor we could use Spark!
kafkaStream = KafkaUtils.createDirectStream(
    ssc, ["ex"], {"bootstrap.servers": bootstrap_servers})
# Direct stream records are (key, value) pairs; transform the values.
processed = kafkaStream.map(lambda kv: transform(kv[1]))

def write_to_kafka(partition):
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for x in partition:
        producer.send("output", x.encode())
    producer.flush()

processed.foreachRDD(
    lambda rdd: rdd.foreachPartition(write_to_kafka))
# But this summons Cthulhu anyways :(
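And a minimal sketch of the wiring around that snippet, assuming a local Kafka broker, a 10 second batch interval, and placeholder names:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer  # kafka-python, used inside write_to_kafka

bootstrap_servers = "localhost:9092"  # placeholder

sc = SparkContext(appName="python-stream-processing")
ssc = StreamingContext(sc, batchDuration=10)

# ... build kafkaStream / processed / write_to_kafka as above ...

ssc.start()
ssc.awaitTermination()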
What about Spark Structured Streaming?
● Please don’t use it yet
● Really
● It’s got an interesting API and could be really great in
the future
● But for today (2.2) it’s not ready for prime time -- or even
super saver shipping time
What can we do instead?
● Pure Python solutions (minimal sketch below)
○ +s: saves us from double serialization; -s: code duplication, bug
duplication, now you have two systems!
● Magical Better Interchange
● Jython (if you find people willing to do this let me know :p)
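To make the pure Python option concrete, a minimal single-process sketch with kafka-python, reusing the transform function from earlier (topic names and bootstrap_servers are placeholders; the real proof of concept is kafka-streams-python-cthulhu):

from kafka import KafkaConsumer, KafkaProducer

bootstrap_servers = "localhost:9092"  # placeholder

consumer = KafkaConsumer("input", bootstrap_servers=bootstrap_servers)
producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

for message in consumer:
    # message.value is raw bytes; transform expects a string
    result = transform(message.value.decode("utf-8"))
    producer.send("output", result.encode("utf-8"))

Of course this gives you none of the partitioning, state handling, or fault tolerance a real streams runtime provides, which is exactly the "now you have two systems" problem.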
How much more awesome could this be?
● Serialization improvements: Possibly up to 2x faster
○ An early Jython attempt led to a 1.5x to 2x improvement for random
UDFs
○ Alternative, more reasonable work, is being done to accelerate
interchange (with Arrow) -- not yet used for UDF acceleration but initial
proposals are progressing :D
● Likely still somewhat painful to debug (sorry!)
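A rough illustration of the Arrow interchange idea with pyarrow and pandas; this is not what Spark ships today, just a sketch of columnar batches instead of row-by-row pickling:

import pandas as pd
import pyarrow as pa

# Pretend this is a partition of data handed over from the JVM.
df = pd.DataFrame({"word": ["cthulhu", "kafka"], "count": [1, 2]})

# One columnar record batch instead of pickling every row.
batch = pa.RecordBatch.from_pandas(df)

sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# The resulting buffer can cross the process boundary cheaply.
reader = pa.ipc.open_stream(sink.getvalue())
print(reader.read_all().to_pandas())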
How much faster can avoiding copy/ser be?
Andrew Skudder
The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● Very early work going on to use Jython for simple UDFs
(e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it's
ok!
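For context, a minimal example of the kind of simple Python UDF being benchmarked (the column and table names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# A trivially simple UDF -- exactly the sort of thing Jython or
# expression translation could run without a full Python worker.
str_length = udf(lambda value: len(value) if value is not None else 0,
                 IntegerType())

df = spark.createDataFrame([("cthulhu",), ("kafka",)], ["word"])
df.select(str_length(df["word"]).alias("length")).show()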
What if I just need something working?
Today:
● Spark Streaming (DStreams) supports Python today
● Write your own client and re-publish (not that hard)
The “Future”*:
● Spark + Structured Streaming + Python acceleration
● Bring Python acceleration to
kafka-streams-python-cthulhu ?
*Far enough in the future you forgot I promised it to you.
Resources:
Proof of Concept w/Kafka Streams (str only for now):
● kafka-streams-python-cthulhu
Apache Spark
● DStreams & Structured Streaming*
Improved serialization options:
● Apache Arrow
“Driver Side” bridges:
● Py4J
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today! I brought one copy for one lucky person.
The rest of you can buy it from that scrappy Seattle
bookstore :p
http://bit.ly/hkHighPerfSpark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
And some upcoming talks:
● Data Day Seattle (SEA, Sept)
● Strata New York (NYC, Sept)
● Strange Loop (Sept/Oct)
● Spark Summit EU (Dublin, October)
● November: Possibly Big Data Spain + Bee Scala (TBD)
● Strata Singapore (Singapore, December)
● ScalaX (London, December)
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
