Stream Processing
w/Python
How to avoid summoning Cthulhu*
w/your friend**
@holdenkarau
*Not a guarantee. If you summon the dark lord you may void your support contracts.
** I hope
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● YouTube (mostly Spark Videos): http://youtube.com/holdenkarau
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
What is going to be covered:
● Why you should care about stream processing in Python
● Mistakes we made trying to support Python in Spark
● How we can hack stream processing in Python today (hint: it's not great)
● Why we should avoid summoning Cthulhu
● What Apache Spark is, because I spend waaay too much time working on it
(and it does stream processing in Python).
● Options to avoid summoning Cthulhu
Torsten Reuschling
Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Likely know some Python _or_ need to work with Python
● Likely know some Apache Kafka
● May have heard about Apache Spark before
Lori Erickson
What is Spark?
● General purpose distributed system
○ With a really nice API including Python* :)
● Apache project (one of the most
active)
● Much faster than Hadoop Map/Reduce
● Has a streaming engine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
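If you haven't seen the RDD API before, a minimal (non-streaming) sketch of the flavor, assuming a local SparkContext (or the sc you get for free in the pyspark shell):

from pyspark import SparkContext

sc = SparkContext(appName="tiny-rdd-example")  # or use the shell's sc

# Reuse the talk's "business logic": map strings to their lengths,
# distributed across the cluster as an RDD.
lines = sc.parallelize(["cthulhu", "kafka", "python"])
lengths = lines.map(lambda line: str(len(line)))
print(lengths.collect())  # ['7', '5', '6']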
Why should you care?
● Lots of great libraries in Python
● Lots of amazing Data Scientists in Python
● “Productionizing” software (aka rewriting it) tends to not
be a lot of fun
● Possible solution: support Python in production better
Why you should listen to me about this...
● I’ve spent a lot of time trying to get Apache Spark (a
JVM system) to work better for Python users
● Lots of progress (some of which we will talk about
today)
● Lots of mistakes too, some of which would be really
easy to repeat in Kafka (in fact I have a branch that does
just that!**)
● Cthulhu is watching you and will eat your soul
A detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
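A minimal sketch of what "pickling + magic" means here; PySpark actually uses cloudpickle plus a worker protocol, so this is only the idea, not the real code:

import pickle

def transform(value):
    return str(len(value))

# Driver side: the function and a chunk of data both become bytes.
payload = pickle.dumps((transform, ["cthulhu", "kafka"]))

# Worker side: deserialize, apply, and serialize the results back.
func, records = pickle.loads(payload)
results = pickle.dumps([func(record) for record in records])
print(pickle.loads(results))  # ['7', '5']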
So what does that look like?
(Diagram: the driver talks to the JVM over Py4J; each Spark worker, Worker 1 through Worker K, pipes data to a Python worker process.)
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
○ If we reuse them though that isn’t the end of the world*
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
When we say 0 sense:
[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
Continued for ~400 lines
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
We can do that w/Kafka Streams...
● Why bother learning from our mistakes?
● Or more seriously, the mistakes weren’t that bad...
Our “special” business logic
def transform(input):
    """
    Transforms the supplied input.
    """
    return str(len(input))
Pargon
Let’s pretend all the world is a string:
override def transform(value: String): String = {
  // WARNING: This may summon Cthulhu
  dataOut.writeInt(value.getBytes.size)
  dataOut.write(value.getBytes)
  dataOut.flush()
  val resultSize = dataIn.readInt()
  val result = new Array[Byte](resultSize)
  dataIn.readFully(result)
  // Assume UTF8, what could go wrong? :p
  new String(result)
}
Let’s pretend all the world is a string:
def main(socket):
    # "socket" here is a file-like wrapper around the connection;
    # _read_int/_write_int are length-prefix framing helpers.
    while True:
        input_length = _read_int(socket)
        data = socket.read(input_length)
        result = transform(data)
        result_bytes = result.encode()
        _write_int(len(result_bytes), socket)
        socket.write(result_bytes)
        socket.flush()
What about things other than strings?
Use another system
● Like Spark! (oh wait)
Write it in a format Python can understand:
● Pickling (from Java)
● JSON (tiny sketch below)
● XML
Purely Python solutions
● +s: saves us from double serialization; -s: code duplication, bug
duplication, now you have two systems!
Ooor we could use Spark!
kafkaStream = KafkaUtils.createDirectStream(
    ssc, ["ex"], {"bootstrap.servers": bootstrap_servers})
# Direct stream records are (key, value) pairs; transform the values.
processed = kafkaStream.map(lambda kv: transform(kv[1]))

def write_to_kafka(partition):
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for x in partition:
        producer.send("output", x.encode())
    producer.flush()

processed.foreachRDD(
    lambda rdd: rdd.foreachPartition(write_to_kafka))
# But this summons Cthulhu anyways :(
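And a minimal sketch of the wiring around that snippet, assuming a local Kafka broker, a 10 second batch interval, and placeholder names:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer  # kafka-python, used inside write_to_kafka

bootstrap_servers = "localhost:9092"  # placeholder

sc = SparkContext(appName="python-stream-processing")
ssc = StreamingContext(sc, batchDuration=10)

# ... build kafkaStream / processed / write_to_kafka as above ...

ssc.start()
ssc.awaitTermination()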
What about Spark Structured Streaming?
● Please don’t use it yet
● Really
● It’s got an interesting API and could be really great in
the future
● But for today (2.2) it’s not ready for prime time -- or even
super saver shipping time
What can we do instead?
● Pure Python solutions (minimal sketch below)
○ +s: saves us from double serialization; -s: code duplication, bug
duplication, now you have two systems!
● Magical Better Interchange
● Jython (if you find people willing to do this let me know :p)
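To make the pure Python option concrete, a minimal single-process sketch with kafka-python, reusing the transform function from earlier (topic names and bootstrap_servers are placeholders; the real proof of concept is kafka-streams-python-cthulhu):

from kafka import KafkaConsumer, KafkaProducer

bootstrap_servers = "localhost:9092"  # placeholder

consumer = KafkaConsumer("input", bootstrap_servers=bootstrap_servers)
producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

for message in consumer:
    # message.value is raw bytes; transform expects a string
    result = transform(message.value.decode("utf-8"))
    producer.send("output", result.encode("utf-8"))

Of course this gives you none of the partitioning, state handling, or fault tolerance a real streams runtime provides, which is exactly the "now you have two systems" problem.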
How much more awesome could this be?
● Serialization improvements: Possibly up to 2x faster
○ An early Jython attempt led to a 1.5x to 2x improvement for random
UDFs
○ Alternative, more reasonable work, is being done to accelerate
interchange (with Arrow) -- not yet used for UDF acceleration but initial
proposals are progressing :D
● Likely still somewhat painful to debug (sorry!)
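A rough illustration of the Arrow interchange idea with pyarrow and pandas; this is not what Spark ships today, just a sketch of columnar batches instead of row-by-row pickling:

import pandas as pd
import pyarrow as pa

# Pretend this is a partition of data handed over from the JVM.
df = pd.DataFrame({"word": ["cthulhu", "kafka"], "count": [1, 2]})

# One columnar record batch instead of pickling every row.
batch = pa.RecordBatch.from_pandas(df)

sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# The resulting buffer can cross the process boundary cheaply.
reader = pa.ipc.open_stream(sink.getvalue())
print(reader.read_all().to_pandas())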
How much faster can avoiding copy/ser be?
Andrew Skudder
The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala
into SQL expressions - need the Dataset API
○ Maybe we can try similar approaches with Python?
● Very early work going on to use Jython for simple UDFs
(e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count 5% slower than native Scala UDF,
close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? -
http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it's
ok!
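For context, a minimal example of the kind of simple Python UDF being benchmarked (the column and table names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# A trivially simple UDF -- exactly the sort of thing Jython or
# expression translation could run without a full Python worker.
str_length = udf(lambda value: len(value) if value is not None else 0,
                 IntegerType())

df = spark.createDataFrame([("cthulhu",), ("kafka",)], ["word"])
df.select(str_length(df["word"]).alias("length")).show()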
What if I just need something working?
Today:
● Spark Streaming (DStreams) supports Python today
● Write your own client and re-publish (not that hard)
The “Future”*:
● Spark + Structured Streaming + Python acceleration
● Bring Python acceleration to
kafka-streams-python-cthulhu ?
*Far enough in the future you forgot I promised it to you.
Resources:
Proof of Concept w/Kafka Streams (str only for now):
● kafka-streams-python-cthulhu
Apache Spark
● DStreams & Structured Streaming*
Improved serialization options:
● Apache Arrow
“Driver Side” bridges:
● Py4J
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today! I brought one copy for one lucky person.
The rest of you can buy it from that scrappy Seattle
bookstore :p
http://bit.ly/hkHighPerfSpark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
And some upcoming talks:
● Data Day Seattle (SEA, Sept)
● Strata New York (NYC, Sept)
● Strange Loop (Sept/Oct)
● Spark Summit EU (Dublin, October)
● November: Possibly Big Data Spain + Bee Scala (TBD)
● Strata Singapore (Singapore, December)
● ScalaX (London, December)
● Know of interesting conferences/webinar things that
should be on my radar? Let me know!
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
