Apache Spark
Super Happy Funtimes
Who am I?
● Prefered pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Whats going to be covered?
● A quick intro into what Spark is & the kinds of problems its good for solving
● A brief look at some simple API
● For people that already know Spark a detour into how the API can bight us
● Pictures of cats and other cute animals
● Some resources if you want to learn more about Spark
● Excessive use of the wordcount example
Photo by: Ana Sofia
Guerreirinho
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
What is Spark?
How do I make MapReduce tasks faster?
What is Spark?
● General purpose distributed system
○ With a really nice API
● Apache project (one of the most
active)
● Must faster than Hadoop
Map/Reduce
● Good when too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
The different pieces of Spark: 2.0+
Apache Spark
SQL &
DataFrames
Streaming
Language
APIs
Scala,
Java,
Python, &
R
Graph
Tools
Spark
ML
bagel &
Graph X
MLLib
Community
Packages
Structured
Streaming
RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Recomputed on node failure
● Distributed across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile(“output”)
Photo By: Will
Keightley
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
daniilr
The same example in Scala:
val words = rdd.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey(c1, c2 => c1 + c2)
wordCounts.saveAsTextFile("wc.out")
Huang
Yun
Chung
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
Tomomi
GroupByKey
reduceByKey
groupByKey - just how evil is it?
● Pretty evil
● Groups all of the records with the same key into a single record
○ Even if we immediately reduce it (e.g. sum it or similar)
○ This can be too big to fit in memory, then our job fails
● Unless we are in SQL then happy pandas
PROgeckoam
So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R) (T, R)]
Tomomi
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
So what was the point of that?
● Spark has a really powerful API
● But doesn’t go a very good job of keeping you from shooting yourself in the
foot
● Just because something works on a small dataset doesn’t mean it will work
on a large dataset
Photo by: A.Davey
Psst: Do you <3 functional programming?
● Spark can be the trojan horse for functional
programming
● If you </3 functional programming its ok - Python, Java,
R, and SQL APIs exist too
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Coming soon:
Spark in Action
Coming soon:
High Performance Spark
And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Give us your (or your employers) money!
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
And some upcoming talks:
● September
○ This meetup (yay)! Also I’ll have office half-hour at like 10:00am if
anyone wants.* (Starbucks @ 11302 Euclid Ave, Cleveland)
○ Toronto - Beyond the Code (Contributing to Spark) & PyLadies
(Getting Started with Spark)
○ New York City Strata Conf (Structured Streaming & Machine Learning)
● October
○ PyData DC - Making Spark go fast in Python (vroom vroom)
○ Salt Lake City Spark Meetup - TBD
○ London - OSCON - Getting Started Contributing to Spark
● December
○ Strata Singapore (Introduction to Datasets)
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)

Apache Spark Super Happy Funtimes - CHUG 2016

  • 1.
  • 2.
    Who am I? ●Prefered pronouns are she/her ● I’m a Principal Software Engineer at IBM’s Spark Technology Center ● previously Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & Fast Data processing with Spark ○ co-author of a new book focused on Spark performance coming out this year* ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 3.
    Whats going tobe covered? ● A quick intro into what Spark is & the kinds of problems its good for solving ● A brief look at some simple API ● For people that already know Spark a detour into how the API can bight us ● Pictures of cats and other cute animals ● Some resources if you want to learn more about Spark ● Excessive use of the wordcount example Photo by: Ana Sofia Guerreirinho
  • 4.
    Cat photo fromhttp://galato901.deviantart.com/art/Cat-on-Work-Break-173043455 Photo from Cocoa Dream
  • 5.
    What is Spark? Howdo I make MapReduce tasks faster?
  • 6.
    What is Spark? ●General purpose distributed system ○ With a really nice API ● Apache project (one of the most active) ● Must faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 7.
    The different piecesof Spark: 2.0+ Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Structured Streaming
  • 8.
    RDDs: Spark’s Primaryabstraction RDD (Resilient Distributed Dataset) ● Recomputed on node failure ● Distributed across the cluster ● Lazily evaluated (transformations & actions) Helen Olney
  • 9.
    Word count (inpython) lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(“output”) Photo By: Will Keightley
  • 10.
    Word count (inpython) lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile("output") No data is read or processed until after this line This is an “action” which forces spark to evaluate the RDD daniilr
  • 11.
    The same examplein Scala: val words = rdd.flatMap(line => line.split(" ")) val wordPairs = words.map(word => (word, 1)) val wordCounts = wordPairs.reduceByKey(c1, c2 => c1 + c2) wordCounts.saveAsTextFile("wc.out")
  • 12.
  • 13.
    Let’s revisit wordcountwith groupByKey val words = rdd.flatMap(_.split(" ")) val wordPairs = words.map((_, 1)) val grouped = wordPairs.groupByKey() grouped.mapValues(_.sum) Tomomi
  • 14.
  • 15.
  • 16.
    groupByKey - justhow evil is it? ● Pretty evil ● Groups all of the records with the same key into a single record ○ Even if we immediately reduce it (e.g. sum it or similar) ○ This can be too big to fit in memory, then our job fails ● Unless we are in SQL then happy pandas PROgeckoam
  • 17.
    So what doesthat look like? (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R) (67843, T, R)(10003, A, R) (94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R) (T, R)] Tomomi
  • 18.
    So what didwe do instead? ● reduceByKey ○ Works when the types are the same (e.g. in our summing version) ● aggregateByKey ○ Doesn’t require the types to be the same (e.g. computing stats model or similar) Allows Spark to pipeline the reduction & skip making the list We also got a map-side reduction (note the difference in shuffled read)
  • 19.
    So what wasthe point of that? ● Spark has a really powerful API ● But doesn’t go a very good job of keeping you from shooting yourself in the foot ● Just because something works on a small dataset doesn’t mean it will work on a large dataset Photo by: A.Davey
  • 20.
    Psst: Do you<3 functional programming? ● Spark can be the trojan horse for functional programming ● If you </3 functional programming its ok - Python, Java, R, and SQL APIs exist too
  • 21.
    Learning Spark Fast Data Processingwith Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark
  • 22.
    Learning Spark Fast Data Processingwith Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action Coming soon: High Performance Spark
  • 23.
    And the nextbook….. First four chapters are available in “Early Release”*: ● Buy from O’Reilly - http://bit.ly/highPerfSpark ● Give us your (or your employers) money! Get notified when updated & finished: ● http://www.highperformancespark.com ● https://twitter.com/highperfspark * Early Release means extra mistakes, but also a chance to help us make a more awesome book.
  • 24.
    Spark Videos ● ApacheSpark Youtube Channel ● My Spark videos on YouTube - ○ http://bit.ly/holdenSparkVideos ● Spark Summit 2014 training ● Paco’s Introduction to Apache Spark Paul Anderson
  • 25.
    And some upcomingtalks: ● September ○ This meetup (yay)! Also I’ll have office half-hour at like 10:00am if anyone wants.* (Starbucks @ 11302 Euclid Ave, Cleveland) ○ Toronto - Beyond the Code (Contributing to Spark) & PyLadies (Getting Started with Spark) ○ New York City Strata Conf (Structured Streaming & Machine Learning) ● October ○ PyData DC - Making Spark go fast in Python (vroom vroom) ○ Salt Lake City Spark Meetup - TBD ○ London - OSCON - Getting Started Contributing to Spark ● December ○ Strata Singapore (Introduction to Datasets)
  • 26.
    k thnx bye! Ifyou care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau Any PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)