Beyond Wordcount:
With Datasets & Scaling
With a Wordcount example as well, because guilt
@holdenkarau
Presented at Nike PDX Jan 2018
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
Who do I think you all are?
● Nice people*
● Possibly some knowledge of Apache Spark?
● Interested in using Spark Datasets
● Familiar-ish with Scala or Java or Python
Amanda
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
What we are going to explore together!
● What are Spark Datasets
● Where they fit into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● How Datasets are different from DataFrames
● The limitations of Datasets :( and then working beyond them
Ryan McGilchrist
The different pieces of Spark
[Architecture diagram: the Apache Spark core, with SQL, DataFrames & Datasets; Structured Streaming; Spark ML; MLLib; Streaming; Bagel & GraphX; GraphFrames; APIs in Scala, Java, Python, & R]
Paul Hudson
Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
Rikki's Refuge
Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more Hive UDFs!
● Nice performance of Spark SQL with the flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of the Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
What is the performance like?
Andrew Skudder
How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● Non-JVM languages: does more computation in the JVM
Andrew Skudder
How much more space efficient?
And in Python itโ€™s even more important!
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Why people come to Spark:
"Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?"
dougwoods
Why people come to Spark:
"My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that..."
brownpau
Plus a little magic :)
Steven Saus
Magic has its limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
_torne
Bad word count RDD :(
# rdd is an RDD of lines, e.g. rdd = sc.textFile(src)
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
# groupByKey ships every single (word, 1) pair over the network
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
Ford Pinto by Morven
Mini “fix”: Datasets (aka DataFrames)
● Still super powerful
● Still allow arbitrary lambdas
● But give you more options to “help” the optimizer
● groupBy returns a grouped data structure (GroupedData / GroupedDataset) and offers special aggregates
● Selects can push filters down for us* (sketch below)
● Etc.
Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has:
○ SQLContext and HiveContext (pre-2.0)
○ Unified in SparkSession post 2.0 (sketch below)
Petful
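A minimal sketch of getting that unified entry point in 2.0+ (the app name is illustrative):
// Minimal sketch: one SparkSession as the entry point (Spark 2.0+)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BeyondWordcount")
  .enableHiveSupport() // optional: the old HiveContext behaviour
  .getOrCreate()
val sc = spark.sparkContext // the classic SparkContext is still there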
Spark Dataset Data Types
● Requires element types to have a Spark SQL encoder
○ Many common basic types already have encoders
○ Case classes of supported types don’t require their own encoder
○ (RDDs, by contrast, support any serializable object)
● Many common “native” data types are directly supported
● Can add encoders for others (sketch below)
loiez Deniel
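A small sketch of the encoder story (the type names here are illustrative, not from the talk):
// Minimal sketch: derived encoders for case classes, kryo as a fallback
import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._ // encoders for primitives, tuples & case classes

case class Panda(id: Long, happy: Boolean)
val pandasDS = Seq(Panda(1L, true), Panda(2L, false)).toDS() // derived

// A class with no built-in encoder can fall back to kryo:
class LegacyThing(val name: String) extends Serializable
implicit val legacyEnc: Encoder[LegacyThing] = Encoders.kryo[LegacyThing]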
How to start?
● Load Data in DataFrames & Datasets - use SparkSession
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables
○ Run SQL queries against them (sketch below)
● Write programmatic queries against DataFrames
● Apply functional transformations to Datasets
U-nagi
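A quick sketch of the register-then-query flow, assuming the sample.json file shown later (the view name is illustrative):
// Minimal sketch: load, register as a view, query with SQL
val df = spark.read.format("json").load("sample.json")
df.createOrReplaceTempView("pandaPlaces") // registerTempTable pre-2.0
val missions = spark.sql(
  "SELECT name FROM pandaPlaces WHERE size(pandas) > 0")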
Loading data Spark SQL (DataSets)
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
○ spark-csv ones we will use are header & inferSchema (sketch below)
● format("formatName")
○ Built in formats include parquet, jdbc, etc.
● load("path")
Jess Johnson
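As a sketch, the full reader chain with those CSV options (file name illustrative):
// Minimal sketch: option/format/load with the csv options above
val csvDF = spark.read
  .option("header", "true")      // first row holds the column names
  .option("inferSchema", "true") // extra pass to guess column types
  .format("csv")                 // pre-2.0: "com.databricks.spark.csv"
  .load("pandas.csv")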
Loading some simple JSON data
df = sqlContext.read.format("json").load("sample.json")
Jess Johnson
What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ csv*
○ Hive
● Available as packages (sketch below)
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*built in starting in 2.0, prior to that use a package
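As a sketch, a packaged source plugs into the same reader API (the package coordinates here are the Spark 2.x-era spark-avro, and the file name is illustrative):
// Minimal sketch: a packaged data source, same reader API
// launch with: --packages com.databricks:spark-avro_2.11:4.0.0
val avroDF = spark.read
  .format("com.databricks.spark.avro")
  .load("pandas.avro")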
Sample json record
{"name":"mission",
"pandas":[{"id":1,"zip":"94110","pt":"giant",
"happy":true, "attributes":[0.4,0.5]}]}
Xiahong Chen
Getting the schema
● printSchema() for human readable
● schema for machine readable
Resulting schema:
root
|-- name: string (nullable = true)
|-- pandas: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- zip: string (nullable = true)
| | |-- pt: string (nullable = true)
| | |-- happy: boolean (nullable = false)
| | |-- attributes: array (nullable = true)
| | | |-- element: double (containsNull = false)
Simon Gรถtz
Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
Then apply some type magic
// Convert our runtime typed DataFrame into a Dataset
// This makes functional operations much easier to perform
// and fails fast if the schema does not match the type
val pandas: Dataset[RawPanda] = df.as[RawPanda]
We can also convert RDDs
// needs import spark.implicits._ in scope for toDS
def fromRDD(rdd: RDD[RawPanda]): Dataset[RawPanda] = {
  rdd.toDS
}
Nesster
What do relational transforms look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - now safe!
And some new ones:
● select
● window
● sql (register as a table and run “arbitrary” SQL)
● etc.
Writing a relational transformation:
relevant = df.select(df("place"), df("happyPandas"))
nice_places = relevant.filter(df("happyPandas") >= minHappyPandas)
Jody McIntyre
Ok word count RDD (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
Word count w/Dataframes
df = spark.read.load(src)
# Returns an RDD
words = df.select("text").flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Note: still have the double serialization here :(
Word count w/Datasets
val df = spark.read.load(src).select("text")
// if it’s a simple type we don’t have to define a case class
val ds = df.as[String]
// Returns a Dataset!
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
Note: we can’t push filters down from here on.
What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates” - see the sketch below)
○ Could sort of do this before, but not as awesomely, and only if we used reduceByKey - not groupByKey
● Pack them bits nice and tight
So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● Allows the optimizer to see what aggregates are being performed
Sherrie Thai
Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions._
df.groupBy("age").agg(min("hours-per-week"))
Easily compute multiple aggregates:
df.groupBy("age").agg(min("hours-per-week"),
avg("hours-per-week"),
max("capital-gain"))
PhotoAtelier
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true)    // convert the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
  .as[RawPanda]                        // convert the DataFrame back to a Dataset
  .select($"attributes"(0).as[Double]) // a typed query (specifies the return type)
  .reduce((x, y) => x + y)             // traditional functional reduction: arbitrary scala code :)
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
But where do DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data (200 partitions - see the sketch below)
● Default partition size when reading in is also sad
DF & RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory, cache it in memory
○ Persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Checkpointing
● Noisy clusters
○ _2 (replicated) storage levels & checkpointing can help (sketch below)
● Persist first for checkpointing
Richard Gillin
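A small sketch of the persist-then-checkpoint pattern (paths and names are illustrative):
// Minimal sketch: persist first, then checkpoint, for noisy clusters
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/checkpoints")     // illustrative path
val pandas = sc.textFile("pandas.txt")             // some RDD we will re-use
pandas.persist(StorageLevel.MEMORY_AND_DISK_SER_2) // _2 = replicated
pandas.checkpoint() // reads from the cache instead of recomputing
pandas.count()      // first action materializes the cache & checkpoint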
RDDs: They have issues too (especially with skew)
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (and it’s pretty easy to trigger)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
So what does groupByKey look like?
Input records (key, value1, value2):
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F) (94110, A, R)
(10003, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (67843, T, R)
(94110, T, R) (94110, T, R) (67843, T, R) (10003, A, R)
After groupByKey, every value for a key lands on one machine:
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing a stats model or similar)
Both allow Spark to pipeline the reduction & skip making the list, and we also get a map-side reduction (note the difference in shuffled read). A sketch of each is below.
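A sketch of both alternatives, with rdd being the RDD of lines from the earlier word count:
// Minimal sketch: the two skew-friendlier alternatives
val pairs = rdd.flatMap(_.split(" ")).map(w => (w, 1))

// reduceByKey: values merge to the same type, with map-side combining
val counts = pairs.reduceByKey(_ + _)

// aggregateByKey: result type can differ; build (sum, count) per key
val sumCounts = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1), // fold a value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)) // merge two accumulators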
What is DS functional perf like?
● Often better than RDD
● Generally not as good as relational - less understanding
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of serialization improvements
The “future*”: Awesome UDFs
● Work going on in Scala land to translate simple Scala into SQL expressions
● POC w/Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count: 5% slower than a native Scala UDF, close to 65% faster than regular Python
● Arrow - faster interchange between languages
● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it’s ok!
Andrew Skudder
*Arrow: possibly the future. I really hope so. Spark 2.3 and beyond!
Summary: Why to use Datasets
● We can solve problems tricky to solve with RDDs
○ Window operations
○ Multiple aggregations
● Fast
○ Awesome optimizer
○ Super space efficient data format
● We can solve problems tricky/frustrating to solve with DataFrames
○ Writing UDFs and UDAFs can really break your flow
Lindsay Attaway
And some upcoming talks:
● Jan
○ If there’s interest tomorrow: office hours? @holdenkarau
○ Wellington Spark + BEAM meetup
○ LinuxConf AU
○ Sydney Spark meetup
○ Data Day Texas
● Feb
○ FOSDEM - one on testing, one on scaling
○ JFokus in Stockholm - adding deep learning to Spark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out a survey: http://bit.ly/holdenTestingSpark
Want to tell me (and/or my boss) how I’m doing? http://bit.ly/holdenTalkFeedback
Bonus/support slides
● “Bonus” like the coupons you get from Bed Bath and Beyond :p
● Not as many cat pictures (I’m sorry)
● The talk really should be finished at this point… but maybe someone asked a question?
Doing the (comparatively) impossible
Hey Paul
Windowed operations
● Can compute over the past K and next J
● Really hard to do in regular Spark, super easy in SQL
[Diagram: a window sliding over a column of values]
Lucie Provencher
Window specs
import org.apache.spark.sql.expressions.Window
val spec = Window.partitionBy("age")
  .orderBy("capital-gain")
  .rowsBetween(-10, 10)
val rez = df.select(avg("capital-gain").over(spec))
Ryo Chijiiwa
UDFS: Adding custom code
// Scala
sqlContext.udf.register("strLen", (s: String) => s.length())
# Python
sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
Using UDF on a table:
First register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) FROM myTable")
Using UDFs Programmatically
// assumes: import java.sql.Timestamp
// and import org.apache.spark.sql.expressions.UserDefinedFunction
def dateTimeFunction(format: String): UserDefinedFunction = {
  import org.apache.spark.sql.functions.udf
  udf((time: Long) => new Timestamp(time * 1000))
}
val format = "dd-mm-yyyy"
// the UDF expects seconds since the epoch, so cast to LongType
df.select(df(firstCol),
  dateTimeFunction(format)(df(unixTimeStamp).cast(LongType)))