A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji

A Tale of Three Apache Spark APIs: RDDs, DataFrames & Datasets
Jules S. Damji
August 5, 2017, Big Data Day Los Angeles
@2twitme

I have used Apache Spark Before…

I know the difference between
DataFrame and RDDs…

Spark Community Evangelist & Developer Advocate @ Databricks
Developer Advocate @ Hortonworks
Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest

Agenda…
• Why are we here today? What’s the problem?
• Resilient Distributed Datasets (RDDs)
• Structure in Spark
• DataFrames and Datasets
• Demo
• Q & A

Resilient Distributed Dataset (RDD)

What are RDDs?
• … Distributed data abstraction
• … Resilient & Immutable
• … Lazy
• … Compile-time type-safe
• … Semi-structured or unstructured
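
The "lazy" and "immutable" bullets can be observed directly. The following is a minimal sketch, not taken from the talk: it assumes Spark is on the classpath and runs with a local master, and the object name and the accumulator name are made up for illustration. A transformation returns a new RDD and does no work until an action is called.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch object; assumes a local Spark installation.
object RddLazinessSketch {
  def run(): (Long, Long, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("rdd-laziness-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Counts how many times the map function actually runs.
    val mapCalls = sc.longAccumulator("mapCalls")

    val numbers = sc.parallelize(1 to 5)
    // map() returns a new RDD; `numbers` itself is never modified.
    val doubled = numbers.map { n => mapCalls.add(1); n * 2 }

    // No action has been called yet, so the map function has not run.
    val callsBeforeAction: Long = mapCalls.value

    // reduce() is an action: it triggers the computation.
    val total = doubled.reduce(_ + _)
    val callsAfterAction: Long = mapCalls.value

    spark.stop()
    (callsBeforeAction, callsAfterAction, total)
  }
}
```

Calling RddLazinessSketch.run() returns the accumulator value before and after the action, plus the reduced total.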

Unstructured Data: Text (logs, tweets, articles, social)

Why Use RDDs?
• … Offer Control & flexibility
• ... Low-level API
• ... Type-safe
• ... Encourage how-to

Some code to read Wikipedia
val rdd = sc.textFile("/mnt/wikipediapagecounts.gz")
// split each line on whitespace; keep only well-formed four-field records
val parsedRDD = rdd.flatMap {
  line => line.split("""\s+""") match {
    // numRequests is converted to Long so that reduceByKey sums counts instead of concatenating strings
    case Array(project, page, numRequests, _) => Some((project, page, numRequests.toLong))
    case _ => None
  }
}
// filter only English pages
parsedRDD.filter { case (project, page, numRequests) => project == "en" }.
  map { case (_, page, numRequests) => (page, numRequests) }.
  reduceByKey(_ + _).
  take(100).
  foreach { case (page, requests) => println(s"$page: $requests") }
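
The parsing step can be tried without a cluster. This is a small sketch on a plain Scala collection rather than an RDD; the sample lines and the names ParseSketch and parseLine are invented for illustration, and the four-field layout (project, page, request count, one ignored field) is taken from the pattern match above.

```scala
// Hypothetical helper; mirrors the flatMap body used on the RDD.
object ParseSketch {
  def parseLine(line: String): Option[(String, String, Long)] =
    line.split("""\s+""") match {
      // exactly four whitespace-separated fields: keep the first three
      case Array(project, page, numRequests, _) =>
        Some((project, page, numRequests.toLong))
      // anything else is treated as malformed and dropped
      case _ => None
    }

  // Made-up sample input, not real Wikipedia data.
  val sampleLines: Seq[String] = Seq(
    "en Main_Page 42 123456",
    "de Hauptseite 7 2048",
    "malformed line"
  )

  // flatMap over Option silently discards the None results.
  val parsed: Seq[(String, String, Long)] =
    sampleLines.flatMap(line => parseLine(line))
}
```

Returning Option from the parser and using flatMap is what lets malformed lines disappear without a separate filter step.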

When to Use RDDs?
• ... You want a low-level API & control of your dataset
• ... You are dealing with unstructured data (media streams or text)
• ... You prefer to manipulate data with lambda functions rather than a DSL
• ... You don’t care about the schema or structure of the data
• ... You are willing to sacrifice optimization & performance, and accept inefficiencies

Use RDDs
Why:
• Control and flexibility
• Low-level APIs and lambda functions
• Type-safety
• How-to
When:
• Unstructured data & no schema
• No code optimization & performance
• Low-level APIs, not DSL or high-level

What’s the problem?
• ... Express the how-to of a solution, not the what-to
• ... Not optimized by Spark
• ... Slow for non-JVM languages like Python
• ... Inadvertent inefficiencies

Inadvertent inefficiencies in RDDs
parsedRDD.filter { case (project, page, numRequests) => project == "en" }.
  map { case (_, page, numRequests) => (page, numRequests) }.
  reduceByKey(_ + _).
  // special pages are dropped only after the shuffle; Spark does not reorder RDD operations
  filter { case (page, _) => !isSpecialPage(page) }.
  take(100).
  foreach { case (page, requests) => println(s"$page: $requests") }
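
One way to remove the inefficiency in the pipeline above is to move the special-page filter in front of reduceByKey by hand, so fewer records are shuffled. The sketch below is illustrative only: the sample records are invented, isSpecialPage is not defined in the talk and is given a hypothetical body here, and a local Spark master is assumed.

```scala
import org.apache.spark.sql.SparkSession

object FilterEarlySketch {
  // Hypothetical definition; the talk does not show isSpecialPage.
  def isSpecialPage(page: String): Boolean = page.startsWith("Special:")

  def run(): (Map[String, Long], Map[String, Long]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("filter-early-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up (project, page, numRequests) records.
    val parsedRDD = sc.parallelize(Seq(
      ("en", "Main_Page", 10L),
      ("en", "Special:Search", 7L),
      ("en", "Main_Page", 5L),
      ("de", "Hauptseite", 3L),
      ("en", "Apache_Spark", 2L)
    ))

    // As on the slide: special pages are removed after the shuffle.
    val filterLate = parsedRDD
      .filter { case (project, _, _) => project == "en" }
      .map { case (_, page, numRequests) => (page, numRequests) }
      .reduceByKey(_ + _)
      .filter { case (page, _) => !isSpecialPage(page) }
      .collect().toMap

    // Hand-optimized: special pages are removed before the shuffle.
    val filterEarly = parsedRDD
      .filter { case (project, page, _) => project == "en" && !isSpecialPage(page) }
      .map { case (_, page, numRequests) => (page, numRequests) }
      .reduceByKey(_ + _)
      .collect().toMap

    spark.stop()
    (filterLate, filterEarly)
  }
}
```

Both orderings return the same counts; the difference is only in how much data reaches reduceByKey, and with RDDs that reordering is left to the programmer.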

Structure in Spark
DataFrames & Datasets APIs

Background: What is in an RDD?
• Dependencies
• Partitions (with optional locality info)
• Compute function: Partition => Iterator[T]
Opaque Computation & Opaque Data

Structured APIs In Spark
                   SQL        DataFrames      Datasets
Syntax Errors      Runtime    Compile Time    Compile Time
Analysis Errors    Runtime    Runtime         Compile Time
Analysis errors are reported before a distributed job starts
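
The last point can be seen with a DataFrame that refers to a column that does not exist. This is a minimal sketch under stated assumptions: a local Spark master, an invented one-row DataFrame, and a deliberately misspelled column name. The error surfaces when the query is defined, without any action being called.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Try}

object AnalysisErrorSketch {
  def run(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("analysis-error-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up single row with the column names used in the talk.
    val df = Seq(("en", "Main_Page", 10L)).toDF("project", "page", "numRequests")

    // "numRequest" is a deliberate typo. No action (show, collect, ...) is
    // called here; the failure comes from analysis of the logical plan.
    val attempt = Try(df.select("numRequest"))

    val failedInAnalysis = attempt match {
      case Failure(_: AnalysisException) => true
      case _                             => false
    }

    spark.stop()
    failedInAnalysis
  }
}
```

With a typed Dataset, the equivalent mistake on a field name inside a lambda would be rejected by the Scala compiler instead.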

Unification of APIs in Spark 2.0

DataFrame API code.
import org.apache.spark.sql.functions.sum
import spark.implicits._  // enables toDF and the $"col" syntax
// convert RDD -> DataFrame with column names
val df = parsedRDD.toDF("project", "page", "numRequests")
// filter, groupBy, then sum via agg()
df.filter($"project" === "en").
  groupBy($"page").
  agg(sum($"numRequests").as("count")).
  limit(100).
  show(100)

Take DataFrame → SQL Table → Query
df.createOrReplaceTempView("edits")
val results = spark.sql("""SELECT page, sum(numRequests)
  AS count FROM edits WHERE project = 'en' GROUP BY page
  LIMIT 100""")
results.show(100)

Using Catalyst in Spark SQL
[Diagram: SQL AST / DataFrame / Datasets → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]
Analysis: analyzing a logical plan to resolve references
Logical Optimization: logical plan optimization
Physical Planning: generating physical plans and selecting one with the cost model
Code Generation: compile parts of the query to Java bytecode

DataFrame Optimization
users.join(events, users("id") === events("uid")).
  filter(events("date") > "2015-01-01")
[Diagram: three query plans for the code above]
Logical Plan: filter over a join of scan (users table) and scan (events file)
Physical Plan: join of scan (users) with a filter over scan (events)
Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) and optimized scan (users)
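
The plans Catalyst produces for a query like this can be printed with explain(true). The sketch below is an illustration, not the talk's demo: the two tiny in-memory DataFrames and their column values are invented stand-ins for the users table and events file, and a local Spark master is assumed. The exact plan text varies by Spark version, so none is reproduced here.

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  def run(): String = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explain-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for the users table and events file.
    val users = Seq((1, "ann"), (2, "bob")).toDF("id", "name")
    val events = Seq((1, "2015-03-01"), (2, "2014-12-31")).toDF("uid", "date")

    val joined = users.join(events, users("id") === events("uid")).
      filter(events("date") > "2015-01-01")

    // Prints the parsed, analyzed and optimized logical plans,
    // followed by the physical plan.
    joined.explain(true)

    // The optimized logical plan is also available programmatically.
    val optimized = joined.queryExecution.optimizedPlan.toString

    spark.stop()
    optimized
  }
}
```

Comparing the analyzed and optimized plans in the printed output is a quick way to check where the date filter ends up relative to the join.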

Dataset API in Spark 2.x
Type-safe: operate on domain objects with compiled lambda functions
import org.apache.spark.sql.Dataset
import spark.implicits._  // provides the encoder needed by as[Person]
val df = spark.read.json("people.json")
// Convert data to domain objects.
// age is Long because the JSON reader infers integer fields as Long
case class Person(name: String, age: Long)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(p => p.age > 30)
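
A variant that does not depend on a people.json file may be easier to try. This is a minimal sketch with made-up Person values and a local Spark master; the object name is hypothetical. It builds the Dataset from in-memory objects and filters it with an ordinary Scala lambda on the domain type.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSketch {
  // Defined at object level so Spark can derive an encoder for it.
  case class Person(name: String, age: Long)

  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented sample data.
    val ds: Dataset[Person] = Seq(Person("Ann", 41L), Person("Bob", 25L)).toDS()

    // p is a Person, so a misspelled field such as p.agee would be a
    // compile-time error rather than a runtime one.
    val names = ds.filter(p => p.age > 30).map(p => p.name).collect().toSeq

    spark.stop()
    names
  }
}
```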

DataFrames are Faster than RDDs

Use DataFrames & Datasets
Why:
• High-level APIs and DSL
• Strong type-safety
• Ease-of-use & readability
• What-to-do
When:
• Structured data schema
• Code optimization & performance
• Space efficiency with Tungsten

Foundational Spark 2.x Components
[Diagram: Spark Core (RDD) at the base; Spark SQL (Catalyst, with the DataFrame/Dataset and SQL APIs) above it; ML Pipelines, Structured Streaming and GraphFrames on top; data sources include { JSON }, JDBC and more…]

Putting It All Together: Conclusion
Source: michaelmalak

https://spark-summit.org/eu-2017/

Resources
• Getting Started Guide with Apache Spark on Databricks
• docs.databricks.com
• Spark Programming Guide
• https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
• https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
• https://github.com/bmc/rdds-dataframes-datasets-presentation-2016
• Databricks Engineering Blogs

Do you have any questions for my prepared answers?

Thank You
