© 2016 Mesosphere, Inc. All Rights Reserved.
SPARK DATAFRAMES FOR
DATA MUNGING
Susan X. Huynh, Scala by the Bay, Nov. 2016
OUTLINE
Motivation
Spark DataFrame API
Demo
Beyond Data Munging
MOTIVATION
• Your job: analyze 100 GB of log data:
{"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon is good","source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,
"in_reply_to_user_id":1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id":4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http://666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,
"followers_count":2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30 +0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000",
"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,
WHAT DO YOU MEAN BY “ANALYZE”?
AKA data munging, ETL, data cleaning. Acronym: PETS (or PEST? :)
Parse
Explore
Transform
Summarize
Data pipeline
Motivation
"source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"…
BEST TOOL FOR THE JOB?
DataFrame: Pandas (Python), R
Big data + SQL: Hive, Impala
DataFrame + Big data / SQL: Spark DataFrame
Motivation
https://flic.kr/p/fnCVbL
WHY SPARK?
Open source
Scalable
Fast ad-hoc queries
Motivation
WHY SPARK DATAFRAME?
Parse: Easy to read structured, semi-structured (JSON) formats
Explore: DataFrame
Transform / Summarize:
SQL queries + procedural processing
Utilities for math, string, date / time manipulation
Scala
Motivation
PARSE: READING JSON DATA
> spark
res4: org.apache.spark.sql.SparkSession@3fc09112
> val df = spark.read.json("/path/to/mydata.json")
df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]
DataFrame: a table with rows and columns (fields)
Spark DataFrame API
"source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"…
EXPLORE
> df.printSchema() // lists the columns in a DataFrame
root
|-- contributors: string (nullable = true)
|-- coordinates: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
|-- created_at: string (nullable = true)
|-- delete: struct (nullable = true)
| |-- status: struct (nullable = true)
| | |-- id: long (nullable = true)
|-- lang: string (nullable = true)
…
Spark DataFrame API
EXPLORE (CONT’D)
> df.filter(col("coordinates").isNotNull) // keeps rows matching the condition
  .select("coordinates", "created_at")    // keeps only the named columns
  .show()
+------------------------------------------------+------------------------------+
|coordinates |created_at |
+------------------------------------------------+------------------------------+
|[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
|[WrappedArray(-43.301755, -22.990065),Point] |Thu Sep 15 02:00:03 +0000 2016|
|[WrappedArray(100.3833729, 6.13822131),Point] |Thu Sep 15 02:00:30 +0000 2016|
|[WrappedArray(-122.286, 47.5592),Point] |Thu Sep 15 02:00:38 +0000 2016|
|[WrappedArray(110.823004, -6.80342),Point] |Thu Sep 15 02:00:42 +0000 2016|
Other DataFrame ops: count(), describe(), create new columns, …
Spark DataFrame API
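A few of these other ops, as a minimal sketch (assumes the `df` from the earlier slides; `text_length` is a hypothetical column name chosen here for illustration):

```scala
import org.apache.spark.sql.functions._

df.count()           // total number of rows (tweets)
df.describe().show() // summary statistics for the numeric columns

// Create a new column derived from an existing one:
val withLen = df.withColumn("text_length", length(col("text")))
```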
TRANSFORM / SUMMARIZE: SQL QUERIES + PROCEDURAL PROCESSING
> val langCount = df.select("lang")
  .where(col("lang").isNotNull)
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
+----+-----+
|lang|count|
+----+-----+
| en|61644|
| es|22937|
| pt|21610|
| ja|19160|
| und|10376|
Also: joins
> val result = langCount.map { row: Row => … } // or flatMap, filter, …
Spark DataFrame API
SQL
PROCEDURAL
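A join might look like the sketch below; `langNames` is a hypothetical lookup table invented for illustration, and `langCount` is the DataFrame from this slide:

```scala
import spark.implicits._ // enables .toDF on local Seqs

// Hypothetical lookup table: language code -> display name.
val langNames = Seq(("en", "English"), ("es", "Spanish"), ("pt", "Portuguese"))
  .toDF("lang", "language_name")

// Left outer join on the shared "lang" column; unmatched codes get a null name.
val labeled = langCount.join(langNames, Seq("lang"), "left_outer")
```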
MATH, STRING, DATE / TIME FUNCTIONS
> df.select("created_at")
  .withColumn("day_of_week", col("created_at").substr(1, 3)) // positions are 1-based
.show()
+--------------------+-----------+
| created_at|day_of_week|
+--------------------+-----------+
| null| null|
|Thu Sep 15 01:59:...| Thu|
|Thu Sep 15 01:59:...| Thu|
|Thu Sep 15 01:59:...| Thu|
| null| null|
|Thu Sep 15 01:59:...| Thu|
Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split, trim, upper, current_timestamp, datediff, from_unixtime, …
Spark DataFrame API
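For instance, a few of these utilities combined in one select (a sketch; assumes `df` from earlier and that `text` and `user.followers_count` exist in the Tweet schema, as in the sample record):

```scala
import org.apache.spark.sql.functions._

df.select(
  lower(col("lang")).as("lang_lower"),                     // string: lowercase
  regexp_extract(col("text"), "@(\\w+)", 1).as("mention"), // string: first @mention
  round(col("user.followers_count") / 1000.0, 1).as("followers_k") // math: thousands
).show()
```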
DEMO
Spark 2.0
Zeppelin notebook 0.6.1
8 GB JSON-formatted public Tweet data
BEYOND DATA MUNGING
Machine learning
Data pipeline in production
Streaming data
BEYOND DATA MUNGING
Machine learning => DataFrame-based ML API
Data pipeline in production => Dataset API, with type safety
Streaming data => Structured Streaming API, based on DataFrame
Spark 2.0
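The Dataset API layers compile-time type safety on top of DataFrame. A minimal sketch, where `Tweet` is a hypothetical two-field subset of the real schema:

```scala
case class Tweet(lang: String, text: String)

import spark.implicits._ // provides the Encoder for Tweet

// Typed view: field names and types are checked at compile time.
val tweets = df.select("lang", "text").as[Tweet]
tweets.filter(_.lang == "en").count()
```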
RECAP
Spark DataFrames combine the “data frame” abstraction with Big Data and SQL
Spark DataFrames simplify data munging tasks (“PETS”):
Parse => structured and semi-structured formats (JSON)
Explore => DataFrame: printSchema, filter by row / column, show
Transform,
Summarize => SQL + procedural processing, math / string / date-time utility functions
All in Scala
REFERENCES
Spark SQL and DataFrames Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html
Spark DataFrame API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
Overview of Spark DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html
DataFrame Internals: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
THANK YOU!
