© 2016 Mesosphere, Inc. All Rights Reserved.
SPARK DATAFRAMES FOR DATA MUNGING
Susan X. Huynh, Scala by the Bay, Nov. 2016
OUTLINE
Motivation
Spark DataFrame API
Demo
Beyond Data Munging
MOTIVATION
• Your job: analyze 100 GB of log data:
{"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon
is good","source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c/a\u003e
","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":
1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id":
4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http://
666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,"followers_count":
2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30
+0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US &
Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000"
,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/
bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/
bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_
fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,
WHAT DO YOU MEAN BY “ANALYZE”?
Also known as: data munging, ETL, data cleaning. Acronym: PETS (or PEST? :)
Parse
Explore
Transform
Summarize
Data pipeline
"source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"…
BEST TOOL FOR THE JOB?
DataFrame: Pandas (Python), R
Big data + SQL: Hive, Impala
DataFrame + Big data / SQL: Spark DataFrame
https://flic.kr/p/fnCVbL
WHY SPARK?
Open source
Scalable
Fast ad-hoc queries
WHY SPARK DATAFRAME?
Parse: easy to read structured and semi-structured (JSON) formats
Explore: DataFrame
Transform / Summarize:
SQL queries + procedural processing
Utilities for math, string, date / time manipulation
Scala
PARSE: READING JSON DATA
> spark
res4: org.apache.spark.sql.SparkSession@3fc09112
> val df = spark.read.json("/path/to/mydata.json")
df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]
DataFrame: a table with rows and columns (fields)
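The read step above can be tried without a data file. The sketch below is a minimal illustration, not the talk's demo code: the two sample records, the object name `ReadJsonSketch`, and the `local[*]` master are assumptions made here, and passing a `Dataset[String]` to `spark.read.json` assumes Spark 2.2 or later (the slide's path-based call works the same way).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonSketch {
  // Two hypothetical tweet-like records, one JSON object per line.
  val sampleLines: Seq[String] = Seq(
    """{"created_at":"Thu Sep 15 02:00:00 +0000 2016","lang":"en","user":{"screen_name":"wxyz1234"}}""",
    """{"created_at":"Thu Sep 15 02:00:03 +0000 2016","lang":"es","user":{"screen_name":"abcd4321"}}"""
  )

  // Parses the sample lines; Spark infers the schema from the JSON itself.
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._ // provides toDS() on a local Seq
    spark.read.json(sampleLines.toDS())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for experimentation only
      .appName("read-json-sketch")
      .getOrCreate()
    val df = load(spark)
    df.printSchema()    // nested "user" object becomes a struct column
    println(df.count()) // number of parsed records
    spark.stop()
  }
}
```

No schema is declared anywhere: the nested `user` object is inferred as a struct, which is what makes semi-structured input easy to parse.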
"source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"…
EXPLORE
> df.printSchema() // lists the columns in a DataFrame
root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- delete: struct (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
…
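Nested fields in a schema like the one printed above can be addressed with dotted column names. The following sketch assumes two made-up records shaped like the `coordinates` struct in that schema; `NestedFieldsSketch`, `geo_type`, and `first_coord` are names invented for illustration, and the `Dataset[String]` input assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NestedFieldsSketch {
  // Hypothetical records: one with a coordinates struct, one without.
  val sampleLines: Seq[String] = Seq(
    """{"coordinates":{"coordinates":[-122.286,47.5592],"type":"Point"},"lang":"en"}""",
    """{"lang":"ja"}"""
  )

  // Returns (struct field "type", first array element) for rows that have coordinates.
  def pointInfo(spark: SparkSession): Seq[(String, Double)] = {
    import spark.implicits._
    val df = spark.read.json(sampleLines.toDS())
    df.filter(col("coordinates").isNotNull)
      .select(
        col("coordinates.type").as("geo_type"),                      // struct field via dotted path
        col("coordinates.coordinates").getItem(0).as("first_coord")  // array element, 0-based index
      )
      .as[(String, Double)]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nested-fields-sketch").getOrCreate()
    pointInfo(spark).foreach(println)
    spark.stop()
  }
}
```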
EXPLORE (CONT’D)
> import org.apache.spark.sql.functions.col
> df.filter(col("coordinates").isNotNull) // keeps only rows matching the condition
.select("coordinates", "created_at") // keeps only the listed columns
.show()
+------------------------------------------------+------------------------------+
|coordinates |created_at |
+------------------------------------------------+------------------------------+
|[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
|[WrappedArray(-43.301755, -22.990065),Point] |Thu Sep 15 02:00:03 +0000 2016|
|[WrappedArray(100.3833729, 6.13822131),Point] |Thu Sep 15 02:00:30 +0000 2016|
|[WrappedArray(-122.286, 47.5592),Point] |Thu Sep 15 02:00:38 +0000 2016|
|[WrappedArray(110.823004, -6.80342),Point] |Thu Sep 15 02:00:42 +0000 2016|
Other DataFrame ops: count(), describe(), create new columns, …
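The other operations listed above can be sketched on a tiny in-memory table. The data, the column names `text_length` and `is_long`, and the object name `ExploreOpsSketch` are assumptions made for this example; they do not come from the tweet data set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ExploreOpsSketch {
  // Returns the row count and the column names after adding a derived column.
  def run(spark: SparkSession): (Long, Seq[String]) = {
    import spark.implicits._
    // Hypothetical data: language code and text length per record.
    val df = Seq(("en", 140), ("es", 80), ("en", 20)).toDF("lang", "text_length")

    val n = df.count()                  // number of rows
    df.describe("text_length").show()   // count, mean, stddev, min, max for the column

    // Create a new column from an expression over an existing one.
    val withFlag = df.withColumn("is_long", col("text_length") > 100)
    (n, withFlag.columns.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explore-ops-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```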
TRANSFORM/SUMMARIZE: SQL QUERIES + PROCEDURAL PROCESSING
> val langCount = df.select("lang") // SQL-style query
.where(col("lang").isNotNull)
.groupBy("lang")
.count()
.orderBy(col("count").desc)
> langCount.show()
+----+-----+
|lang|count|
+----+-----+
| en|61644|
| es|22937|
| pt|21610|
| ja|19160|
| und|10376|
Also: joins
> val result = langCount.map{row:Row => …} // procedural: or flatMap, filter, …
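As a rough illustration of the procedural side and of joins, the sketch below maps over the rows of a small `langCount` table and joins it against a lookup table. The two counts are copied from the output above; the `names` lookup table, the string format produced by `map`, and the object name `ProceduralSketch` are assumptions, since the slide leaves the function body elided.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ProceduralSketch {
  // Returns (labels built with map, language names found via join), both sorted.
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._ // supplies the Encoder that map needs for its result type

    val langCount = Seq(("en", 61644L), ("es", 22937L)).toDF("lang", "count")

    // Procedural step: an arbitrary Scala function applied to every Row.
    val labels = langCount.map { row: Row => s"${row.getString(0)}=${row.getLong(1)}" }

    // Join against a hypothetical lookup table on the shared "lang" column.
    val names = Seq(("en", "English"), ("es", "Spanish")).toDF("lang", "name")
    val joined = langCount.join(names, "lang")

    (labels.collect().toSeq.sorted, joined.select("name").as[String].collect().toSeq.sorted)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("procedural-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```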
MATH, STRING, DATE / TIME FUNCTIONS
> df.select("created_at")
.withColumn("day_of_week", col("created_at").substr(1, 3)) // string positions are 1-based
.show()
+--------------------+-----------+
| created_at|day_of_week|
+--------------------+-----------+
| null| null|
|Thu Sep 15 01:59:...| Thu|
|Thu Sep 15 01:59:...| Thu|
|Thu Sep 15 01:59:...| Thu|
| null| null|
|Thu Sep 15 01:59:...| Thu|
Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split,
trim, upper, current_timestamp, datediff, from_unixtime, …
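A few of the string functions listed above, applied to a `created_at` value in the format shown on the slides. This is only a sketch: the single sample row, the output column names `month` and `hour`, and the object name `StringFunctionsSketch` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, regexp_extract, split}

object StringFunctionsSketch {
  // Extracts a lower-cased month token and the hour from the timestamp string.
  def run(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._
    val df = Seq("Thu Sep 15 02:00:38 +0000 2016").toDF("created_at")

    df.select(
      // split on spaces, take the second token (0-based index 1), lower-case it
      lower(split(col("created_at"), " ").getItem(1)).as("month"),
      // capture group 1 of the first HH:mm:ss match is the hour
      regexp_extract(col("created_at"), "(\\d{2}):\\d{2}:\\d{2}", 1).as("hour")
    ).as[(String, String)].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("string-functions-sketch").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```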
DEMO
Spark 2.0
Zeppelin notebook 0.6.1
8 GB JSON-formatted public Tweet data
BEYOND DATA MUNGING
Machine learning => DataFrame-based ML API
Data pipeline in production => Dataset API, with type safety
Streaming data => Structured Streaming API, based on DataFrame
Spark 2.0
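The type safety mentioned for the Dataset API can be sketched as follows. The case class `LangCount`, the threshold, and the object name `TypedDatasetSketch` are assumptions introduced for this example; the two counts are taken from the language-count output earlier in the deck.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; field names must match the DataFrame's column names.
case class LangCount(lang: String, count: Long)

object TypedDatasetSketch {
  // Returns the languages whose count exceeds the given threshold.
  def frequentLangs(spark: SparkSession, threshold: Long): Seq[String] = {
    import spark.implicits._
    val ds: Dataset[LangCount] =
      Seq(("en", 61644L), ("es", 22937L)).toDF("lang", "count").as[LangCount]

    // Fields are accessed as ordinary Scala members, so a misspelled
    // field name is a compile error rather than a runtime failure.
    ds.filter(_.count > threshold).map(_.lang).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-dataset-sketch").getOrCreate()
    println(frequentLangs(spark, 30000L))
    spark.stop()
  }
}
```

With an untyped DataFrame the same filter would be written against a column name in a string, and a typo would only surface when the query runs.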
RECAP
Spark DataFrames combine the “data frame” abstraction with Big Data and SQL
Spark DataFrames simplify data munging tasks (“PETS”):
Parse => structured and semi-structured formats (JSON)
Explore => DataFrame: printSchema, filter by row / column, show
Transform, Summarize => SQL + procedural processing, math / string / date-time utility functions
All in Scala
REFERENCES
Spark SQL and DataFrames Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html
Spark DataFrame API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
Overview of Spark DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html
DataFrame Internals: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
THANK YOU!

Spark DataFrames for Data Munging

  • 1. © 2016 Mesosphere, Inc. All Rights Reserved. SPARK DATAFRAMES FOR DATA MUNGING 1 Susan X. Huynh, Scala by the Bay, Nov. 2016
  • 2. © 2016 Mesosphere, Inc. All Rights Reserved. OUTLINE 2 Motivation Spark DataFrame API Demo Beyond Data Munging
  • 3. © 2016 Mesosphere, Inc. All Rights Reserved. MOTIVATION 3 •Your job: Analyze 100 GB of log data: {"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon is good","source":"u003ca href="http://twitter.com/download/android" rel="nofollow"u003eTwitter for Androidu003c/a u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id": 1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id": 4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http:// 666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,"followers_count": 2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30 +0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000" ,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/ bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/ bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_ fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,
  • 4. © 2016 Mesosphere, Inc. All Rights Reserved. WHAT DO YOU MEAN BY “ANALYZE”? 4 AKA: data munging, ETL, data cleaning, acronym: PETS (or PEST? :) Parse Explore Transform Summarize Data pipeline Motivation "source":"u003ca href="http://twitter.com/download/android" rel=”nofollow”…
  • 5. © 2016 Mesosphere, Inc. All Rights Reserved. BEST TOOL FOR THE JOB? 5 DataFrame Pandas (Python) R Big data + SQL Hive, Impala DataFrame + Big data / SQL Spark DataFrame Motivation https://flic.kr/p/fnCVbL
  • 6. © 2016 Mesosphere, Inc. All Rights Reserved. WHY SPARK? 6 Open source Scalable Fast ad-hoc queries Motivation
  • 7. © 2016 Mesosphere, Inc. All Rights Reserved. WHY SPARK DATAFRAME? 7 Parse: Easy to read structured, semi-structured (JSON) formats Explore: DataFrame Transform / Summarize: SQL queries + procedural processing Utilities for math, string, date / time manipulation Scala Motivation
  • 8. PARSE: READING JSON DATA
    > spark
    res4: org.apache.spark.sql.SparkSession@3fc09112
    > val df = spark.read.json("/path/to/mydata.json")
    df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]
    DataFrame: a table with rows and columns (fields)
    "source":"\u003ca href=\"http://twitter.com/download/android\" rel=\"nofollow\"…
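To try the `spark.read.json` call from slide 8 outside the shell, something like the following sketch may help. It writes two made-up records to a temporary file and reads them back. The sample records, the temp-file name, and the `local[*]` master are illustrative assumptions, not part of the talk.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("read-json-sketch")
  .getOrCreate()

// Hypothetical sample data: one JSON object per line, the default layout spark.read.json expects.
val sample = Seq(
  """{"id": 1, "lang": "en", "user": {"screen_name": "alice"}}""",
  """{"id": 2, "lang": "es", "user": {"screen_name": "bob"}}"""
).mkString("\n")

val path = Files.createTempFile("tweets-sample", ".json")
Files.write(path, sample.getBytes(StandardCharsets.UTF_8))

// The schema (including the nested `user` struct) is inferred from the data.
val df = spark.read.json(path.toString)
df.printSchema()

// Nested fields can be addressed with dot notation.
val names = df.select("user.screen_name")
names.show()
```

Schema inference scans the input, so for very large inputs it may be worth supplying an explicit schema instead; that variation is not shown here.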
  • 9. EXPLORE
    > df.printSchema() // lists the columns in a DataFrame
    root
     |-- contributors: string (nullable = true)
     |-- coordinates: struct (nullable = true)
     |    |-- coordinates: array (nullable = true)
     |    |    |-- element: double (containsNull = true)
     |    |-- type: string (nullable = true)
     |-- created_at: string (nullable = true)
     |-- delete: struct (nullable = true)
     |    |-- status: struct (nullable = true)
     |    |    |-- id: long (nullable = true)
     |-- lang: string (nullable = true)
     …
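The schema on slide 9 contains nested structs and arrays. As a minimal sketch of how such fields might be reached, the example below builds a tiny DataFrame with the same `coordinates` shape and selects from inside it. The sample values, the output column names `geo_type` and `first_coordinate`, and the local session are assumptions for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-fields-sketch")
  .getOrCreate()

// Hypothetical records: the second one has no coordinates, so the column is null there.
val sample = Seq(
  """{"coordinates": {"coordinates": [104.5, 15.25], "type": "Point"}, "lang": "th"}""",
  """{"lang": "en"}"""
).mkString("\n")

val path = Files.createTempFile("nested-sample", ".json")
Files.write(path, sample.getBytes(StandardCharsets.UTF_8))
val df = spark.read.json(path.toString)

val points = df
  .filter(col("coordinates").isNotNull)                                   // keep rows that have a location
  .select(
    col("coordinates.type").alias("geo_type"),                            // struct field via dot notation
    col("coordinates.coordinates").getItem(0).alias("first_coordinate"),  // array element by index
    col("lang"))

points.show()
```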
  • 10. EXPLORE (CONT’D)
    > import org.apache.spark.sql.functions.col
    > df.filter(col("coordinates").isNotNull)   // filters on rows, with given condition
        .select("coordinates", "created_at")    // filters on columns
        .show()
    +------------------------------------------------+------------------------------+
    |coordinates                                     |created_at                    |
    +------------------------------------------------+------------------------------+
    |[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
    |[WrappedArray(-43.301755, -22.990065),Point]    |Thu Sep 15 02:00:03 +0000 2016|
    |[WrappedArray(100.3833729, 6.13822131),Point]   |Thu Sep 15 02:00:30 +0000 2016|
    |[WrappedArray(-122.286, 47.5592),Point]         |Thu Sep 15 02:00:38 +0000 2016|
    |[WrappedArray(110.823004, -6.80342),Point]      |Thu Sep 15 02:00:42 +0000 2016|
    Other DataFrame ops: count(), describe(), create new columns, …
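Slide 10 mentions `count()`, `describe()`, and creating new columns without showing them. A small sketch of those three operations, on made-up rows, could look like this. The column names `text`, `followers_count`, and `text_length` and all values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explore-ops-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for parsed tweets.
val tweets = Seq(
  ("en", "hello world", 10L),
  ("es", "hola", 3L),
  ("en", "ok", 7L)
).toDF("lang", "text", "followers_count")

// count(): number of rows.
val total = tweets.count()

// describe(): summary statistics (count, mean, stddev, min, max) for the given columns.
tweets.describe("followers_count").show()

// withColumn(): derive a new column from existing ones.
val withLength = tweets.withColumn("text_length", length(col("text")))
withLength.show()
```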
  • 11. TRANSFORM/SUMMARIZE: SQL QUERIES + PROCEDURAL PROC.
    SQL:
    > val langCount = df.select("lang")
        .where(col("lang").isNotNull)
        .groupBy("lang")
        .count()
        .orderBy(col("count").desc)
    +----+-----+
    |lang|count|
    +----+-----+
    |  en|61644|
    |  es|22937|
    |  pt|21610|
    |  ja|19160|
    | und|10376|
    Also: joins
    PROCEDURAL:
    > val result = langCount.map{row:Row => …} // or flatMap, filter, …
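Slide 11 names joins and row-wise `map` without a full example. The sketch below combines the two on a tiny made-up dataset: it counts languages, joins the counts against a lookup table, and then formats each row procedurally. The lookup table `langNames`, the column `language_name`, and the `"unknown"` fallback are assumptions introduced for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-map-sketch")
  .getOrCreate()
import spark.implicits._   // provides toDF and the String encoder needed by map

// Hypothetical sample: one row per tweet, language only.
val tweets = Seq("en", "es", "en", "und").toDF("lang")
val langCount = tweets.groupBy("lang").count()

// Hypothetical lookup table; "und" is deliberately missing.
val langNames = Seq(("en", "English"), ("es", "Spanish")).toDF("lang", "language_name")

// SQL-style step: a left join keeps languages that have no entry in the lookup table.
val joined = langCount
  .join(langNames, Seq("lang"), "left")
  .orderBy(col("count").desc)

// Procedural step: arbitrary Scala code applied to each Row.
val labels = joined.map { (row: Row) =>
  val name = Option(row.getAs[String]("language_name")).getOrElse("unknown")
  s"$name: ${row.getAs[Long]("count")}"
}

labels.show(false)
```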
  • 12. MATH, STRING, DATE / TIME FUNCTIONS
    > df.select("created_at")
        .withColumn("day_of_week", col("created_at").substr(0, 3))
        .show()
    +--------------------+-----------+
    |          created_at|day_of_week|
    +--------------------+-----------+
    |                null|       null|
    |Thu Sep 15 01:59:...|        Thu|
    |Thu Sep 15 01:59:...|        Thu|
    |Thu Sep 15 01:59:...|        Thu|
    |                null|       null|
    |Thu Sep 15 01:59:...|        Thu|
    Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split, trim, upper, current_timestamp, datediff, from_unixtime, …
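A few of the string functions listed on slide 12 can be combined to pull more fields out of `created_at`. The following is only a sketch on two sample timestamps in the format shown in the slides. The derived column names (`month`, `year`, `day_upper`) are hypothetical, and the example deliberately stays with string functions rather than timestamp parsing, whose pattern rules differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, split, substring, upper}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("string-functions-sketch")
  .getOrCreate()
import spark.implicits._

// Sample values in the created_at format used by the tweet data.
val tweets = Seq(
  "Thu Sep 15 01:59:58 +0000 2016",
  "Fri Jan 22 00:04:30 +0000 2016"
).toDF("created_at")

val parts = tweets
  .withColumn("day_of_week", substring(col("created_at"), 1, 3))                      // 1-based start position
  .withColumn("month", split(col("created_at"), " ").getItem(1))                      // second space-separated token
  .withColumn("year", regexp_extract(col("created_at"), "(\\d{4})$", 1).cast("int"))  // trailing four digits
  .withColumn("day_upper", upper(col("day_of_week")))

parts.show(false)
```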
  • 13. DEMO
    Spark 2.0
    Zeppelin notebook 0.6.1
    8 GB JSON-formatted public Tweet data
  • 14. BEYOND DATA MUNGING
    Machine learning
    Data pipeline in production
    Streaming data
  • 15. BEYOND DATA MUNGING
    Machine learning => DataFrame-based ML API
    Data pipeline in production => Dataset API, with type safety
    Streaming data => Structured Streaming API, based on DataFrame
    Spark 2.0
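To make the "Dataset API, with type safety" point on slide 15 concrete, here is a minimal sketch. It wraps language counts (values borrowed from slide 11) in a case class so that field access is checked at compile time. The case class `LangCount`, the 20000 threshold, and the local session are assumptions for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; field names and types are checked by the compiler.
case class LangCount(lang: String, count: Long)

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataset-sketch")
  .getOrCreate()
import spark.implicits._   // provides the encoders for LangCount and String

val counts: Dataset[LangCount] = Seq(
  LangCount("en", 61644L),
  LangCount("es", 22937L),
  LangCount("und", 10376L)
).toDS()

// Typed operations: a misspelled field such as lc.langg would fail at compile time,
// whereas col("langg") on a DataFrame would only fail when the query is analyzed at run time.
val frequent: Dataset[String] = counts
  .filter((lc: LangCount) => lc.count > 20000L)
  .map((lc: LangCount) => lc.lang)

frequent.show()
```

An existing DataFrame with matching column names and types can be converted the same way with `df.as[LangCount]`.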
  • 16. RECAP
    Spark DataFrames combine the “data frame” abstraction with Big Data and SQL
    Spark DataFrames simplify data munging tasks (“PETS”):
      Parse => structured and semi-structured formats (JSON)
      Explore => DataFrame: printSchema, filter by row / column, show
      Transform, Summarize => SQL + procedural processing, math / string / date-time utility functions
    All in Scala
  • 17. REFERENCES
    Spark SQL and DataFrames Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html
    Spark DataFrame API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
    Overview of Spark DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html
    DataFrame Internals: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
  • 18. THANK YOU!