
Jump Start on Apache Spark 2.2 with Databricks

406 views

Published on

Apache Spark 2.0 and the subsequent 2.1 and 2.2 releases have laid the foundation for many new features and functionality. Their three main themes (easier, faster, and smarter) are pervasive in the unified and simplified high-level APIs for structured data.

In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:

Agenda:

• Overview of Spark Fundamentals & Architecture

• What’s new in Spark 2.x

• Unified APIs: SparkSessions, SQL, DataFrames, Datasets

• Introduction to DataFrames, Datasets and Spark SQL

• Introduction to Structured Streaming Concepts

• Four Hands-On Labs

Published in: Software

Jump Start on Apache Spark 2.2 with Databricks

  1. 1. Jump Start on Apache® Spark™ 2.2 with Databricks Jules S. Damji Apache Spark Community Evangelist Spark Saturday DC, Nov 11, 2017 Big Data Madison @ Madison College
  2. 2. I have used Apache Spark Before…
  3. 3. I know the difference between DataFrame and RDDs…
  4. 4. Spark Community Evangelist & Developer Advocate @ Databricks Developer Advocate @ Hortonworks Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest https://www.linkedin.com/in/dmatrix @2twitme
  5. 5. Agenda for the day. Morning: • Get to know Databricks • Overview of Spark Fundamentals & Architecture • What's New in Spark 2.x • Break • Unified APIs: SparkSessions, SQL, DataFrames, Datasets… • Workshop Notebook [0,1] • Lunch. Afternoon: • Introduction to DataFrames, Datasets and Spark SQL • Workshop Notebook 2 • Break • Introduction to Structured Streaming Concepts • Workshop Notebook 3 • Go Home… Watch the Game!
  6. 6. Get to know Databricks 1. Get http://databricks.com/try-databricks 2. https://github.com/dmatrix/spark-saturday 3. [OR] ImportNotebook: http://dbricks.co/ss_wkshp0
  7. 7. Why Apache Spark?
  8. 8. Big Data Systems of Yesterday… MapReduce/Hadoop: general batch processing. Drill, Storm, Pregel, Giraph, Dremel, Mahout, Impala, . . . : specialized systems for new workloads, hard to combine in pipelines.
  9. 9. Big Data Systems Today: MapReduce (general batch processing) plus specialized systems for new workloads (Pregel, Dremel, Millwheel, Drill, Giraph, Impala, Storm, S4, . . .), or a unified engine?
  10. 10. Apache Spark Philosophy: a unified engine for complete data applications, with high-level, user-friendly APIs: SQL, Streaming, ML, Graph, … DL Applications
  11. 11. An Analogy …. New applications
  12. 12. Unified engine across diverse workloads & environments
  13. 13. About Databricks. TEAM: started the Spark project (now Apache Spark) at UC Berkeley in 2009. PRODUCT: Unified Analytics Platform. MISSION: Making Big Data Simple.
  14. 14. Accelerate innovation by unifying data science, engineering and business. Unified Analytics Platform UNIFIED INFRASTRUCTURE UNIFIED EXPERIENCE ACROSS TEAMS UNIFIED ANALYTIC WORKFLOWS
  15. 15. The Unified Analytics Platform
  16. 16. Apache Spark Architecture
  17. 17. Apache Spark Architecture: Deployment Modes • Local • Standalone • YARN • Mesos
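The mode is chosen through the master URL. Below is a minimal Scala sketch (not from the deck) of the idea; the cluster host names are placeholders:

    import org.apache.spark.sql.SparkSession

    // Local mode: driver and executors run in a single JVM, using all available cores.
    val spark = SparkSession.builder()
      .appName("jump-start-local")
      .master("local[*]")
      .getOrCreate()

    // For the cluster modes only the master URL changes, and it is usually
    // supplied via spark-submit --master rather than hard-coded:
    //   Standalone: spark://<master-host>:7077
    //   YARN:       yarn
    //   Mesos:      mesos://<master-host>:5050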
  18. 18. Local Mode in Databricks [diagram: each student notebook attaches to its own container on an EC2 machine, where a single JVM runs driver + executor together]
  19. 19. Standalone Mode [diagram: each worker is a 30 GB container running a 22 GB JVM with executor (Ex.) slots (S); one worker also hosts the driver (Dr)]
  20. 20. Spark Deployment Modes
  21. 21. Apache Spark Architecture: An Anatomy of an Application. Spark Application • Jobs • Stages • Tasks
  22. 22. A Spark Executor [diagram: a JVM inside a container with task slots (S), running tasks (T) against cached DF/RDD partitions (*)]
  23. 23. Resilient Distributed Dataset (RDD)
  24. 24. What are RDDs?
  25. 25. A Resilient Distributed Dataset (RDD) 1. Distributed Data Abstraction Logical Model Across Distributed Storage S3 or HDFS
  26. 26. 2. Resilient & Immutable: RDD → T → RDD → T → RDD (T = Transformation)
  27. 27. 3. Compile-time Type-safe Integer RDD String or Text RDD Double or Binary RDD
  28. 28. 4. Unstructured/Structured Data: Text (logs, tweets, articles, social)
  29. 29. Structured Tabular data..
  30. 30. 5. Lazy: RDD → T → RDD → T → RDD → A (T = Transformation, A = Action); nothing executes until an action is invoked.
  31. 31. 2 kinds of Actions: those that return a value to the driver (collect, count, reduce, take, show, …) and those that write out to storage (saveAsTextFile to HDFS, S3, SQL, NoSQL, etc.)
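To make the transformation/action split concrete, here is a minimal Scala sketch (not from the deck; the paths are placeholders, and sc is the SparkContext available in a Spark shell or via spark.sparkContext):

    val lines  = sc.textFile("hdfs:///logs/app.log")      // RDD created lazily, nothing is read yet
    val errors = lines.filter(_.contains("ERROR"))        // T: transformation, still lazy
    val n      = errors.count()                           // A: action returning a value to the driver
    errors.saveAsTextFile("hdfs:///logs/errors")          // A: action writing out to storage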
  32. 32. How did we get here…? Where we going...?
  33. 33. A Brief History: 2010 – started @ UC Berkeley; 2013 – Databricks started & Spark donated to the ASF; 2014 – Spark 1.0 & libraries (SQL, ML, GraphX); 2015 – DataFrames/Datasets, Tungsten, Catalyst Optimizer, ML Pipelines; 2016-17 – Apache Spark 2.0, 2.1, 2.2, Structured Streaming, Cost Based Optimizer, Deep Learning Pipelines. Easier, Smarter, Faster.
  34. 34. Apache Spark 2.X • Steps to Bigger & Better Things…. Builds on all we learned in past 2 years
  35. 35. Major Themes in Apache Spark 2.x. Faster: Tungsten Phase 2 (speedups of 5-10x) & Catalyst Optimizer. Smarter: Structured Streaming, a real-time engine on SQL / DataFrames. Easier: unifying Datasets and DataFrames, & SparkSessions.
  36. 36. Unified API Foundation for the Future: SparkSessions, DataFrame, Dataset, MLlib, Structured Streaming
  37. 37. SparkSession – A Unified entry point to Spark • Conduit to Spark – Creates Datasets/DataFrames – Reads/writes data – Works with metadata – Sets/gets Spark Configuration – Driver uses for Cluster resource management
  38. 38. SparkSession vs SparkContext: SparkSession subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
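A minimal sketch of the unified entry point (the file name and config value are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("jump-start")
      .config("spark.sql.shuffle.partitions", "8")   // sets Spark configuration
      .getOrCreate()

    val sc = spark.sparkContext                      // the old SparkContext is still reachable
    val df = spark.read.json("people.json")          // reads data as a DataFrame
    spark.catalog.listTables().show()                // works with metadata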
  39. 39. SparkSession – A Unified entry point to Spark
  40. 40. DataFrame & Dataset Structure
  41. 41. Long Term • RDD as the low-level API in Spark • For control and certain type-safety in Java/Scala • Datasets & DataFrames give richer semantics & optimizations • For semi-structured data and DSL like operations • New libraries will increasingly use these as interchange format • Examples: Structured Streaming, MLlib, GraphFrames, and Deep Learning Pipelines
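The three layers interoperate, so you can drop to RDDs for control and come back to the structured APIs for the optimizations; a minimal sketch (the Person case class and file name are illustrative):

    import org.apache.spark.sql.{DataFrame, Dataset}
    import spark.implicits._

    case class Person(name: String, age: Long)

    val df: DataFrame       = spark.read.json("people.json")   // untyped, Row-based
    val ds: Dataset[Person] = df.as[Person]                     // typed view over the same data
    val rdd                 = ds.rdd                            // down to the low-level API
    val back                = rdd.toDF()                        // and back up again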
  42. 42. Spark 1.6 vs Spark 2.x
  43. 43. Spark 1.6 vs Spark 2.x
  44. 44. Towards SQL 2003 • Today, Spark can run all 99 TPC-DS queries! - New standard compliant parser (with good error messages!) - Subqueries (correlated & uncorrelated) - Approximate aggregate stats - https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
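For instance, an uncorrelated scalar subquery now runs straight through spark.sql; a minimal sketch, assuming a peopleDF DataFrame with name and age columns:

    peopleDF.createOrReplaceTempView("people")

    val olderThanAverage = spark.sql("""
      SELECT name, age
      FROM people
      WHERE age > (SELECT avg(age) FROM people)
    """)
    olderThanAverage.show()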
  45. 45. [chart: Preliminary TPC-DS runtime in seconds, Spark 2.0 vs 1.6 – lower is better; Time (1.6) vs Time (2.0)]
  46. 46. Other notable API improvements • DataFrame-based ML pipeline API becoming the main MLlib API • ML model & pipeline persistence with almost complete coverage • In all programming languages: Scala, Java, Python, R • Improved R support • (Parallelizable) User-defined functions in R • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means • Structured Streaming Features & Production Readiness • https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html
  47. 47. Workshop: Notebook on RDD & SparkSession • Import Notebook into your Spark 2.2 Cluster – http://dbricks.co/ss_wkshp0 – http://dbricks.co/ss_wkshp1 – http://docs.databricks.com – http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession • Familiarize yourself with the Databricks Notebook environment • Work through each cell • Ctrl + <return> / Shift + Return • Try challenges • Break…
  48. 48. DataFrames/Dataset, Spark SQL & Catalyst Optimizer
  49. 49. The not so secret truth… Spark SQL is not about SQL; it is about more than SQL.
  50. 50. Spark SQL: the whole story. It is about creating and running Spark programs faster: • Write less code • Read less data • Do less work (the optimizer does the hard work)
  51. 51. Spark SQL Architecture [diagram: SQL, DataFrames and Datasets sit on top of the Catalyst layer (logical plan, optimizer, physical plan, catalog, code generator), which executes over RDDs and the Data Source API]
  52. 52. Using Catalyst in Spark SQL [diagram: SQL AST / DataFrame / Datasets → Unresolved Logical Plan → (Analysis, with Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning, Cost Model) → Physical Plans → Selected Physical Plan → (Code Generation) → RDDs] Analysis: analyzing a logical plan to resolve references. Logical Optimization: logical plan optimization. Physical Planning: physical planning. Code Generation: compile parts of the query to Java bytecode.
  53. 53. Catalyst Optimizations (logical and physical) • Catalyst compiles operations into a physical plan for execution and generates JVM bytecode • Intelligently chooses between broadcast joins and shuffle joins to reduce network traffic • Lower level optimizations: eliminate expensive object allocations and reduce virtual function calls • Push filter predicates down to the data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons into cheaper integer comparisons via dictionary encoding • RDBMS: reduce the amount of data traffic by pushing down predicates
  54. 54. DataFrame Optimization: users.join(events, users("id") === events("uid")).filter(events("date") > "2015-01-01") [diagram: the logical plan (scan users table, scan events file, join, then filter) becomes a physical plan with predicate pushdown and column pruning, where optimized scans of users and events feed directly into the join]
  55. 55. Columns: Predicate pushdown. You write: spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "people") .load() .where($"name" === "michael"). Spark translates it, for Postgres, into: SELECT * FROM people WHERE name = 'michael'
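You can verify what is pushed down by looking at the physical plan; a minimal sketch (for sources that support it, the pushed filters typically appear in the scan node of the explain output):

    import spark.implicits._

    val peopleDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql:dbserver")
      .option("dbtable", "people")
      .load()

    peopleDF.where($"name" === "michael").explain()
    // The scan node of the physical plan should list the pushed-down
    // predicate, e.g. PushedFilters: [EqualTo(name,michael)]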
  56. 56. Foundational Spark 2.x Components [diagram: Spark Core (RDD) under Catalyst and Spark SQL, exposing the DataFrame/Dataset and SQL APIs; built on top: ML Pipelines, Structured Streaming, GraphFrames, DL Pipelines, TensorFrames; data sources include { JSON }, JDBC, and more…]
  57. 57. http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  58. 58. DataFrames & Datasets Spark 2.x APIs
  59. 59. Background: What is in an RDD? • Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T]. Opaque computation & opaque data.
  60. 60. Structured APIs In Spark: when errors are caught. SQL: syntax errors at runtime, analysis errors at runtime. DataFrames: syntax errors at compile time, analysis errors at runtime. Datasets: syntax errors at compile time, analysis errors at compile time. Analysis errors are reported before a distributed job starts.
  61. 61. Unification of APIs in Spark 2.0
  62. 62. DataFrame API code. // convert RDD -> DF with column names val df = parsedRDD.toDF("project", "page", "numRequests") //filter, groupBy, sum, and then agg() df.filter($"project" === "en"). groupBy($"page"). agg(sum($"numRequests").as("count")). limit(100). show(100) project page numRequests en 23 45 en 24 200
  63. 63. Take DataFrame → SQL Table → Query: df.createOrReplaceTempView("edits") val results = spark.sql("""SELECT page, sum(numRequests) AS count FROM edits WHERE project = 'en' GROUP BY page LIMIT 100""") results.show(100) project page numRequests en 23 45 en 24 200
  64. 64. Easy to write code... Believe it! from pyspark.sql.functions import avg dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)]) dataDF = dataRDD.toDF(["name", "age"]) # Using RDD code to compute aggregate average (dataRDD.map(lambda (x,y): (x, (y,1))) .reduceByKey(lambda x,y: (x[0] +y[0], x[1] +y[1])) .map(lambda (x, (y, z)): (x, y / z))) # Using DataFrame dataDF.groupBy("name").agg(avg("age")) name age Jim 20 Ann 31 Jim 30
  65. 65. Why structured APIs? RDD: data.map { case (dept, age) => dept -> (age, 1) } .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2)} .map { case (dept, (age, c)) => dept -> age / c }. SQL: select dept, avg(age) from data group by 1. DataFrame: data.groupBy("dept").avg("age")
  66. 66. Dataset API in Spark 2.x: type-safe, operate on domain objects with compiled lambda functions. val df = spark.read.json("people.json") // Convert data to domain objects. case class Person(name: String, age: Int) val ds: Dataset[Person] = df.as[Person] val filterDS = ds.filter(p => p.age > 30)
  67. 67. Datasets: Lightning-fast Serialization with Encoders
  68. 68. DataFrames are Faster than RDDs
  69. 69. Datasets < Memory RDDs
  70. 70. Why & When: DataFrames & Datasets • Structured data schema • Code optimization & performance • Space efficiency with Tungsten • High-level APIs and DSL • Strong type-safety • Ease-of-use & readability • What-to-do
  71. 71. Source: michaelmalak
  72. 72. BLOG: http://dbricks.co/3-apis Spark Summit Talk: http://dbricks.co/summit-3aps
  73. 73. Project Tungsten II
  74. 74. Project Tungsten • Substantially speed up execution by optimizing CPU efficiency, via: SPARK-12795 (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
  75. 75. Tungsten's Compact Row Format: (123, "data", "bricks") is laid out as a null bitmap (0x0), the value 123, offsets to the variable-length data (32L, 48L), then the field lengths and bytes (4 "data", 6 "bricks").
  76. 76. Encoders: the JVM object MyClass(123, "data", "bricks") corresponds to the internal representation 0x0 123 32L 48L 4 "data" 6 "bricks". Encoders translate between domain objects and Spark's internal format.
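A minimal sketch of encoders in action: with a case class in scope, toDS() picks up a product encoder implicitly and the rows live in Tungsten's compact format rather than as JVM objects (MyClass here is illustrative):

    import spark.implicits._                 // brings the implicit encoders into scope
    import org.apache.spark.sql.Encoders

    case class MyClass(id: Int, word: String, org: String)

    val ds = Seq(MyClass(123, "data", "bricks")).toDS()   // encoded into the internal row format
    ds.filter(_.id == 123).show()

    val enc = Encoders.product[MyClass]      // an encoder can also be requested explicitly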
  77. 77. Workshop: Notebook on DataFrames/Datasets & Spark SQL • Import Notebook into your Spark 2.x Cluster – http://dbricks.co/sqlds_wkshp2 (optional) – http://dbricks.co/sqldf_wkshp2 (python) (optional) – http://dbricks.co/data_mounts (python) – http://dbricks.co/iotds_wkshp3 – https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset • Work through each Notebook cell • Try challenges • Break..
  78. 78. Introduction to Structured Streaming Concepts
  79. 79. building robust stream processing apps is hard
  80. 80. Complexities in stream processing COMPLEX DATA Diverse data formats (json, avro, binary, …) Data can be dirty, late, out-of-order COMPLEX SYSTEMS Diverse storage systems (Kafka, S3, Kinesis, RDBMS, …) System failures COMPLEX WORKLOADS Combining streaming with interactive queries Machine learning
  81. 81. Structured Streaming stream processing on Spark SQL engine fast, scalable, fault-tolerant rich, unified, high level APIs deal with complex data and complex workloads rich ecosystem of data sources integrate with many storage systems
  82. 82. you should not have to reason about streaming
  83. 83. Treat Streams as Unbounded Tables: a data stream is an unbounded input table; new data in the data stream = new rows appended to an unbounded table
  84. 84. you should write simple queries & Spark should continuously update the answer
  85. 85. DataFrames, Datasets, SQL input = spark.readStream .format("kafka") .option("subscribe", "topic") .load() result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") Logical Plan: Read from Kafka, Project device, signal, Filter signal > 15, Write to Parquet. Apache Spark automatically streamifies! Spark SQL converts the batch-like query into a series of incremental execution plans operating on new batches of data [diagram: Kafka Source → Optimized Operators (codegen, off-heap, etc.) → Parquet Sink; at t = 1, 2, 3 the optimized physical plan processes new data]
  86. 86. Streaming word count Anatomy of a Streaming Query
  87. 87. Anatomy of a Streaming Query: Step 1 spark.readStream .format("kafka") .option("subscribe", "input") .load() Source • Specify one or more locations to read data from • Built-in support for Files/Kafka/Socket, pluggable.
  88. 88. Anatomy of a Streaming Query: Step 2 spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) Transformation • Using DataFrames, Datasets and/or SQL. • Internal processing is always exactly-once.
  89. 89. Anatomy of a Streaming Query: Step 3 spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode(OutputMode.Complete()) .option("checkpointLocation", "…") .start() Sink • Accepts the output of each batch. • Where supported, sinks are transactional and exactly-once (e.g. Files). • Use foreach to execute arbitrary code.
  90. 90. Anatomy of a Streaming Query: Output Modes spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Output mode – what is output • Complete – output the whole answer every time • Update – output changed rows • Append – output new rows only. Trigger – when to output • Specified as a time, eventually supports data size • No trigger means as fast as possible
  91. 91. Anatomy of a Streaming Query: Checkpoint spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Checkpoint • Tracks the progress of a query in persistent storage • Can be used to restart the query if there is a failure.
  92. 92. Fault-tolerance with Checkpointing. Checkpointing tracks the progress (offsets) of consuming data from the source and the intermediate state. Offsets and metadata are saved as JSON. Can resume after changing your streaming transformations. End-to-end exactly-once guarantees. [diagram: at t = 1, 2, 3 each trigger processes new data and records its progress in a write-ahead log]
  93. 93. Complex Streaming ETL
  94. 94. Traditional ETL: raw, dirty, un/semi-structured data is dumped as files; periodic jobs run every few hours to convert the raw data into structured data ready for further analytics [diagram: file dump → (hours) → table]
  95. 95. Traditional ETL: hours of delay before taking decisions on the latest data; unacceptable when time is of essence [intrusion detection, anomaly detection, etc.]
  96. 96. Streaming ETL w/ Structured Streaming: Structured Streaming enables raw data to be available as structured data as soon as possible [diagram: stream → (seconds) → table]
  97. 97. Streaming ETL w/ Structured Streaming Example: JSON data being received in Kafka; parse the nested JSON and flatten it; store in a structured Parquet table; get end-to-end failure guarantees. val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load() val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") val query = parsedData.writeStream .option("checkpointLocation", "/checkpoint") .partitionBy("date") .format("parquet") .start("/parquetTable")
  98. 98. Reading from Kafka: specify options to configure. How? kafka.bootstrap.servers => broker1,broker2. What? subscribe => topic1,topic2,topic3 // fixed list of topics; subscribePattern => topic* // dynamic list of topics; assign => {"topicA":[0,1] } // specific partitions. Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345} } val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load()
  99. 99. Reading from Kafka val rawDataDF = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load() The rawDataDF dataframe has the following columns: key, value, topic, partition, offset, timestamp; e.g. [binary] [binary] "topicA" 0 345 1486087873 and [binary] [binary] "topicB" 3 2890 1486086721
  100. 100. Transforming Data Cast binary value to string Name it column json val parsedDataDF = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*")
  101. 101. Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") json { "timestamp": 1486087873, "device": "devA", …} { "timestamp": 1486082418, "device": "devX", …} data (nested) timestamp device … 1486087873 devA … 1486086721 devX … from_json("json") as "data"
  102. 102. Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns val parsedDataDF = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") data (nested) timestamp device … 1486087873 devA … 1486086721 devX … timestamp device … 1486087873 devA … 1486086721 devX … select("data.*") (not nested)
  103. 103. Transforming Data: cast binary value to string, name it column json; parse the json string and expand into nested columns, name it data; flatten the nested columns. val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, ... 100s of functions (see our blog post & tutorial)
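The schema handed to from_json has to be declared up front; a minimal sketch of what it might look like for the device events above (field names and types are illustrative, not from the deck):

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions.from_json
    import spark.implicits._

    val schema = new StructType()
      .add("timestamp", LongType)      // epoch seconds in the incoming JSON
      .add("device", StringType)
      .add("signal", DoubleType)

    val parsed = rawDataDF
      .selectExpr("cast (value as string) as json")
      .select(from_json($"json", schema).as("data"))
      .select("data.*")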
  104. 104. Writing to Parquet: save the parsed data as a Parquet table in the given path; partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data. val query = parsedData.writeStream .option("checkpointLocation", ...) .partitionBy("date") .format("parquet") .start("/parquetTable") //pathname
  105. 105. Checkpointing Enable checkpointing by setting the checkpoint location to save offset logs start actually starts a continuous running StreamingQuery in the Spark cluster val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/")
  106. 106. Streaming Query: query is a handle to the continuously running StreamingQuery, used to monitor and manage the execution. val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable") [diagram: the StreamingQuery processes new data at t = 1, 2, 3]
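A sketch of what the handle is typically used for (these are standard StreamingQuery methods):

    query.status            // is the query actively processing or waiting for data?
    query.lastProgress      // metrics for the most recent trigger (input rate, batch duration, ...)
    query.recentProgress    // recent progress reports
    query.explain()         // print the physical plan of the streaming query
    query.stop()            // stop the continuous execution
    // query.awaitTermination()  // block the caller until the query terminates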
  107. 107. More Kafka Support [Spark 2.2] Write out to Kafka DataFrame must have binary fields named key and value Direct, interactive and batch queries on Kafka Makes Kafka even more powerful as a storage platform! result.writeStream .format("kafka") .option("topic", "output") .start() val df = spark .read // not readStream .format("kafka") .option("subscribe", "topic") .load() df.createOrReplaceTempView("topicData") spark.sql("select value from topicData")
  108. 108. Amazon Kinesis [Databricks Runtime 3.0] Configure with options (similar to Kafka). How? region => us-west-2 / us-east-1 / ... awsAccessKey (optional) => AKIA... awsSecretKey (optional) => ... What? streamName => name-of-the-stream Where? initialPosition => latest (default) / earliest / trim_horizon spark.readStream .format("kinesis") .option("streamName", "myStream") .option("region", "us-west-2") .option("awsAccessKey", ...) .option("awsSecretKey", ...) .load()
  109. 109. Working With Time
  110. 110. Event Time. Many use cases require aggregate statistics by event time, e.g. what is the number of errors in each system in 1-hour windows? Many challenges: extracting event time from the data, handling late and out-of-order data; the DStream APIs were insufficient for event-time processing.
  111. 111. Event time Aggregations Windowing is just another type of grouping in Struct. Streaming number of records every hour Support UDAFs! parsedData .groupBy(window("timestamp","1 hour")) .count() parsedData .groupBy( "device", window("timestamp","10 mins")) .avg("signal") avg signal strength of each device every 10 mins
  112. 112. Stateful Processing for Aggregations: aggregates have to be saved as distributed state between triggers; each trigger reads the previous state and writes updated state; state is stored in memory, backed by a write-ahead log in HDFS/S3; fault-tolerant, exactly-once guarantee! [diagram: at t = 1, 2, 3 each trigger processes new data from source to sink, with state updates written to the log for checkpointing]
  113. 113. Automatically handles Late Data [diagram: hourly window counts at successive triggers from 13:00 to 17:00, with the counts of old windows (e.g. 12:00 - 13:00, 14:00 - 15:00) revised upward as late records arrive] Keeping state allows late data to update the counts of old windows (red = state updated with late data), but the size of the state increases indefinitely if old windows are not dropped.
  114. 114. Watermarking: the watermark trails the max event time by an allowed lateness of 10 mins; late data within the watermark is allowed to aggregate, data that is too late is dropped. parsedDataDF .withWatermark("timestamp", "10 minutes") .groupBy(window("timestamp", "5 minutes")) .count() Useful only in stateful operations (streaming aggs, dropDuplicates, mapGroupsWithState, ...); ignored in non-stateful streaming queries and batch queries.
  115. 115. What else…
  116. 116. Arbitrary Stateful Operations [Spark 2.2] mapGroupsWithState allows applying any user-defined stateful function to a user-defined state; direct support for per-key timeouts in event-time or processing-time; supports Scala and Java. ds.groupByKey(_.id) .mapGroupsWithState(timeoutConf)(mappingWithStateFunc) def mappingWithStateFunc( key: K, values: Iterator[V], state: GroupState[S]): U = { // update or remove state // set timeouts // return mapped value }
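A minimal sketch of such a stateful function keeping a running count per key (the Event and RunningCount case classes, and the events Dataset, are illustrative):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
    import spark.implicits._

    case class Event(id: String, count: Long)
    case class RunningCount(total: Long)

    // events: Dataset[Event] parsed from the input stream
    val runningCounts = events
      .groupByKey(_.id)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
        (id: String, values: Iterator[Event], state: GroupState[RunningCount]) =>
          val previous = state.getOption.map(_.total).getOrElse(0L)
          val updated  = RunningCount(previous + values.map(_.count).sum)
          state.update(updated)   // state is saved between triggers and checkpointed
          (id, updated.total)     // the mapped value emitted for this key
      }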
  117. 117. Arbitrary Stateful Operations [Spark 2.2]
  118. 118. Other interesting operations. Streaming Deduplication (watermarks to limit state): parsedDataDF.dropDuplicates("eventId"). Stream-batch Joins: val batchDataDF = spark.read .format("parquet") .load("/additional-data") //join with stream DataFrame parsedDataDF.join(batchData, "device"). Stream-stream Joins: can use mapGroupsWithState; direct support coming soon in Apache Spark 2.3!
  119. 119. Arbitrary Stateful Processing http://dbricks.co/asp_stream
  120. 120. More Info Structured Streaming Programming Guide http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Anthology of Technical Assets on Structured Streaming http://dbricks.co/ss_anthology
  121. 121. Resources • Getting Started Guide with Apache Spark on Databricks • docs.databricks.com • Spark Programming Guide • Structured Streaming Programming Guide • Databricks Engineering Blogs • sparkhub.databricks.com • spark-packages.org
  122. 122. http://dbricks.co/spark-guide
  123. 123. https://databricks.com/spark-live (Nov 16)
  124. 124. https://databricks.com/company/careers
  125. 125. Do you have any questions for my prepared answers?
  126. 126. Demo & Workshop: Structured Streaming • Import Notebook into your Spark 2.2 Cluster • http://dbricks.co/iotss_wkshp4 • Done!
