Spark SQL Deep Dive
Michael Armbrust
Melbourne Spark Meetup – June 1st 2015
What is Apache Spark?
Fast and general cluster computing system,
interoperable with Hadoop, included in all major
distros
Improves efficiency through:
>  In-memory computing primitives
>  General computation graphs
Improves usability through:
>  Rich APIs in Scala, Java, Python
>  Interactive shell
Up to 100× faster
(2-10× on disk)
2-5× less code
Spark Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
>  Collections of objects that can be stored in memory
or disk across a cluster
>  Parallel functional transformations (map, filter, …)
>  Automatically rebuilt on failure
More than Map & Reduce
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
On-Disk Sort Record: Time to sort 100TB
Source: Daytona GraySort benchmark, sortbenchmark.org
2013 Record: Hadoop, 72 minutes, 2100 machines
2014 Record: Spark, 23 minutes, 207 machines
Also sorted 1PB in 4 hours
Spark “Hall of Fame”
LARGEST SINGLE-DAY INTAKE: Tencent (1PB+ /day)
LONGEST-RUNNING JOB: Alibaba (1 week on 1PB+ data)
LARGEST SHUFFLE: Databricks PB Sort (1PB)
MOST INTERESTING APP: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)
LARGEST CLUSTER: Tencent (8000+ nodes)
Based on Reynold Xin’s personal knowledge
Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns
val lines = spark.textFile("hdfs://...")            // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()           // Action
messages.filter(_.contains("bar")).count()
. . .

[Diagram: the Driver ships tasks to three Workers; each Worker reads one block of lines, builds its partition of messages, caches it (Cache 1-3), and returns results for each count().]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
A General Stack
Spark core, plus: Spark Streaming (real-time), Spark SQL, GraphX (graph), MLlib (machine learning), …
Powerful Stack – Agile Development
[Chart: non-test, non-example source lines (0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL) and Giraph (Graph), compared with Spark, whose bar is built up from core plus the Streaming, SparkSQL and GraphX components. Your App?]
About Spark SQL
>  Part of the core distribution since Spark 1.0 (April 2014)
>  Runs SQL / HiveQL queries, including UDFs, UDAFs and SerDes
>  Connects existing BI tools to Spark through JDBC
>  Bindings in Python, Scala, and Java

[Charts: # of commits per month (0-250) and # of contributors (0-200), 2014-03 through 2015-06.]

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
The not-so-secret truth… Spark SQL is not about SQL.
SQL: The whole story
Create and Run Spark Programs Faster:
>  Write less code
>  Read less data
>  Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
1.  A distributed collection of rows
organized into named columns.
2.  An abstraction for selecting, filtering,
aggregating and plotting structured
data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD
(cf. Spark < 1.3).
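A minimal sketch of these ideas in Scala (assuming a SQLContext named sqlContext and a hypothetical JSON file of people records, not anything from the deck):

import org.apache.spark.sql.functions.avg

val people = sqlContext.read.json("/path/to/people.json")
people.printSchema()                  // rows organized into named, typed columns
people.filter(people("age") > 21)     // selecting and filtering
      .groupBy("name")
      .agg(avg("age"))                // aggregating
      .show()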
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write
DataFrames using a variety of formats.
[Built-in and external formats: { JSON }, JDBC, and more…]
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")

•  read and write functions create new builders for doing I/O
•  Builder methods are used to specify: format, partitioning, handling of existing data, and more
•  load(…), save(…) or saveAsTable(…) complete the builder and perform the actual read or write
ETL Using Custom Data Sources

sqlContext.read
  .format("com.databricks.spark.git")
  .option("url", "https://github.com/apache/spark.git")
  .option("numPartitions", "100")
  .option("branches", "master,branch-1.3,branch-1.2")
  .load()
  .repartition(1)
  .write
  .format("json")
  .save("/home/michael/spark.json")
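Under the hood, a source like this plugs into the Data Source API. A rough sketch of what such a provider can look like in Scala (Spark 1.x sources API; the "numbers" source and its option are hypothetical, not the spark.git package above):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Resolved by .format("<package containing DefaultSource>")
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new NumbersRelation(sqlContext, parameters.getOrElse("count", "10").toInt)
}

// A relation exposes a schema and, via TableScan, an RDD[Row] of its data
class NumbersRelation(val sqlContext: SQLContext, count: Int)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("n", IntegerType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to count).map(Row(_))
}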
  	
  
Write Less Code: Powerful Operations
Common operations can be expressed concisely
as calls to the DataFrame API:
•  Selecting required columns
•  Joining different data sources
•  Aggregation (count, sum, average, etc)
•  Filtering
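A sketch combining these four operations in Scala (table, path and column names are hypothetical):

val users  = sqlContext.table("users")
val events = sqlContext.read.json("/data/events")

events.join(users, events("user_id") === users("user_id"))   // join different sources
      .filter(users("country") === "AU")                      // filter
      .groupBy(users("name"))                                 // aggregate
      .count()
      .select("name", "count")                                // select required columns
      .show()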
Write Less Code: Compute an Average
Hadoop MapReduce (Java):

private IntWritable one =
    new IntWritable(1);
private IntWritable output =
    new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Using RDDs (Python):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames (Python):

from pyspark.sql.functions import avg

sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

Using SQL:

SELECT name, avg(age)
FROM people
GROUP BY name
Not Just Less Code: Faster Implementations
[Chart: time to aggregate 10 million int pairs (secs, 0-10 scale) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL.]
Demo: Data Frames
Using Spark SQL to read, write, slice and dice your data using simple functions
Read Less Data
Spark SQL can help you read less data automatically:
•  Converting to more efficient formats
•  Using columnar formats (e.g., Parquet)
•  Using partitioning (e.g., /year=2014/month=02/…)
•  Skipping data using statistics (e.g., min, max)
•  Pushing predicates into storage systems (e.g., JDBC)

Optimization happens as late as possible, therefore Spark SQL can optimize across functions.
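A short sketch in Scala of how partitioning and predicate pushdown let Spark SQL skip data at read time (paths and column names are hypothetical):

// Directory layout /data/events/year=2014/month=02/… drives partition pruning
val events = sqlContext.read.parquet("/data/events")

events
  .filter(events("year") === 2014 && events("month") === 2)  // only matching partitions are read
  .select("timestamp", "user_id")                            // only these columns are scanned
  .count()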
def add_demographics(events):
    u = sqlCtx.table("users")                        # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)        # Join on user_id
        .withColumn("city", zipToCity(u.zip)))       # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = (events.where(events.city == "Palo Alto")
                       .select(events.timestamp).collect())

[Logical Plan: filter over a join of the events file and the users table. Physical Plan: join of scan (events) with filter over scan (users). The join is expensive; ideally we would only join the relevant users.]
def add_demographics(events):
    u = sqlCtx.table("users")                        # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)        # Join on user_id
        .withColumn("city", zipToCity(u.zip)))       # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = (events.where(events.city == "Palo Alto")
                       .select(events.timestamp).collect())

[Logical Plan: unchanged, a filter over a join of the events file and the users table. Physical Plan with Predicate Pushdown and Column Pruning: join of an optimized scan (events) with an optimized scan (users).]
Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3; fitting the Pipeline produces a Pipeline Model containing lr.model.]
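For reference, a sketch of the same pipeline using the Scala spark.ml API (the input path is hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val df    = sqlContext.read.json("/path/to/data")
val model = pipeline.fit(df)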
So how does it all work?
Plan Optimization & Execution

[Pipeline: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (consulting the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning produces candidate Physical Plans, from which a Cost Model selects one; Code Generation turns the Selected Physical Plan into RDDs.]

DataFrames and SQL share the same optimization/execution pipeline
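You can inspect these stages yourself; a small sketch in Scala (assuming a SQLContext named sqlContext and a hypothetical people.json):

val people = sqlContext.read.json("/path/to/people.json")
val query  = people.filter(people("id") === 1).select("name")

query.explain(true)            // prints the parsed, analyzed, optimized and physical plans
println(query.queryExecution)  // the same pipeline, as a single string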
An example query

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

[Logical Plan: Project(name) over Filter(id = 1) over Project(id, name) over People]
Naïve Query Planning

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

[Logical Plan: Project(name) over Filter(id = 1) over Project(id, name) over People.
Physical Plan: the same operator tree, executed over a full TableScan of People.]
Optimized Execution

Writing imperative code to optimize all possible patterns is hard.

Instead write simple rules:
•  Each rule makes one change
•  Run many rules together to fixed point.

[Logical Plan: Project(name) over Filter(id = 1) over Project(id, name) over People.
Physical Plan: a single IndexLookup(id = 1, return: name).]
Prior Work: Optimizer Generators

Volcano / Cascades:
•  Create a custom language for expressing rules that rewrite trees of relational operators.
•  Build a compiler that generates executable code for these rules.

Cons: Developers need to learn this custom language. Language might not be powerful enough.
TreeNode Library

Easily transformable trees of operators
•  Standard collection functionality – foreach, map, collect, etc.
•  transform function – recursive modification of tree fragments that match a pattern.
•  Debugging support – pretty printing, splicing, etc.
Tree Transformations

Developers express tree transformations as PartialFunction[TreeType, TreeType]:
1.  If the function does apply to an operator, that operator is replaced with the result.
2.  When the function does not apply to an operator, that operator is left unchanged.
3.  The transformation is applied recursively to all children.
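A toy, self-contained Scala illustration of that contract (this is not Catalyst's TreeNode, just a hypothetical two-node expression tree showing how a partial-function rule rewrites matching nodes and leaves the rest alone):

sealed trait Expr {
  // Recurse into children first, then try the rule on the rebuilt node
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = this match {
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, (e: Expr) => e)
  }
}
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Example rule: constant folding
val folded = Add(Lit(1), Add(Lit(2), Lit(3))).transform {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}
// folded == Lit(6)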
Writing Rules as Tree Transformations

1.  Find filters on top of projections.
2.  Check that the filter can be evaluated without the result of the project.
3.  If so, switch the operators.

[Diagram: Original Plan: Project(name) over Filter(id = 1) over Project(id, name) over People. After Filter Push-Down: Project(name) over Project(id, name) over Filter(id = 1) over People.]
Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

Reading the rule piece by piece:
•  queryPlan transform { … }: apply a partial function to the tree
•  case f @ Filter(_, p @ Project(_, grandChild)): find a Filter on top of a Project (Scala pattern matching)
•  if (f.references subsetOf grandChild.output): check that the filter can be evaluated without the result of the project (Catalyst attribute reference tracking)
•  p.copy(child = f.copy(child = grandChild)): if so, switch the order (Scala copy constructors)
Optimizing with Rules

[Diagram: Original Plan: Project(name) over Filter(id = 1) over Project(id, name) over People. Filter Push-Down moves the filter below the projections. Combine Projection collapses the two projections, giving Project(name) over Filter(id = 1) over People. The final Physical Plan is an IndexLookup(id = 1, return: name).]
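In Catalyst, batches of such rules are applied repeatedly until the plan stops changing. A minimal sketch of that fixed-point loop in Scala (illustrative only; not the real RuleExecutor API):

def toFixedPoint[T](plan: T, rules: Seq[T => T], maxIterations: Int = 100): T = {
  var current = plan
  var changed = true
  var iteration = 0
  while (changed && iteration < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))  // apply each rule once
    changed = next != current                                  // stop when nothing changed
    current = next
    iteration += 1
  }
  current
}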
Future Work – Project Tungsten

Consider “abcd” – 4 bytes with UTF8 encoding, but as a java.lang.String:

java.lang.String object internals:
 OFFSET  SIZE    TYPE  DESCRIPTION       VALUE
      0     4          (object header)   ...
      4     4          (object header)   ...
      8     4          (object header)   ...
     12     4  char[]  String.value      []
     16     4  int     String.hash       0
     20     4  int     String.hash32     0
Instance size: 24 bytes (reported by Instrumentation API)
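This layout looks like output from the OpenJDK Java Object Layout (JOL) tool; a small sketch to reproduce it, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

import org.openjdk.jol.info.ClassLayout

object StringLayout {
  def main(args: Array[String]): Unit =
    // Prints field offsets, sizes and the total instance size for java.lang.String
    println(ClassLayout.parseClass(classOf[String]).toPrintable)
}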
  
Project Tungsten
Overcome JVM limitations:
•  Memory Management and Binary Processing:
leveraging application semantics to manage
memory explicitly and eliminate the overhead of
JVM object model and garbage collection
•  Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
•  Code generation: using code generation to exploit
modern compilers and CPUs
Questions?
Learn more at: http://spark.apache.org/docs/latest/
Get Involved: https://github.com/apache/spark

More Related Content

What's hot

OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...Altinity Ltd
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 

What's hot (20)

OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Presto
PrestoPresto
Presto
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 

Similar to Spark SQL Deep Dive @ Melbourne Spark Meetup

Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Databricks
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 

Similar to Spark SQL Deep Dive @ Melbourne Spark Meetup (20)

Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Data modeling 101 - Basics - Software Domain
Data modeling 101 - Basics - Software DomainData modeling 101 - Basics - Software Domain
Data modeling 101 - Basics - Software DomainAbdul Ahad
 
Copilot para Microsoft 365 y Power Platform Copilot
Copilot para Microsoft 365 y Power Platform CopilotCopilot para Microsoft 365 y Power Platform Copilot
Copilot para Microsoft 365 y Power Platform CopilotEdgard Alejos
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 

Recently uploaded (20)

Data modeling 101 - Basics - Software Domain
Data modeling 101 - Basics - Software DomainData modeling 101 - Basics - Software Domain
Data modeling 101 - Basics - Software Domain
 
Copilot para Microsoft 365 y Power Platform Copilot
Copilot para Microsoft 365 y Power Platform CopilotCopilot para Microsoft 365 y Power Platform Copilot
Copilot para Microsoft 365 y Power Platform Copilot
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 

Spark SQL Deep Dive @ Melbourne Spark Meetup

  • 1. Spark SQL Deep Dive Michael Armbrust Melbourne Spark Meetup – June 1st 2015
  • 2. What is Apache Spark? Fast and general cluster computing system, interoperable with Hadoop, included in all major distros Improves efficiency through: >  In-memory computing primitives >  General computation graphs Improves usability through: >  Rich APIs in Scala, Java, Python >  Interactive shell Up to 100× faster (2-10× on disk) 2-5× less code
  • 3. Spark Model Write programs in terms of transformations on distributed datasets Resilient Distributed Datasets (RDDs) >  Collections of objects that can be stored in memory or disk across a cluster >  Parallel functional transformations (map, filter, …) >  Automatically rebuilt on failure
  • 4. More than Map & Reduce map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
  • 5. 5   On-Disk Sort Record: Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 6. 6   Spark “Hall of Fame” LARGEST SINGLE-DAY INTAKE LONGEST-RUNNING JOB LARGEST SHUFFLE MOST INTERESTING APP Tencent (1PB+ /day) Alibaba (1 week on 1PB+ data) Databricks PB Sort (1PB) Jeremy Freeman Mapping the Brain at Scale (with lasers!) LARGEST CLUSTER Tencent (8000+ nodes) Based on Reynold Xin’s personal knowledge
  • 7. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns: val lines = spark.textFile("hdfs://...") val errors = lines.filter(_ startsWith "ERROR") val messages = errors.map(_.split("\t")(2)) messages.cache() messages.filter(_ contains "foo").count() messages.filter(_ contains "bar").count() [Diagram: the driver ships tasks to three workers; each worker reads a block of lines from HDFS (base RDD), builds messages (transformed RDD), caches its partition, and returns results for each action] Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
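As a minimal, self-contained sketch of the same log-mining pattern: the object wrapper, app name, and input path below are made up for illustration, and the slide’s spark handle is taken to be a SparkContext (this was a Spark 1.x talk).

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("log-mining"))

        // Base RDD: one element per log line (path is hypothetical).
        val lines = sc.textFile("hdfs:///logs/app.log")

        // Transformed RDDs: keep ERROR lines, project the third tab-separated field.
        val errors   = lines.filter(_.startsWith("ERROR"))
        val messages = errors.map(_.split("\t")(2))

        // Mark for in-memory caching; materialized by the first action.
        messages.cache()

        // Actions: each count() reuses the cached partitions.
        val foos = messages.filter(_.contains("foo")).count()
        val bars = messages.filter(_.contains("bar")).count()
        println(s"foo=$foos bar=$bars")

        sc.stop()
      }
    }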
  • 11. Powerful Stack – Agile Development [Bar chart: non-test, non-example source lines for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark; the Spark bar includes Streaming and Spark SQL]
  • 12. Powerful Stack – Agile Development [Same chart, with GraphX added alongside Streaming and Spark SQL in the Spark bar]
  • 13. Powerful Stack – Agile Development [Same chart, with GraphX, Streaming, and Spark SQL, plus “Your App?”]
  • 14. About SQL. Spark SQL >  Part of the core distribution since Spark 1.0 (April 2014) [Charts: # of commits per month and # of contributors, 2014-03 through 2015-06]
  • 15. SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data) Spark SQL >  Part of the core distribution since Spark 1.0 (April 2014) >  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes. About SQL
  • 16. Spark SQL >  Part of the core distribution since Spark 1.0 (April 2014) >  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes >  Connect existing BI tools to Spark through JDBC. About SQL
  • 17. Spark SQL >  Part of the core distribution since Spark 1.0 (April 2014) >  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes >  Connect existing BI tools to Spark through JDBC >  Bindings in Python, Scala, and Java. About SQL
  • 18. The not-so-secret truth… SQL is not about SQL.
  • 19. SQL: The whole story Create and Run Spark Programs Faster: >  Write less code >  Read less data >  Let the optimizer do the hard work
  • 20. DataFrame noun – [dey-tuh-freym] 1.  A distributed collection of rows organized into named columns. 2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
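A minimal Scala sketch of that definition in practice, assuming an existing SparkContext named sc and a made-up Person case class; the column names come from the case class fields.

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)     // hypothetical schema, for illustration

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A distributed collection of rows organized into named columns "name" and "age".
    val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 36))).toDF()

    people.printSchema()                           // shows the named, typed columns
    people.filter(people("age") > 35).show()       // select/filter without writing SQL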
  • 21. Write Less Code: Input & Output. Spark SQL’s Data Source API can read and write DataFrames using a variety of formats. [Format logos: built-in, including { JSON }; external, including JDBC; and more…]
  • 22. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df  =  sqlContext.read        .format("json")        .option("samplingRatio",  "0.1")        .load("/home/michael/data.json")     df.write        .format("parquet")        .mode("append")        .partitionBy("year")        .saveAsTable("fasterData")    
  • 23. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df  =  sqlContext.read        .format("json")        .option("samplingRatio",  "0.1")        .load("/home/michael/data.json")     df.write        .format("parquet")        .mode("append")        .partitionBy("year")        .saveAsTable("fasterData")     read and write   functions create new builders for doing I/O
  • 24. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df  =  sqlContext.read        .format("json")        .option("samplingRatio",  "0.1")        .load("/home/michael/data.json")     df.write        .format("parquet")        .mode("append")        .partitionBy("year")        .saveAsTable("fasterData")     Builder methods are used to specify: •  Format •  Partitioning •  Handling of existing data •  and more
  • 25. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df  =  sqlContext.read        .format("json")        .option("samplingRatio",  "0.1")        .load("/home/michael/data.json")     df.write        .format("parquet")        .mode("append")        .partitionBy("year")        .saveAsTable("fasterData")     load(…), save(…) or saveAsTable(…)   functions create new builders for doing I/O
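The same reader/writer pattern as a Scala sketch. The path, sampling ratio, and table name are placeholders from the slide, and a HiveContext (or some configured metastore) is assumed so that saveAsTable has somewhere to register the table.

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)     // sc: an existing SparkContext

    // read returns a DataFrameReader builder; options are format-specific.
    val df = sqlContext.read
      .format("json")
      .option("samplingRatio", "0.1")        // infer the schema from a 10% sample
      .load("/home/michael/data.json")

    // write returns a DataFrameWriter; mode and partitioning control how existing data is handled.
    df.write
      .format("parquet")
      .mode("append")
      .partitionBy("year")                   // assumes the data has a "year" column
      .saveAsTable("fasterData")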
  • 26. ETL Using Custom Data Sources sqlContext.read .format("com.databricks.spark.git") .option("url", "https://github.com/apache/spark.git") .option("numPartitions", "100") .option("branches", "master,branch-1.3,branch-1.2") .load() .repartition(1) .write .format("json") .save("/home/michael/spark.json")
  • 27. Write Less Code: Powerful Operations Common operations can be expressed concisely as calls to the DataFrame API: •  Selecting required columns •  Joining different data sources •  Aggregation (count, sum, average, etc) •  Filtering 27  
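A brief Scala sketch of those four operations together; the events/users inputs, their columns, and the join key are invented for illustration, and an existing sqlContext is assumed.

    import org.apache.spark.sql.functions.avg

    // Hypothetical inputs: events(user_id, page, duration), users(user_id, city)
    val events = sqlContext.read.format("json").load("/data/events.json")
    val users  = sqlContext.table("users")

    val avgByCity = events
      .join(users, events("user_id") === users("user_id"))   // join different data sources
      .filter(events("duration") > 10)                        // filter rows
      .select(users("city"), events("duration"))              // select required columns
      .groupBy("city")
      .agg(avg("duration").as("avg_duration"))                // aggregate

    avgByCity.show()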
  • 28. Write Less Code: Compute an Average private IntWritable one = new IntWritable(1) private IntWritable output = new IntWritable() protected void map( LongWritable key, Text value, Context context) { String[] fields = value.toString().split("\t") output.set(Integer.parseInt(fields[1])) context.write(one, output) } IntWritable one = new IntWritable(1) DoubleWritable average = new DoubleWritable() protected void reduce( IntWritable key, Iterable<IntWritable> values, Context context) { int sum = 0 int count = 0 for(IntWritable value : values) { sum += value.get() count++ } average.set(sum / (double) count) context.write(key, average) } data = sc.textFile(...).split("\t") data.map(lambda x: (x[0], [x[1], 1])) .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) .map(lambda x: [x[0], x[1][0] / x[1][1]]) .collect()
  • 29. Write Less Code: Compute an Average Using RDDs data = sc.textFile(...).split("\t") data.map(lambda x: (x[0], [int(x[1]), 1])) .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) .map(lambda x: [x[0], x[1][0] / x[1][1]]) .collect() Using DataFrames sqlCtx.table("people") .groupBy("name") .agg("name", avg("age")) .collect() Using SQL SELECT name, avg(age) FROM people GROUP BY name
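For comparison, a Scala sketch of the DataFrame and SQL versions, assuming an existing sqlContext and a registered people table with name and age columns.

    import org.apache.spark.sql.functions.avg

    // DataFrame API: group and aggregate, letting Catalyst plan the execution.
    val byName = sqlContext.table("people")
      .groupBy("name")
      .agg(avg("age").as("avg_age"))
    byName.collect()

    // Equivalent SQL over the same table; both go through the same optimizer.
    sqlContext.sql("SELECT name, avg(age) FROM people GROUP BY name").collect()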
  • 30. Not Just Less Code: Faster Implementations [Bar chart: time to aggregate 10 million int pairs (secs) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
  • 31. Demo: Data Frames. Using Spark SQL to read, write, slice and dice your data using simple functions
  • 32. Read Less Data Spark SQL can help you read less data automatically: •  Converting to more efficient formats •  Using columnar formats (i.e. parquet) •  Using partitioning (i.e., /year=2014/month=02/…) •  Skipping data using statistics (i.e., min, max) •  Pushing predicates into storage systems (i.e., JDBC)
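A short Scala sketch of two of these techniques together, columnar Parquet plus directory partitioning; the events DataFrame, path, and column names are assumptions for illustration. The year filter below can be answered largely by pruning partition directories rather than scanning everything.

    import org.apache.spark.sql.functions.col

    // Write events as Parquet, partitioned by year
    // (produces directories like /data/events/year=2014/...).
    events.write
      .format("parquet")
      .partitionBy("year")
      .save("/data/events")

    // Reading back: the filter on the partition column prunes directories, and
    // Parquet's columnar layout means only the selected columns are read.
    val events2014 = sqlContext.read
      .parquet("/data/events")
      .where(col("year") === 2014)
      .select("timestamp", "user_id")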
  • 33. Optimization happens as late as possible, therefore Spark SQL can optimize across functions. 33
  • 34. def add_demographics(events): u = sqlCtx.table("users") # Load Hive table events .join(u, events.user_id == u.user_id) # Join on user_id .withColumn("city", zipToCity(df.zip)) # udf adds city column events = add_demographics(sqlCtx.load("/data/events", "json")) training_data = events.where(events.city == "Palo Alto") .select(events.timestamp).collect() [Logical plan: filter over a join of the events file with the users table; the join is expensive, so only join relevant users. Physical plan: join of scan (events) with filter over scan (users)]
  • 35. def add_demographics(events): u = sqlCtx.table("users") # Load partitioned Hive table events .join(u, events.user_id == u.user_id) # Join on user_id .withColumn("city", zipToCity(df.zip)) # Run udf to add city column events = add_demographics(sqlCtx.load("/data/events", "parquet")) training_data = events.where(events.city == "Palo Alto") .select(events.timestamp).collect() [Logical plan: same filter over join. Physical plan with predicate pushdown and column pruning: join of optimized scan (events) with optimized scan (users)]
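A Scala rendering of the same idea, with a fabricated zipToCity UDF and assumed users/events schemas; the point it illustrates is that the filter and select written after addDemographics are still optimized together with the code inside the function, because nothing executes until the action.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    // Hypothetical UDF, for illustration only.
    val zipToCity = udf((zip: String) => if (zip == "94301") "Palo Alto" else "Other")

    def addDemographics(events: DataFrame): DataFrame = {
      val u = sqlContext.table("users")                      // assumed users(user_id, zip)
      events
        .join(u, events("user_id") === u("user_id"))         // join on user_id
        .withColumn("city", zipToCity(u("zip")))             // UDF adds a city column
    }

    val events = sqlContext.read.format("parquet").load("/data/events")   // assumed events(user_id, timestamp, ...)
    val trainingData = addDemographics(events)
      .where(col("city") === "Palo Alto")
      .select("timestamp")

    // Only now is the whole plan, including the body of addDemographics,
    // analyzed and optimized as one query.
    trainingData.collect()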
  • 36. Machine Learning Pipelines tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol="words", outputCol="features") lr = LogisticRegression(maxIter=10, regParam=0.01) pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) df = sqlCtx.load("/path/to/data") model = pipeline.fit(df) [Diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3; the fitted stages, including lr.model, form a Pipeline Model]
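The same pipeline as a Scala sketch against spark.ml (the Spark 1.4-era API); the input path is a placeholder and the data is assumed to have text and label columns.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Each stage reads one DataFrame column and appends another, so stages compose.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the stages in order and returns a PipelineModel for scoring new data.
    val df     = sqlContext.read.format("json").load("/path/to/data")   // assumed "text" and "label" columns
    val model  = pipeline.fit(df)
    val scored = model.transform(df)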
  • 37. So how does it all work?
  • 38. Plan Optimization & Execution. SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs. DataFrames and SQL share the same optimization/execution pipeline
  • 39. An example query SELECT  name   FROM  (        SELECT  id,  name        FROM  People)  p   WHERE  p.id  =  1   Project name Project id,name Filter id = 1 People Logical Plan 39
  • 40. Naïve Query Planning SELECT  name   FROM  (        SELECT  id,  name        FROM  People)  p   WHERE  p.id  =  1   Project name Project id,name Filter id = 1 People Logical Plan Project name Project id,name Filter id = 1 TableScan People Physical Plan 40
  • 41. Optimized Execution Writing imperative code to optimize all possible patterns is hard. Project name Project id,name Filter id = 1 People Logical Plan Project name Project id,name Filter id = 1 People IndexLookup id = 1 return: name Logical Plan Physical Plan Instead write simple rules: •  Each rule makes one change •  Run many rules together to fixed point. 41
  • 42. Prior Work: Optimizer Generators. Volcano / Cascades: •  Create a custom language for expressing rules that rewrite trees of relational operators. •  Build a compiler that generates executable code for these rules. Cons: Developers need to learn this custom language. Language might not be powerful enough.
  • 43. TreeNode Library Easily transformable trees of operators •  Standard collection functionality – foreach, map, collect, etc. •  transform function – recursive modification of tree fragments that match a pattern. •  Debugging support – pretty printing, splicing, etc.
  • 44. Tree Transformations Developers express tree transformations as PartialFunction[TreeType,TreeType] 1.  If the function does apply to an operator, that operator is replaced with the result. 2.  When the function does not apply to an operator, that operator is left unchanged. 3.  The transformation is applied recursively to all children. 44
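To make the PartialFunction idea concrete, here is a small, self-contained Scala sketch of a toy expression tree with a transform method in the spirit of Catalyst's TreeNode; it is not Catalyst's actual code, and for simplicity it rewrites children before the parent.

    // Toy tree, for illustration only.
    sealed trait Expr {
      def children: Seq[Expr]
      def withChildren(newChildren: Seq[Expr]): Expr

      // Apply the rule where it is defined, leave other nodes unchanged,
      // and recurse into all children.
      def transform(rule: PartialFunction[Expr, Expr]): Expr = {
        val recursed = withChildren(children.map(_.transform(rule)))
        if (rule.isDefinedAt(recursed)) rule(recursed) else recursed
      }
    }

    case class Literal(value: Int) extends Expr {
      def children: Seq[Expr] = Nil
      def withChildren(newChildren: Seq[Expr]): Expr = this
    }

    case class Add(left: Expr, right: Expr) extends Expr {
      def children: Seq[Expr] = Seq(left, right)
      def withChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
    }

    // One rule, one change: fold an Add of two Literals into a Literal.
    val constantFold: PartialFunction[Expr, Expr] = {
      case Add(Literal(a), Literal(b)) => Literal(a + b)
    }

    // Add(Add(1, 2), 3) folds to Literal(6): the inner Add is rewritten first,
    // which then lets the outer Add match the rule as well.
    val folded = Add(Add(Literal(1), Literal(2)), Literal(3)).transform(constantFold)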
  • 45. Writing Rules as Tree Transformations 1.  Find filters on top of projections. 2.  Check that the filter can be evaluated without the result of the project. 3.  If so, switch the operators. Project name Project id,name Filter id = 1 People Original Plan Project name Project id,name Filter id = 1 People Filter Push-Down 45
  • 46. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) }
  • 47. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } Partial Function Tree
  • 48. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } Find Filter on Project
  • 49. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } Check that the filter can be evaluated without the result of the project.
  • 50. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } If so, switch the order.
  • 51. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } Scala: Pattern Matching
  • 52. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } Catalyst: Attribute Reference Tracking
  • 53. Filter Push Down Transformation val newPlan = queryPlan transform { case f @ Filter(_, p @ Project(_, grandChild)) if (f.references subsetOf grandChild.output) => p.copy(child = f.copy(child = grandChild)) } Scala: Copy Constructors
  • 54. Optimizing with Rules Project name Project id,name Filter id = 1 People Original Plan Project name Project id,name Filter id = 1 People Filter Push-Down Project name Filter id = 1 People Combine Projection IndexLookup id = 1 return: name Physical Plan 54
  • 55. Future Work – Project Tungsten. Consider "abcd" – 4 bytes with UTF8 encoding. java.lang.String object internals:
    OFFSET  SIZE  TYPE    DESCRIPTION       VALUE
         0     4          (object header)   ...
         4     4          (object header)   ...
         8     4          (object header)   ...
        12     4  char[]  String.value      []
        16     4  int     String.hash       0
        20     4  int     String.hash32     0
    Instance size: 24 bytes (reported by Instrumentation API)
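As a rough back-of-the-envelope reading of that layout, assuming a 64-bit JVM with compressed oops: the String object itself is 24 bytes, and its char[] value adds an array header of roughly 16 bytes plus 2 bytes per character (8 bytes for "abcd", padded), so the 4-byte UTF-8 payload ends up costing on the order of 48 bytes on the heap, around 10x overhead, before any garbage-collection cost is counted.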
  • 56. Project Tungsten Overcome JVM limitations: •  Memory Management and Binary Processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection •  Cache-aware computation: algorithms and data structures to exploit memory hierarchy •  Code generation: using code generation to exploit modern compilers and CPUs
  • 57. Questions? Learn more at: http://spark.apache.org/docs/latest/ Get Involved: https://github.com/apache/spark