Intro to Spark and Spark SQL
AMP Camp 2014
Michael Armbrust - @michaelarmbrust
What is Apache Spark?
Fast and general cluster computing system, interoperable with Hadoop and included in all major distros.
Improves efficiency through:
> In-memory computing primitives
> General computation graphs
Improves usability through:
> Rich APIs in Scala, Java, Python
> Interactive shell
Up to 100× faster in memory (2-10× on disk), with 2-5× less code.
Spark Model
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, …)
> Automatically rebuilt on failure
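
A minimal PySpark sketch of this model, assuming a running SparkContext sc (the HDFS path is hypothetical):

# Build an RDD from a text file and chain parallel functional transformations.
# Nothing runs until an action (count) fires; the recorded lineage lets Spark
# rebuild lost partitions automatically on failure.
words = (sc.textFile("hdfs://host/path/corpus.txt")   # hypothetical path
           .flatMap(lambda line: line.split())
           .filter(lambda w: len(w) > 3))
words.cache()           # keep the dataset in memory across queries
print(words.count())    # action: triggers the distributed computation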
More than Map & Reduce
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
A few of these are combined in the sketch below.
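
A hedged sketch using two operations that plain MapReduce lacks, on small in-memory RDDs (assumes a running SparkContext sc):

# reduceByKey aggregates per key; join pairs up matching keys across RDDs.
pairs  = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = pairs.reduceByKey(lambda x, y: x + y)        # [("a", 4), ("b", 2)]
labels = sc.parallelize([("a", "alpha"), ("b", "beta")])
print(totals.join(labels).collect())                  # [("a", (4, "alpha")), ("b", (2, "beta"))]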
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()
messages.filter(_.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block of lines (Block 1-3), builds its partition of the cached messages RDD (Cache 1-3), and sends results back. Labels: Base RDD, Transformed RDD, Action.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
A General Stack
Spark (core), with specialized components on top:
> Spark Streaming (real-time)
> Spark SQL
> GraphX (graph)
> MLlib (machine learning)
> …
Powerful Stack – Agile Development
[Bar chart: non-test, non-example source lines (0-140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark. Spark's bar is the smallest, and successive builds of this slide highlight how little additional code Streaming, SparkSQL, and GraphX each add on top of the Spark core. Final build: "Your App?"]
Community Growth
[Bar chart: growth across Spark releases 0.6.0, 0.7.0, 0.8.0, 0.9.0, 1.0.0, 1.1.0 (y-axis 0-200).]
Not Just for In-Memory Data

              Hadoop Record   Spark 100TB    Spark 1PB
Data Size     102.5 TB        100 TB         1000 TB
Time          72 min          23 min         234 min
# Cores       50,400          6,592          6,080
Rate          1.42 TB/min     4.27 TB/min    4.27 TB/min
Environment   Dedicated       Cloud (EC2)    Cloud (EC2)
SQL Overview
• Newest component of Spark, initially contributed by Databricks (< 1 year old)
• Tightly integrated way to work with structured data (tables with rows/columns)
• Transform RDDs using SQL
• Data source integration: Hive, Parquet, JSON, and more
Relationship to Shark
Shark modified the Hive backend to run over Spark, but had two challenges:
> Limited integration with Spark programs
> Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark:
Borrows
• Hive data loading
• In-memory column store
Adds
• RDD-aware optimizer
• Rich language interfaces
Adding Schema to RDDs
Spark + RDDs: functional transformations on partitioned collections of opaque objects.
SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples.
[Diagram: on the RDD side, six opaque User objects; on the SchemaRDD side, the same data as rows of Name, Age, and Height columns.]
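
The contrast in code, as a hedged sketch (it assumes the people RDD of Rows and the registered table constructed on the following slides):

# Functional style: Spark only sees an opaque lambda over opaque objects.
people.filter(lambda p: p.age >= 13).count()

# Declarative style: Spark SQL sees the schema and the predicate structure,
# so its optimizer can help.
sqlCtx.sql("SELECT COUNT(*) FROM people WHERE age >= 13").collect()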
SchemaRDDs: More than SQL
[Diagram: a SchemaRDD at the center, providing a unified interface for structured data to SQL/HiveQL, Parquet, {JSON}, MLlib, and JDBC/ODBC clients.]
Image credit: http://barrymieny.deviantart.com/
Getting Started: Spark SQL
SQLContext/HiveContext
• Entry point for all SQL functionality
• Wraps/extends the existing SparkContext

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
Example Dataset
A text file filled with people's names and ages:

Michael, 30
Andy, 31
…
RDDs as Relations (Python)

# Load a text file and convert each line to a Row.
from pyspark.sql import Row
lines = sc.textFile("examples/…/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the SchemaRDD as a table.
peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")
RDDs as Relations (Scala)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
RDDs as Relations (Java)

import java.io.Serializable;
import org.apache.spark.api.java.function.Function;

public class Person implements Serializable {
  private String _name;
  private int _age;
  public String getName() { return _name; }
  public void setName(String name) { _name = name; }
  public int getAge() { return _age; }
  public void setAge(int age) { _age = age; }
}

// sc is an existing JavaSparkContext.
JavaSQLContext ctx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });
JavaSchemaRDD schemaPeople = ctx.applySchema(people, Person.class);
Querying Using SQL

# SQL can be run over SchemaRDDs that have been registered
# as a table.
teenagers = sqlCtx.sql("""
  SELECT name FROM people WHERE age >= 13 AND age <= 19""")

# The results of SQL queries are RDDs and support all the normal
# RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
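
To materialize the result, call an action on the RDD, as in this hedged continuation of the code above:

# collect() is an action: it runs the SQL query and the map, returning
# plain Python strings to the driver.
for name in teenNames.collect():
    print(name)   # one line per matching teenager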
Existing Tools, New Data Sources
Spark SQL includes a server that exposes its data using JDBC/ODBC
• Query data from HDFS/S3,
• including formats like Hive/Parquet/JSON*
• Support for caching data in-memory
* Coming in Spark 1.2
Caching Tables In-Memory
Spark SQL can cache tables using an in-memory columnar format:
• Scan only required columns
• Fewer allocated objects (less GC)
• Automatically selects best compression

cacheTable("people")
schemaRDD.cache()   (* requires Spark 1.2)
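
A hedged PySpark sketch of the workflow (assumes the people table registered earlier):

# Cache the table in columnar form; subsequent queries scan only the
# needed columns from memory.
sqlCtx.cacheTable("people")
sqlCtx.sql("SELECT AVG(age) FROM people").collect()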
Caching Comparison
[Diagram: Spark MEMORY_ONLY caching stores whole rows as deserialized User objects, each holding java.lang.String fields; SchemaRDD columnar caching packs the Name, Age, and Height columns into a few ByteBuffers instead of many per-row objects.]
Language Integrated UDFs

registerFunction("countMatches",
    lambda pattern, text: re.subn(pattern, '', text)[1])

sql("SELECT countMatches('a', text) …")
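
A fuller, hedged version of the same UDF; the documents table and its text column are hypothetical, and IntegerType is passed so the count comes back as a number rather than the default string:

import re
from pyspark.sql import IntegerType

# Count occurrences of `pattern` in `text` by deleting the matches and
# reading the substitution count that re.subn returns.
sqlCtx.registerFunction(
    "countMatches",
    lambda pattern, text: re.subn(pattern, '', text)[1],
    IntegerType())
sqlCtx.sql("SELECT countMatches('a', text) FROM documents").collect()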
SQL and Machine Learning

training_data_table = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u JOIN Events e
  ON u.userId = e.userId""")

def featurize(u):
    return LabeledPoint(u.action, [u.age, u.latitude, u.longitude])

# SQL results are RDDs so can be used directly in MLlib.
training_data = training_data_table.map(featurize)
model = LogisticRegressionWithSGD.train(training_data)
Machine Learning Pipelines

// training: {eventId: String, features: Vector, label: Int}
val training = parquetFile("/path/to/training")
val lr = new LogisticRegression().fit(training)

// event: {eventId: String, features: Vector}
val event = parquetFile("/path/to/event")
val prediction = lr.transform(event).select('eventId, 'prediction)
prediction.saveAsParquetFile("/path/to/prediction")
Reading Data Stored in Hive

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

hiveCtx.hql("""
  CREATE TABLE IF NOT EXISTS src (key INT, value STRING)""")

hiveCtx.hql("""
  LOAD DATA LOCAL INPATH 'examples/…/kv1.txt' INTO TABLE src""")

# Queries can be expressed in HiveQL.
results = hiveCtx.hql("FROM src SELECT key, value").collect()
Parquet Compatibility
Native support for reading data in Parquet:
• Columnar storage avoids reading unneeded data.
• RDDs can be written to Parquet files, preserving the schema.
• Convert other, slower formats into Parquet for repeated querying.
Using Parquet

# SchemaRDDs can be saved as Parquet files, maintaining the
# schema information.
peopleTable.saveAsParquetFile("people.parquet")

# Read in the Parquet file created above. Parquet files are
# self-describing so the schema is preserved. The result of
# loading a parquet file is also a SchemaRDD.
parquetFile = sqlCtx.parquetFile("people.parquet")

# Parquet files can be registered as tables and used in SQL.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("""
  SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")
{JSON} Support
• Use jsonFile or jsonRDD to convert a collection of JSON objects into a SchemaRDD
• Infers and unions the schema of each record
• Maintains nested structures and arrays
{JSON} Example

# Create a SchemaRDD from the file(s) pointed to by path.
people = sqlContext.jsonFile(path)

# Visualize the inferred schema with printSchema().
people.printSchema()
# root
#  |-- age: integer
#  |-- name: string

# Register this SchemaRDD as a table.
people.registerTempTable("people")
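
A hedged companion sketch using jsonRDD on in-memory JSON strings, showing that nested structures survive (the sample record is made up):

# jsonRDD infers a schema from an RDD of JSON strings; nested objects
# become nested fields that SQL can reach with dotted names.
anotherRDD = sc.parallelize(
    ['{"name": "Yin", "address": {"city": "Columbus", "state": "Ohio"}}'])
nested = sqlContext.jsonRDD(anotherRDD)
nested.registerTempTable("nested")
sqlContext.sql("SELECT address.city FROM nested").collect()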
Data Sources API
Allow easy integration with new sources of structured data:

CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "./episodes.avro")

https://github.com/databricks/spark-avro
Efficient Expression Evaluation
Interpreting expressions (e.g., 'a + b') can be very expensive on the JVM:
• Virtual function calls
• Branches based on expression type
• Object creation due to primitive boxing
• Memory consumption by boxed primitive objects
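
To make these costs concrete, here is a toy interpreted evaluator in Python mirroring the Add/Attribute tree on the next slide (a sketch only; Catalyst's real expression classes are Scala on the JVM):

# Every node evaluation is a dynamic dispatch, and each intermediate
# result is a fresh boxed object - the costs the bullets above describe.
class Attribute:
    def __init__(self, ordinal):
        self.ordinal = ordinal
    def eval(self, row):
        return row[self.ordinal]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        return self.left.eval(row) + self.right.eval(row)

expr = Add(Attribute(0), Attribute(1))   # the tree for "a + b"
print(expr.eval([3, 4]))                 # 7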
Interpreting "a + b"
1. Virtual call to Add.eval()
2. Virtual call to a.eval()
3. Return boxed Int
4. Virtual call to b.eval()
5. Return boxed Int
6. Integer addition
7. Return boxed result
[Diagram: expression tree with Add at the root and Attribute a, Attribute b as children.]
Using Runtime Reflection

def generateCode(e: Expression): Tree = e match {
  case Attribute(ordinal) =>
    q"inputRow.getInt($ordinal)"
  case Add(left, right) =>
    q"""
      {
        val leftResult = ${generateCode(left)}
        val rightResult = ${generateCode(right)}
        leftResult + rightResult
      }
    """
}
Code Generating "a + b"

val left: Int = inputRow.getInt(0)
val right: Int = inputRow.getInt(1)
val result: Int = left + right
resultRow.setInt(0, result)

• Fewer function calls
• No boxing of primitives
Performance Comparison
[Bar chart (milliseconds, 0-40): evaluating 'a+a+a' one billion times. Interpreted evaluation is by far the slowest; code generated with Scala reflection performs close to hand-written code.]
Performance vs. Shark
[Bar chart: runtimes (0-450) for Shark vs. Spark SQL on TPC-DS queries 34, 46, 59, 79, 19, 42, 52, 55, 63, 68, 73, 98, 27, 3, 43, 53, 7, and 89, grouped into Deep Analytics and Interactive Reporting.]
What's Coming in Spark 1.2?
• MLlib pipeline support for SchemaRDDs
• New APIs for developers to add external data sources
• Full support for Decimal and Date types
• Statistics for in-memory data
• Lots of bug fixes and improvements to Hive compatibility
Questions?