Intro to Spark 
and Spark SQL 
AMP Camp 2014 
Michael Armbrust - @michaelarmbrust
What is Apache Spark? 
Fast and general cluster computing system, 
interoperable with Hadoop, included in all major 
distros 
Improves efficiency through: 
> In-memory computing primitives 
> General computation graphs 
Improves usability through: 
> Rich APIs in Scala, Java, Python 
> Interactive shell 
Up to 100× faster 
(2-10× on disk) 
2-5× less code
Spark Model 
Write programs in terms of transformations on 
distributed datasets 
Resilient Distributed Datasets (RDDs) 
> Collections of objects that can be stored in memory 
or disk across a cluster 
> Parallel functional transformations (map, filter, …) 
> Automatically rebuilt on failure
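The RDD model above maps directly onto familiar functional collection operations. A plain-Python analogy (ordinary iterators, not Spark — only the transformation/action vocabulary is borrowed; in Python 3, filter and map are lazy, which loosely mirrors how RDD transformations defer work until an action runs):

```python
log_lines = ["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"]

# Transformations (filter, map) describe a derived dataset lazily...
errors = filter(lambda line: line.startswith("ERROR"), log_lines)
messages = map(lambda line: line.split("\t")[2], errors)

# ...and an "action" (here, materializing and counting) produces a result.
results = list(messages)
print(results)       # ['full', 'down']
print(len(results))  # 2
```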
More than Map & Reduce 
map 
filter 
groupBy 
sort 
union 
join 
leftOuterJoin 
rightOuterJoin 
reduce 
count 
fold 
reduceByKey 
groupByKey 
cogroup 
cross 
zip 
sample 
take 
first 
partitionBy 
mapWith 
pipe 
save 
...
Example: Log Mining 
Load error messages from a log into memory, 
then interactively search for various patterns 

val lines = spark.textFile("hdfs://...") 
val errors = lines.filter(_ startsWith "ERROR") 
val messages = errors.map(_.split("\t")(2)) 
messages.cache() 

messages.filter(_ contains "foo").count() 
messages.filter(_ contains "bar").count() 
. . . 

[Diagram: the driver sends tasks to three workers; each worker reads one block of lines from storage, caches its partition of messages (Cache 1-3), and returns results to the driver. lines is the base RDD; errors and messages are transformed RDDs; count() is an action.] 

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
A General Stack 
Spark (core) 
> Spark SQL 
> Spark Streaming (real-time) 
> GraphX (graph) 
> MLlib (machine learning) 
> …
Powerful Stack – Agile Development 
[Bar chart: non-test, non-example source lines, 0-140,000. Hadoop MapReduce, Storm (Streaming), Impala (SQL), and Giraph (Graph) are each large standalone codebases; Spark's Streaming, SparkSQL, and GraphX components are each built as a small layer on top of the Spark core. Your App?]
Community Growth 
[Bar chart, 0-200, rising steadily across releases 0.6.0, 0.7.0, 0.8.0, 0.9.0, 1.0.0, 1.1.0]
Not Just for In-Memory Data 

             Hadoop Record   Spark 100TB   Spark 1PB 
Data Size    102.5 TB        100 TB        1000 TB 
Time         72 min          23 min        234 min 
# Cores      50,400          6,592         6,080 
Rate         1.42 TB/min     4.27 TB/min   4.27 TB/min 
Environment  Dedicated       Cloud (EC2)   Cloud (EC2)
SQL Overview 
• Newest component of Spark, initially 
contributed by Databricks (< 1 year old) 
• Tightly integrated way to work with structured 
data (tables with rows/columns) 
• Transform RDDs using SQL 
• Data source integration: Hive, Parquet, JSON, 
and more
Relationship to Shark 
Shark modified the Hive backend to run over 
Spark, but had two challenges: 
> Limited integration with Spark programs 
> Hive optimizer not designed for Spark 
Spark SQL reuses the best parts of Shark: 
Borrows 
• Hive data loading 
• In-memory column store 
Adds 
• RDD-aware optimizer 
• Rich language interfaces
Adding Schema to RDDs 
Spark + RDDs 
Functional transformations on 
partitioned collections of opaque 
objects. 

SQL + SchemaRDDs 
Declarative transformations on 
partitioned collections of tuples. 

[Diagram: RDDs hold opaque User objects; SchemaRDDs hold tuples with named columns (Name, Age, Height)]
SchemaRDDs: More than SQL 
Unified interface for structured data: 
SQL, HiveQL, Parquet, {JSON}, MLlib, JDBC/ODBC 
Image credit: http://barrymieny.deviantart.com/
Getting Started: Spark SQL 
SQLContext/HiveContext 
• Entry point for all SQL functionality 
• Wraps/extends existing SparkContext 

from pyspark.sql import SQLContext 
sqlCtx = SQLContext(sc)
Example Dataset 
A text file filled with people's names and ages: 

Michael, 30 
Andy, 31 
…
RDDs as Relations (Python) 
from pyspark.sql import Row 

# Load a text file and convert each line to a Row. 
lines = sc.textFile("examples/…/people.txt") 
parts = lines.map(lambda l: l.split(",")) 
people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) 

# Infer the schema, and register the SchemaRDD as a table. 
peopleTable = sqlCtx.inferSchema(people) 
peopleTable.registerAsTable("people")
RDDs as Relations (Scala) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext._ 

// Define the schema using a case class. 
case class Person(name: String, age: Int) 

// Create an RDD of Person objects and register it as a table. 
val people = sc.textFile("examples/src/main/resources/people.txt") 
  .map(_.split(",")) 
  .map(p => Person(p(0), p(1).trim.toInt)) 
people.registerAsTable("people")
RDDs as Relations (Java) 
public class Person implements Serializable { 
  private String _name; 
  private int _age; 

  public String getName() { return _name; } 
  public void setName(String name) { _name = name; } 
  public int getAge() { return _age; } 
  public void setAge(int age) { _age = age; } 
} 

JavaSQLContext ctx = new org.apache.spark.sql.api.java.JavaSQLContext(sc); 

JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map( 
  new Function<String, Person>() { 
    public Person call(String line) throws Exception { 
      String[] parts = line.split(","); 
      Person person = new Person(); 
      person.setName(parts[0]); 
      person.setAge(Integer.parseInt(parts[1].trim())); 
      return person; 
    } 
  }); 

JavaSchemaRDD schemaPeople = ctx.applySchema(people, Person.class);
Querying Using SQL 
# SQL can be run over SchemaRDDs that have been registered 
# as a table. 
teenagers = sqlCtx.sql(""" 
  SELECT name FROM people WHERE age >= 13 AND age <= 19""") 

# The results of SQL queries are RDDs and support all the normal 
# RDD operations. 
teenNames = teenagers.map(lambda p: "Name: " + p.name)
Existing Tools, New Data Sources 
Spark SQL includes a server that exposes its data 
using JDBC/ODBC 
• Query data from HDFS/S3, including formats 
like Hive/Parquet/JSON* 
• Support for caching data in-memory 
* Coming in Spark 1.2
Caching Tables In-Memory 
Spark SQL can cache tables using an in-memory 
columnar format: 
• Scan only required columns 
• Fewer allocated objects (less GC) 
• Automatically selects best compression 
cacheTable("people") 
schemaRDD.cache() – *Requires Spark 1.2
Caching Comparison 

Spark MEMORY_ONLY Caching 
[Diagram: each record is cached as a full User object graph, with one 
java.lang.String per field] 

SchemaRDD Columnar Caching 
[Diagram: values are packed column-by-column (Name, Age, Height) into a 
few large ByteBuffers instead of per-record objects]
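The payoff of the columnar layout — scanning only required columns — can be illustrated with plain Python lists standing in for ByteBuffers (a toy model, not Spark's column store; the sample data is made up):

```python
# Row layout: one record object per row, all fields materialized together.
rows = [("Michael", 30, 1.75), ("Andy", 31, 1.80), ("Justin", 19, 1.70)]

# Columnar layout: one compact array per column.
names   = [r[0] for r in rows]
ages    = [r[1] for r in rows]
heights = [r[2] for r in rows]

# A query touching only "age" scans one array and never materializes
# names or heights -- the point of the columnar cache.
print(sum(ages))  # 80
```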
Language Integrated UDFs 
registerFunction("countMatches", 
  lambda pattern, text: re.subn(pattern, '', text)[1]) 

sql("SELECT countMatches('a', text)…")
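The UDF above counts pattern occurrences via re.subn, which returns (new_string, number_of_substitutions); index [1] is the count. A standalone sketch of just that function (plain Python, no Spark):

```python
import re

# re.subn replaces every match and reports how many substitutions it made,
# so [1] gives the number of occurrences of the pattern in the text.
count_matches = lambda pattern, text: re.subn(pattern, '', text)[1]

print(count_matches('a', 'banana'))  # 3
```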
SQL and Machine Learning 
training_data_table = sql(""" 
  SELECT e.action, u.age, u.latitude, u.longitude 
  FROM Users u 
  JOIN Events e ON u.userId = e.userId""") 

def featurize(u): 
  return LabeledPoint(u.action, [u.age, u.latitude, u.longitude]) 

# SQL results are RDDs so can be used directly in MLlib. 
training_data = training_data_table.map(featurize) 
model = LogisticRegressionWithSGD.train(training_data)
Machine Learning Pipelines 
// training: {eventId: String, features: Vector, label: Int} 
val training = parquetFile("/path/to/training") 
val lr = new LogisticRegression().fit(training) 

// event: {eventId: String, features: Vector} 
val event = parquetFile("/path/to/event") 
val prediction = lr.transform(event).select('eventId, 'prediction) 

prediction.saveAsParquetFile("/path/to/prediction")
Reading Data Stored in Hive 
from pyspark.sql import HiveContext 
hiveCtx = HiveContext(sc) 

hiveCtx.hql(""" 
  CREATE TABLE IF NOT EXISTS src (key INT, value STRING)""") 

hiveCtx.hql(""" 
  LOAD DATA LOCAL INPATH 'examples/…/kv1.txt' INTO TABLE src""") 

# Queries can be expressed in HiveQL. 
results = hiveCtx.hql("FROM src SELECT key, value").collect()
Parquet Compatibility 
Native support for reading data in Parquet: 
• Columnar storage avoids reading unneeded data. 
• RDDs can be written to parquet files, preserving the 
schema. 
• Convert other, slower formats into Parquet for repeated 
querying.
Using Parquet 
# SchemaRDDs can be saved as Parquet files, maintaining the 
# schema information. 
peopleTable.saveAsParquetFile("people.parquet") 

# Read in the Parquet file created above. Parquet files are 
# self-describing so the schema is preserved. The result of 
# loading a parquet file is also a SchemaRDD. 
parquetFile = sqlCtx.parquetFile("people.parquet") 

# Parquet files can be registered as tables used in SQL. 
parquetFile.registerAsTable("parquetFile") 
teenagers = sqlCtx.sql(""" 
  SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")
{JSON} Support 
• Use jsonFile or jsonRDD to convert a 
collection of JSON objects into a SchemaRDD 
• Infers and unions the schema of each record 
• Maintains nested structures and arrays
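The "infers and unions" step can be illustrated with a toy sketch in plain Python (not the Spark implementation): treat each record's "schema" as a field-name-to-type mapping and merge them across records.

```python
def infer_schema(record):
    # A record's "schema" here is just field name -> Python type name.
    return {field: type(value).__name__ for field, value in record.items()}

def union_schemas(records):
    # Union the field sets across records, loosely mimicking how jsonRDD
    # merges per-record schemas into one table schema.
    merged = {}
    for record in records:
        merged.update(infer_schema(record))
    return merged

records = [{"name": "Michael", "age": 30}, {"name": "Andy", "height": 1.8}]
print(union_schemas(records))  # {'name': 'str', 'age': 'int', 'height': 'float'}
```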
{JSON} Example 
# Create a SchemaRDD from the file(s) pointed to by path 
people = sqlContext.jsonFile(path) 

# Visualize the inferred schema with printSchema(). 
people.printSchema() 
# root 
#  |-- age: integer 
#  |-- name: string 

# Register this SchemaRDD as a table. 
people.registerTempTable("people")
Data Sources API 
Allow easy integration with new sources of 
structured data: 

CREATE TEMPORARY TABLE episodes 
USING com.databricks.spark.avro 
OPTIONS ( 
  path "./episodes.avro" 
) 

https://github.com/databricks/spark-avro
Efficient Expression Evaluation 
Interpreting expressions (e.g., 'a + b') can be very 
expensive on the JVM: 
• Virtual function calls 
• Branches based on expression type 
• Object creation due to primitive boxing 
• Memory consumption by boxed primitive 
objects
Interpreting "a + b" 
1. Virtual call to Add.eval() 
2. Virtual call to a.eval() 
3. Return boxed Int 
4. Virtual call to b.eval() 
5. Return boxed Int 
6. Integer addition 
7. Return boxed result 

[Expression tree: Add with children Attribute a and Attribute b]
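The per-step costs above come from walking an object tree at evaluation time. A tiny tree-walking interpreter makes this concrete (a plain-Python sketch; the class names Add/Attribute follow the slide, everything else is illustrative):

```python
class Attribute:
    def __init__(self, ordinal):
        self.ordinal = ordinal
    def eval(self, input_row):
        # One dynamic dispatch per leaf, returning a (boxed) value.
        return input_row[self.ordinal]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, input_row):
        # Two more dynamic dispatches per Add node, plus a result allocation.
        return self.left.eval(input_row) + self.right.eval(input_row)

expr = Add(Attribute(0), Attribute(1))  # the tree for "a + b"
print(expr.eval([3, 4]))  # 7
```

Every row evaluated pays the full tree walk again, which is exactly the overhead code generation removes.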
Using Runtime Reflection 
def generateCode(e: Expression): Tree = e match { 
  case Attribute(ordinal) => 
    q"inputRow.getInt($ordinal)" 
  case Add(left, right) => 
    q""" 
      { 
        val leftResult = ${generateCode(left)} 
        val rightResult = ${generateCode(right)} 
        leftResult + rightResult 
      } 
    """ 
}
Code Generating "a + b" 
val left: Int = inputRow.getInt(0) 
val right: Int = inputRow.getInt(1) 
val result: Int = left + right 
resultRow.setInt(0, result) 

• Fewer function calls 
• No boxing of primitives
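The same idea can be sketched in plain Python: walk the expression tree once, emit flat source, and compile it so per-row evaluation never touches the tree again (a toy analogy of the slide's Scala-reflection approach, not Spark's actual generator; the tuple encoding of the tree is an assumption for the sketch):

```python
def generate_code(expr):
    # expr is a nested tuple: ("attr", ordinal) or ("add", left, right).
    if expr[0] == "attr":
        return f"input_row[{expr[1]}]"
    if expr[0] == "add":
        return f"({generate_code(expr[1])} + {generate_code(expr[2])})"
    raise ValueError(expr)

tree = ("add", ("attr", 0), ("attr", 1))        # the tree for "a + b"
source = generate_code(tree)                    # "(input_row[0] + input_row[1])"
evaluate = eval(f"lambda input_row: {source}")  # compiled once, reused per row
print(evaluate([3, 4]))  # 7
```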
Performance Comparison 
[Bar chart, milliseconds (0-40) to evaluate 'a+a+a' one billion times: 
interpreted evaluation vs. hand-written code vs. code generated with Scala 
reflection; the generated code performs comparably to the hand-written version]
Performance vs. Shark 
[Bar chart (0-450): runtimes for Shark and Spark SQL across TPC-DS queries 
spanning deep analytics and interactive reporting; query IDs 34, 46, 59, 79, 
19, 42, 52, 55, 63, 68, 73, 98, 27, 3, 43, 53, 7, 89]
What’s Coming in Spark 1.2? 
• MLlib pipeline support for SchemaRDDs 
• New APIs for developers to add external data 
sources 
• Full support for Decimal and Date types. 
• Statistics for in-memory data 
• Lots of bug fixes and improvements to Hive 
compatibility
Questions?
EY_Graph Database Powered SustainabilityNeo4j
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 

Recently uploaded (20)

A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 

Intro to Spark and Spark SQL

  • 1. Intro to Spark and Spark SQL AMP Camp 2014 Michael Armbrust - @michaelarmbrust
  • 2. What is Apache Spark? Fast and general cluster computing system, interoperable with Hadoop, included in all major distros Improves efficiency through: > In-memory computing primitives > General computation graphs Improves usability through: > Rich APIs in Scala, Java, Python > Interactive shell Up to 100× faster (2-10× on disk) 2-5× less code
  • 3. Spark Model Write programs in terms of transformations on distributed datasets Resilient Distributed Datasets (RDDs) > Collections of objects that can be stored in memory or disk across a cluster > Parallel functional transformations (map, filter, …) > Automatically rebuilt on failure
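The RDD model above can be illustrated with a toy sketch in plain Python. This is not Spark's implementation: the `ToyRDD` class and its lineage tuple are invented here purely to show how lazily recorded transformations let a dataset be recomputed from its source.

```python
# A toy illustration (not Spark's code) of the RDD idea: transformations
# are recorded lazily as a lineage chain, and an action replays that
# chain over the source data, which is also how a lost partition could
# be rebuilt after a failure.

class ToyRDD:
    def __init__(self, source, ops=()):
        self.source = source  # base collection (stands in for a file/partition)
        self.ops = ops        # recorded lineage of transformations

    def map(self, f):
        return ToyRDD(self.source, self.ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.source, self.ops + (("filter", f),))

    def collect(self):
        # "Actions" replay the recorded lineage over the source.
        data = iter(self.source)
        for kind, f in self.ops:
            data = map(f, data) if kind == "map" else filter(f, data)
        return list(data)

nums = ToyRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

Note that `filter` and `map` here build new `ToyRDD` objects without touching the data; nothing runs until `collect()` is called, mirroring Spark's lazy evaluation.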
  • 4. More than Map & Reduce map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
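Two of the pairwise operations in the list above can be sketched in plain Python to show what they compute on an RDD of (key, value) pairs. These helper functions are illustrations written for this note, not Spark's implementation.

```python
# Plain-Python sketches of groupByKey and reduceByKey semantics on a
# local list of (key, value) pairs. In Spark these run per partition
# across a cluster; reduceByKey additionally combines values locally
# before shuffling, which groupByKey cannot do.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, f):
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

print(group_by_key(pairs))                        # {'a': [1, 3, 5], 'b': [2, 4]}
print(reduce_by_key(pairs, lambda x, y: x + y))   # {'a': 9, 'b': 6}
```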
  • 5. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns. val lines = spark.textFile("hdfs://...") val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.split("\t")(2)) messages.cache() [diagram: Driver ships tasks to three Workers; each Worker caches a messages partition (Cache 1-3) built from lines Blocks 1-3; labels: Base RDD, Transformed RDD, Action] messages.filter(_.contains("foo")).count() messages.filter(_.contains("bar")).count() . . . Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
  • 6. A General Stack [diagram: Spark core, with Spark Streaming (real-time), Spark SQL, GraphX (graph), MLlib (machine learning), … on top]
  • 7-11. Powerful Stack – Agile Development [bar chart, repeated across five build slides: non-test, non-example source lines (0-140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark; successive builds highlight Streaming, SparkSQL, GraphX, and "Your App?" as small additions to the Spark bar]
  • 12. Community Growth [bar chart, y-axis 0-200, across releases 0.6.0, 0.7.0, 0.8.0, 0.9.0, 1.0.0, 1.1.0]
  • 13. Not Just for In-Memory Data. Hadoop Record / Spark 100TB / Spark 1PB: Data Size 102.5 TB / 100 TB / 1000 TB; Time 72 min / 23 min / 234 min; # Cores 50400 / 6592 / 6080; Rate 1.42 TB/min / 4.27 TB/min / 4.27 TB/min; Environment Dedicated / Cloud (EC2) / Cloud (EC2)
  • 14. SQL Overview • Newest component of Spark, initially contributed by Databricks (< 1 year old) • Tightly integrated way to work with structured data (tables with rows/columns) • Transform RDDs using SQL • Data source integration: Hive, Parquet, JSON, and more
  • 15. Relationship to Shark. Shark modified the Hive backend to run over Spark, but had two challenges: > Limited integration with Spark programs > Hive optimizer not designed for Spark. Spark SQL reuses the best parts of Shark: Borrows • Hive data loading • In-memory column store Adds • RDD-aware optimizer • Rich language interfaces
  • 16. Adding Schema to RDDs. Spark + RDDs: functional transformations on partitioned collections of opaque objects. SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples. [diagram: opaque User objects vs. rows with Name/Age/Height columns]
  • 17. SchemaRDDs: More than SQL [diagram: SchemaRDD as a unified interface for structured data, connecting SQL, {JSON}, Parquet, MLlib, JDBC, and ODBC] Image credit: http://barrymieny.deviantart.com/
  • 18. Getting Started: Spark SQL SQLContext/HiveContext • Entry point for all SQL functionality • Wraps/extends the existing SparkContext from pyspark.sql import SQLContext sqlCtx = SQLContext(sc)
  • 19. Example Dataset A text file filled with people’s names and ages: Michael, 30 Andy, 31 …
  • 20. RDDs as Relations (Python) from pyspark.sql import Row # Load a text file and convert each line to a Row. lines = sc.textFile("examples/…/people.txt") parts = lines.map(lambda l: l.split(",")) people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) # Infer the schema, and register the SchemaRDD as a table peopleTable = sqlCtx.inferSchema(people) peopleTable.registerAsTable("people")
  • 21. RDDs as Relations (Scala) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ // Define the schema using a case class. case class Person(name: String, age: Int) // Create an RDD of Person objects and register it as a table. val people = sc.textFile("examples/src/main/resources/people.txt") .map(_.split(",")) .map(p => Person(p(0), p(1).trim.toInt)) people.registerAsTable("people")
  • 22. RDDs as Relations (Java) public class Person implements Serializable { private String _name; private int _age; public String getName() { return _name; } public void setName(String name) { _name = name; } public int getAge() { return _age; } public void setAge(int age) { _age = age; } } JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc); JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map( new Function<String, Person>() { public Person call(String line) throws Exception { String[] parts = line.split(","); Person person = new Person(); person.setName(parts[0]); person.setAge(Integer.parseInt(parts[1].trim())); return person; } }); JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
  • 23. Querying Using SQL # SQL can be run over SchemaRDDs that have been registered # as a table. teenagers = sqlCtx.sql(""" SELECT name FROM people WHERE age >= 13 AND age <= 19""") # The results of SQL queries are RDDs and support all the normal # RDD operations. teenNames = teenagers.map(lambda p: "Name: " + p.name)
  • 24. Existing Tools, New Data Sources Spark SQL includes a server that exposes its data using JDBC/ODBC • Query data from HDFS/S3, • Including formats like Hive/Parquet/JSON* • Support for caching data in-memory * Coming in Spark 1.2
  • 25. Caching Tables In-Memory Spark SQL can cache tables using an in-memory columnar format: • Scan only required columns • Fewer allocated objects (less GC) • Automatically selects best compression cacheTable("people") schemaRDD.cache() – *Requires Spark 1.2
  • 26. Caching Comparison [diagram. Spark MEMORY_ONLY caching: one User object per record, each holding boxed java.lang.String fields. SchemaRDD columnar caching: Name/Age/Height values packed per column into a few ByteBuffers]
  • 27. Language Integrated UDFs registerFunction("countMatches", lambda pattern, text: re.subn(pattern, '', text)[1]) sql("SELECT countMatches('a', text)…")
  • 28. SQL and Machine Learning training_data_table = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") def featurize(u): return LabeledPoint(u.action, [u.age, u.latitude, u.longitude]) # SQL results are RDDs so can be used directly in MLlib. training_data = training_data_table.map(featurize) model = LogisticRegressionWithSGD.train(training_data)
  • 29. Machine Learning Pipelines // training: {eventId: String, features: Vector, label: Int} val training = parquetFile("/path/to/training") val lr = new LogisticRegression().fit(training) // event: {eventId: String, features: Vector} val event = parquetFile("/path/to/event") val prediction = lr.transform(event).select('eventId, 'prediction) prediction.saveAsParquetFile("/path/to/prediction")
  • 30. Reading Data Stored in Hive from pyspark.sql import HiveContext hiveCtx = HiveContext(sc) hiveCtx.hql(""" CREATE TABLE IF NOT EXISTS src (key INT, value STRING)""") hiveCtx.hql(""" LOAD DATA LOCAL INPATH 'examples/…/kv1.txt' INTO TABLE src""") # Queries can be expressed in HiveQL. results = hiveCtx.hql("FROM src SELECT key, value").collect()
  • 31. Parquet Compatibility Native support for reading data in Parquet: • Columnar storage avoids reading unneeded data. • RDDs can be written to parquet files, preserving the schema. • Convert other slower formats into Parquet for repeated querying
  • 32. Using Parquet # SchemaRDDs can be saved as Parquet files, maintaining the # schema information. peopleTable.saveAsParquetFile("people.parquet") # Read in the Parquet file created above. Parquet files are # self-describing so the schema is preserved. The result of # loading a parquet file is also a SchemaRDD. parquetFile = sqlCtx.parquetFile("people.parquet") # Parquet files can be registered as tables used in SQL. parquetFile.registerAsTable("parquetFile") teenagers = sqlCtx.sql(""" SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")
  • 33. {JSON} Support • Use jsonFile or jsonRDD to convert a collection of JSON objects into a SchemaRDD • Infers and unions the schema of each record • Maintains nested structures and arrays
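The infer-and-union behavior described above can be sketched in plain Python. The `infer` and `union_schemas` helpers are invented for this illustration (Spark's actual inference also widens conflicting numeric types and handles arrays, elided here).

```python
# Rough sketch of per-record JSON schema inference plus a union across
# records: fields seen in any record appear in the final schema, and
# nested structures are kept (flattened here with dotted names).
import json

def infer(record, prefix=""):
    fields = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            fields.update(infer(value, name + "."))  # nested structure
        elif isinstance(value, bool):                # bool before int: bool is an int subtype
            fields[name] = "boolean"
        elif isinstance(value, int):
            fields[name] = "integer"
        elif isinstance(value, float):
            fields[name] = "double"
        else:
            fields[name] = "string"
    return fields

def union_schemas(records):
    schema = {}
    for rec in records:
        for name, typ in infer(rec).items():
            # A field missing from some records is simply nullable;
            # real type-conflict resolution is elided.
            schema.setdefault(name, typ)
    return schema

lines = ['{"name": "Michael", "age": 30}',
         '{"name": "Andy", "address": {"city": "Berkeley"}}']
schema = union_schemas(json.loads(l) for l in lines)
print(schema)  # {'name': 'string', 'age': 'integer', 'address.city': 'string'}
```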
  • 34. {JSON} Example # Create a SchemaRDD from the file(s) pointed to by path people = sqlContext.jsonFile(path) # Visualize the inferred schema with printSchema(). people.printSchema() # root # |-- age: integer # |-- name: string # Register this SchemaRDD as a table. people.registerTempTable("people")
  • 35. Data Sources API Allow easy integration with new sources of structured data: CREATE TEMPORARY TABLE episodes USING com.databricks.spark.avro OPTIONS ( path "./episodes.avro" ) https://github.com/databricks/spark-avro
  • 36. Efficient Expression Evaluation Interpreting expressions (e.g., 'a + b') can be very expensive on the JVM: • Virtual function calls • Branches based on expression type • Object creation due to primitive boxing • Memory consumption by boxed primitive objects
  • 37. Interpreting "a + b" [expression tree: Add with children Attribute a and Attribute b] 1. Virtual call to Add.eval() 2. Virtual call to a.eval() 3. Return boxed Int 4. Virtual call to b.eval() 5. Return boxed Int 6. Integer addition 7. Return boxed result
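A minimal tree-walking interpreter makes those steps concrete. This Python sketch is illustrative only (Spark's interpreter is JVM code); the `Attribute` and `Add` classes mirror the expression nodes named on the slide.

```python
# Toy interpreter for the expression tree Add(Attribute(0), Attribute(1)),
# i.e. "a + b". Every row evaluated pays dynamic dispatch into eval()
# on each node, which is the per-row overhead the slide enumerates.

class Attribute:
    def __init__(self, ordinal):
        self.ordinal = ordinal
    def eval(self, row):
        return row[self.ordinal]  # steps 2-5: fetch each operand

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        # step 1: dispatch into Add.eval, then into each child's eval
        return self.left.eval(row) + self.right.eval(row)

expr = Add(Attribute(0), Attribute(1))  # tree for "a + b"
print(expr.eval([3, 4]))  # 7
```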
  • 38. Using Runtime Reflection def generateCode(e: Expression): Tree = e match { case Attribute(ordinal) => q"inputRow.getInt($ordinal)" case Add(left, right) => q""" { val leftResult = ${generateCode(left)} val rightResult = ${generateCode(right)} leftResult + rightResult } """ }
  • 39. Code Generating “a + b” val left: Int = inputRow.getInt(0) val right: Int = inputRow.getInt(1) val result: Int = left + right resultRow.setInt(0, result) • Fewer function calls • No boxing of primitives
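The same code-generation idea can be sketched in Python: walk the tree once to emit flat source text, compile it once, and reuse the compiled function for every row. The `generate_code` helper here is an invented analogue of the Scala quasiquote version on slide 38, not Spark's code.

```python
# Compile the expression tree to a flat function once, instead of
# walking the tree per row. No per-row dispatch or tree traversal
# remains in the generated code.

class Attribute:
    def __init__(self, ordinal):
        self.ordinal = ordinal

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

def generate_code(e):
    if isinstance(e, Attribute):
        return "row[%d]" % e.ordinal
    if isinstance(e, Add):
        return "(%s + %s)" % (generate_code(e.left), generate_code(e.right))
    raise TypeError(e)

expr = Add(Attribute(0), Attribute(1))
src = "lambda row: " + generate_code(expr)
compiled = eval(src)        # one compilation, reused for every row
print(src)                  # lambda row: (row[0] + row[1])
print(compiled([3, 4]))     # 7
```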
  • 40. Performance Comparison [bar chart, milliseconds to evaluate 'a+a+a' one billion times: interpreted evaluation vs. hand-written code vs. code generated with Scala reflection]
  • 41. Performance vs. Deep Analytics [bar chart: runtimes of TPC-DS interactive and reporting queries, Shark vs. Spark SQL]
  • 42. What’s Coming in Spark 1.2? • MLlib pipeline support for SchemaRDDs • New APIs for developers to add external data sources • Full support for Decimal and Date types. • Statistics for in-memory data • Lots of bug fixes and improvements to Hive compatibility