Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit Amsterdam 2015 - October 28th
Graduated
from Alpha
in 1.3
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
About Me and Spark SQL
2
[Charts: # of commits per month and # of contributors to Spark SQL]
3
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
About Me and Spark SQL
Improved multi-version support in 1.4
4
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
5
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
• @michaelarmbrust
• Creator of Spark SQL @databricks
6
About Me and Spark SQL
The not-so-secret truth...
7
Spark SQL is about more than SQL.
8
Create and run Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
9
1. A distributed collection of rows organized into
named columns.
2. An abstraction for selecting, filtering, aggregating
and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
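A minimal sketch of creating and querying such a DataFrame in PySpark (the data and column names are made up for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "example")
sqlContext = SQLContext(sc)

# A distributed collection of rows with named columns.
df = sqlContext.createDataFrame(
    [("Alice", 34), ("Bob", 45)],   # rows
    ["name", "age"])                # column names

# Select and filter by column, as in R or Pandas.
df.filter(df.age > 40).select("name").show()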
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = (sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json"))

(df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData"))
10
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = (sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json"))

(df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData"))
read and write  
functions create
new builders for
doing I/O
11
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = (sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json"))

(df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData"))
12
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…)  
finish the I/O
specification
df = (sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json"))

(df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData"))
13
Read Less Data: Efficient Formats
Columnar formats such as Parquet and ORC are efficient storage formats:
• Compact binary encoding with intelligent compression (delta, RLE, etc.)
• Each column is stored separately with an index that allows skipping of unread columns
• Support for partitioning (/data/year=2015)
• Data skipping using statistics (column min/max, etc.)
14
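A sketch of how this pays off in practice; the paths and column names here are assumptions, not from the talk:

# Write partitioned by year; the layout on disk becomes /data/events/year=2015/...
df.write.format("parquet").partitionBy("year").save("/data/events")

# Filtering on the partition column lets Spark skip whole directories, and
# selecting only two columns avoids reading the remaining columns entirely.
events_2015 = (sqlContext.read.parquet("/data/events")
  .where("year = 2015")
  .select("user_id", "timestamp"))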
Write Less Code: Data Source API
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
15
{  JSON }
Built-In External
JDBC
and more…
Find more sources at http://spark-packages.org/
ORC, plain text*
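For example, reading CSV through the external spark-csv package (a sketch; it assumes the com.databricks:spark-csv package has been added to the classpath, e.g. via spark-submit --packages):

df = (sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        # first line contains column names
  .option("inferSchema", "true")   # sample the file to infer column types
  .load("/path/to/file.csv"))      # hypothetical path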
ETL Using Custom Data Sources
sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")
16
Load data from JIRA (Spark’s Bug Tracker) using a custom data source.
Write the converted data out to a Parquet table stored in the Hive metastore.
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions (a sketch follows the list):
• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Plotting results with Pandas
17
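A hedged sketch combining these operations; the table, file path, and column names are assumptions for illustration:

from pyspark.sql.functions import avg

users  = sqlContext.table("users")             # e.g. an existing Hive table
events = sqlContext.read.json("/data/events")  # e.g. a JSON file

per_user = (events
  .join(users, events.user_id == users.user_id)    # join different sources
  .select(users.name, events.duration)             # select columns
  .groupBy("name")
  .agg(avg("duration").alias("avg_duration")))     # aggregate

# toPandas() collects the (small) aggregated result to the driver for plotting.
per_user.toPandas().plot(x="name", y="avg_duration", kind="bar")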
Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
18
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .map(lambda …) \
    .collect()
Full API Docs
• Python
• Scala
• Java
• R
19
Using SQL
SELECT name,  avg(age)
FROM people
GROUP BY  name
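The SQL version runs on the same engine; any DataFrame can be exposed to SQL by registering it as a temporary table (a minimal sketch, file path assumed):

df = sqlCtx.read.json("/data/people.json")
df.registerTempTable("people")

sqlCtx.sql("SELECT name, avg(age) FROM people GROUP BY name").show()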
Not Just Less Code, Faster Too!
20
[Chart: Time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Plan Optimization & Execution
21
[Diagram: the Catalyst pipeline]
SQL AST or DataFrame → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model selects one) → Selected Physical Plan → (Code Generation) → RDDs
DataFrames and SQL share the same optimization/execution pipeline
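These stages can be inspected on any DataFrame; a minimal sketch (table and column names assumed):

df = sqlCtx.table("people").where("age > 30").select("name")

# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan selected by the pipeline above.
df.explain(True)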
Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")
    return (events
        .join(u, events.user_id == u.user_id)
        .withColumn("city", zipToCity(u.zip)))

Augments any DataFrame that contains user_id
22
Optimize Full Pipelines
Optimization happens as late as possible, therefore Spark SQL can optimize even across functions.
23
events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = (events
  .where(events.city == "Amsterdam")
  .select(events.timestamp)
  .collect())
24
def add_demographics(events):
    u = sqlCtx.table("users")                       # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)       # Join on user_id
        .withColumn("city", zipToCity(u.zip)))      # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

Logical Plan: filter by city over join(events file, users table). The join is expensive; ideally only the relevant users would be joined.
Physical Plan: join(scan(events), filter by city(scan(users)))
25
def add_demographics(events):
    u = sqlCtx.table("users")                       # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)       # Join on user_id
        .withColumn("city", zipToCity(u.zip)))      # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

Logical Plan: filter by city over join(events file, users table)
Physical Plan: join(scan(events), filter by city(scan(users)))
Optimized Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users))
Machine Learning Pipelines
26
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: Pipeline Model, showing data flowing ds0 → tokenizer → ds1 → hashingTF → ds2 → lr.model → ds3]
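The fitted pipeline model is itself a transformer; a sketch of applying it to new data (the test path is hypothetical):

# Each stage adds its output column; the fitted logistic regression model
# appends a "prediction" column to the resulting DataFrame.
test_df = sqlCtx.read.json("/path/to/test_data")
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()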
• 100+ native functions with
optimized codegen
implementations
– String manipulation – concat,  
format_string,  lower,  lpad
– Date/Time – current_timestamp,  
date_format,  date_add,  …
– Math – sqrt,  randn,   …
– Other –
monotonicallyIncreasingId,  
sparkPartitionId,   …
27
Rich Function Library
from pyspark.sql.functions import *
yesterday  = date_sub(current_date(),  1)
df2  = df.filter(df.created_at > yesterday)
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(),  1)
val df2 = df.filter(df("created_at")  > yesterday)
Added in Spark 1.5
Optimized Execution with
Project Tungsten
Compact encoding, cache-aware algorithms,
runtime code generation
28
The overheads of JVM objects
“abcd”
29
• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes
java.lang.String object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 4 (object header) ...
4 4 (object header) ...
8 4 (object header) ...
12 4 char[] String.value []
16 4 int String.hash 0
20 4 int String.hash32 0
Instance size: 24 bytes (reported by Instrumentation API)
12 byte object header
8 byte hashcode
20 bytes data + overhead
Tungsten’s Compact Encoding
30
The row (123, “data”, “bricks”) is encoded as:
0x0 | 123 | 32L | 48L | 4 “data” | 6 “bricks”
(null bitmap, int value, offset to “data”, offset to “bricks”, field length + bytes)
“abcd” with Tungsten encoding: ~5-6 bytes
Runtime Bytecode Generation
31
df.where(df("year")  > 2015)
GreaterThan(year#234,  Literal(2015))
bool filter(Object baseObject) {
  long offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}
DataFrame Code / SQL
Catalyst Expressions
Low-level bytecode
JVM intrinsic JIT-ed to
pointer arithmetic
Platform.getInt(baseObject,  offset);
• Type-safe: operate on domain objects with compiled lambda functions
• Fast: Code-generated encoders for fast serialization
• Interoperable: Easily convert DataFrame ↔ Dataset without boilerplate
32
32
Coming soon: Datasets
val df = ctx.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people: Iterator[Person]) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}
Preview in Spark 1.6
33
Create and run your Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
Questions?
Committer Office Hours
Weds, 4:00-5:00 pm Michael
Thurs, 10:30-11:30 am Reynold
Thurs, 2:00-3:00 pm Andrew
