TRANSFORMATIONS AND ACTIONS
A Visual Guide of the API
http://training.databricks.com/visualapi.pdf
Databricks would like to give a special thanks to Jeff Thompson for
contributing 67 visual diagrams depicting the Spark API under the
MIT license to the Spark community.
Jeff’s original, creative work can be found here and you can read
more about Jeff’s project in his blog post.
After talking to Jeff, Databricks commissioned Adam Breindel to
further evolve Jeff’s work into the diagrams you see in this deck.
LinkedIn
Blog: data-frack
making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines –
from ETL to Exploration and Dashboards, to Advanced
Analytics and Data Products.”
• Founded in late 2013 by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~55 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
Key: RDD Elements
• original item
• transformed type
• object on driver
• RDD
• partition(s)
• user functions
• user input
• input
• emitted value

Legend
• Randomized operation
• Set Theory / Relational operation
• Numeric calculation
Operations = TRANSFORMATIONS + ACTIONS
Essential Core & Intermediate Spark Operations

TRANSFORMATIONS
General:
• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• groupBy
• sortBy
Math / Statistical:
• sample
• randomSplit
Set Theory / Relational:
• union
• intersection
• subtract
• distinct
• cartesian
• zip
Data Structure / I/O:
• keyBy
• zipWithIndex
• zipWithUniqueID
• zipPartitions
• coalesce
• repartition
• repartitionAndSortWithinPartitions
• pipe

ACTIONS
General:
• reduce
• collect
• aggregate
• fold
• first
• take
• forEach
• top
• treeAggregate
• treeReduce
• forEachPartition
• collectAsMap
Math / Statistical:
• count
• takeSample
• max
• min
• sum
• histogram
• mean
• variance
• stdev
• sampleVariance
• countApprox
• countApproxDistinct
Set Theory / Relational:
• takeOrdered
Data Structure / I/O:
• saveAsTextFile
• saveAsSequenceFile
• saveAsObjectFile
• saveAsHadoopDataset
• saveAsHadoopFile
• saveAsNewAPIHadoopDataset
• saveAsNewAPIHadoopFile

(A few of the listed operations that are not diagrammed later in the deck are sketched just below.)
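Most of these operations are illustrated later in this deck. A few are only listed; as a rough sketch (assuming a live SparkContext named sc, as in the other examples), sortBy, takeOrdered and fold behave like this:

# Hypothetical sketch of a few listed operations not shown later in the deck.
# Assumes an existing SparkContext `sc`, as in the other examples.
x = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])
s = x.sortBy(lambda n: n)            # transformation: sort by a key function
t = x.takeOrdered(3)                 # action: the 3 smallest items -> [1, 1, 2]
f = x.fold(0, lambda a, b: a + b)    # action: like reduce, but with a zero value
print(s.collect())                   # [1, 1, 2, 3, 4, 5, 6, 9]
print(t)
print(f)                             # 31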
Essential Core & Intermediate PairRDD Operations

TRANSFORMATIONS
General:
• flatMapValues
• groupByKey
• reduceByKey
• reduceByKeyLocally
• foldByKey
• aggregateByKey
• sortByKey
• combineByKey
Math / Statistical:
• sampleByKey
Set Theory / Relational:
• cogroup (=groupWith)
• join
• subtractByKey
• fullOuterJoin
• leftOuterJoin
• rightOuterJoin
Data Structure:
• partitionBy

ACTIONS
General:
• keys
• values
Math / Statistical:
• countByKey
• countByValue
• countByValueApprox
• countApproxDistinctByKey
• countByKeyApprox
• sampleByKeyExact

(A few of the aggregation-style pair operations listed here are sketched just below.)
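As above, most pair-RDD operations get their own diagrams later. As a hedged sketch (again assuming sc), a few aggregation-style ones that only appear in this list look roughly like this:

# Sketch only: foldByKey, aggregateByKey and sortByKey are listed above but not
# diagrammed later. Assumes an existing SparkContext `sc`.
pairs = sc.parallelize([('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)])
sums = pairs.foldByKey(0, lambda a, b: a + b)         # ('A', 6), ('B', 9)
stats = pairs.aggregateByKey((0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),          # fold a value in, within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]))          # merge partials across partitions
ordered = pairs.sortByKey()                           # all 'A' pairs before 'B' pairs
print(sums.collect())
print(stats.collect())                                # per key: (sum, count)
print(ordered.collect())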
narrow vs wide

narrow: each partition of the parent RDD is used by at most one partition of the child RDD
wide: multiple child RDD partitions may depend on a single parent RDD partition
“One of the challenges in providing RDDs as an abstraction is
choosing a representation for them that can track lineage across a
wide range of transformations.”
“The most interesting question in designing this interface is how
to represent dependencies between RDDs.”
“We found it both sufficient and useful to classify dependencies
into two types:
• narrow dependencies, where each partition of the parent RDD is
used by at most one partition of the child RDD
• wide dependencies, where multiple child partitions may depend
on it.”
narrow vs wide (examples)

narrow (each partition of the parent RDD is used by at most one partition of the child RDD):
• map, filter
• union
• join w/ inputs co-partitioned

wide (multiple child RDD partitions may depend on a single parent RDD partition):
• groupByKey
• join w/ inputs not co-partitioned
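One way to see this distinction in practice (a sketch, not part of the original deck) is to inspect an RDD's lineage with toDebugString(): a narrow transformation such as map stays in the same stage, while a wide one such as groupByKey introduces a shuffle boundary. The exact output format, and whether the string comes back as bytes, varies by Spark and Python version.

# Sketch: inspecting lineage to spot a shuffle (wide) dependency.
x = sc.parallelize(range(10), 2)
narrow = x.map(lambda n: (n % 2, n))   # narrow: no shuffle needed
wide = narrow.groupByKey()             # wide: a ShuffledRDD appears in the lineage
print(narrow.toDebugString())          # one stage of pipelined RDDs
print(wide.toDebugString())            # shows a shuffle boundary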
TRANSFORMATIONS: Core Operations
MAP

Diagram: 3 items in RDD x; the user function is applied item by item, producing RDD y (before: x, after: y).

map(f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of this RDD.

Python:
x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda z: (z, 1))
print(x.collect())
print(y.collect())

x: ['b', 'a', 'c']
y: [('b', 1), ('a', 1), ('c', 1)]

Scala:
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z,1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
FILTER

Diagram: 3 items in RDD x; the user function is applied to each item and emits True or False; only items for which it emits True are kept in RDD y (before: x, after: y).

filter(f)
Return a new RDD containing only the elements that satisfy a predicate.

Python:
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1)  # keep odd values
print(x.collect())
print(y.collect())

x: [1, 2, 3]
y: [1, 3]

Scala:
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
FLATMAP

Diagram: 3 items in RDD x; the user function is applied to each item and its results are flattened into RDD y (before: x, after: y).

flatMap(f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

Python:
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, x*100, 42))
print(x.collect())
print(y.collect())

x: [1, 2, 3]
y: [1, 100, 42, 2, 200, 42, 3, 300, 42]

Scala:
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
GROUPBY

Diagram: 4 items in RDD x ("John", "Fred", "Anna", "James"); for each item the user function emits its first letter, and RDD y collects the items under that key: J -> ["John", "James"], F -> ["Fred"], A -> ["Anna"].

groupBy(f, numPartitions=None)
Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.

Python:
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.groupBy(lambda w: w[0])
print([(k, list(v)) for (k, v) in y.collect()])

x: ['John', 'Fred', 'Anna', 'James']
y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]

Scala:
val x = sc.parallelize(
  Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
GROUPBYKEY

Diagram: 5 items in Pair RDD x with keys A and B; the values for each key are collected into RDD y: A -> [2, 3, 1], B -> [5, 4].

groupByKey(numPartitions=None)
Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.

Python:
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

x: [('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]
y: [('A', [2, 3, 1]),('B',[5, 4])]

Scala:
val x = sc.parallelize(
  Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
MAPPARTITIONS

Diagram: RDD x with partitions A and B; the user function is applied once per partition to produce the corresponding partitions of RDD y.
REDUCEBYKEY VS GROUPBYKEY

Scala:
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()

val wordCountsWithGroup = wordPairsRDD
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
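A rough PySpark equivalent of the same comparison (not part of the original deck; it assumes the usual SparkContext sc):

# Sketch: the same word count written with reduceByKey and with groupByKey.
words = ["one", "two", "two", "three", "three", "three"]
wordPairsRDD = sc.parallelize(words).map(lambda word: (word, 1))
wordCountsWithReduce = wordPairsRDD.reduceByKey(lambda a, b: a + b).collect()
wordCountsWithGroup = (wordPairsRDD
                       .groupByKey()
                       .map(lambda t: (t[0], sum(t[1])))
                       .collect())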
REDUCEBYKEY

Diagram: pairs with the same key are combined on each partition first (e.g. (a,1),(a,1) -> (a,2)), and only the partial sums are shuffled and merged into the final (a,6) and (b,5).
GROUPBYKEY

Diagram: every individual (a,1) and (b,1) pair is shuffled across the network to a single partition per key before being grouped and summed to (a,6) and (b,5).
MAPPARTITIONS

mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.

Python:
x = sc.parallelize([1,2,3], 2)
def f(iterator): yield sum(iterator); yield 42
y = x.mapPartitions(f)
# glom() flattens elements on the same partition
print(x.glom().collect())
print(y.glom().collect())

x: [[1], [2, 3]]
y: [[1, 42], [5, 42]]

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
def f(i:Iterator[Int]) = { (i.sum, 42).productIterator }
val y = x.mapPartitions(f)
// glom() flattens elements on the same partition
val xOut = x.glom().collect()
val yOut = y.glom().collect()

xOut: Array(Array(1), Array(2, 3))
yOut: Array(Array(1, 42), Array(5, 42))
MAPPARTITIONSWITHINDEX

Diagram: RDD x with partitions A and B; the user function receives each partition's index along with its contents and produces the corresponding partition of RDD y.

mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

Python:
x = sc.parallelize([1,2,3], 2)
def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator))
y = x.mapPartitionsWithIndex(f)
# glom() flattens elements on the same partition
print(x.glom().collect())
print(y.glom().collect())

x: [[1], [2, 3]]
y: [[(0, 1)], [(1, 5)]]

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
def f(partitionIndex:Int, i:Iterator[Int]) = {
  (partitionIndex, i.sum).productIterator
}
val y = x.mapPartitionsWithIndex(f)
// glom() flattens elements on the same partition
val xOut = x.glom().collect()
val yOut = y.glom().collect()

xOut: Array(Array(1), Array(2, 3))
yOut: Array(Array(0, 1), Array(1, 5))
SAMPLE

Diagram: 5 items in RDD x; a random subset (here 1 and 3) ends up in RDD y.

sample(withReplacement, fraction, seed=None)
Return a new RDD containing a statistical sample of the original RDD.

Python:
x = sc.parallelize([1, 2, 3, 4, 5])
y = x.sample(False, 0.4, 42)
print(x.collect())
print(y.collect())

x: [1, 2, 3, 4, 5]
y: [1, 3]

Scala:
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
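randomSplit, listed earlier but not diagrammed, is closely related to sample: it splits one RDD into several random, non-overlapping subsets. A hedged sketch:

# Sketch: randomSplit divides an RDD by approximate weights; the split is only
# roughly 60/40 and varies run to run unless a seed is fixed.
x = sc.parallelize(range(100))
train, test = x.randomSplit([0.6, 0.4], seed=42)
print(train.count(), test.count())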
UNION

Diagram: the partitions of RDD x ([1], [2, 3]) and RDD y ([3, 4]) are concatenated into RDD z; the duplicate 3 is kept.

union(otherRDD)
Return a new RDD containing all items from two original RDDs. Duplicates are not culled.

Python:
x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.union(y)
print(z.glom().collect())

x: [1, 2, 3]
y: [3, 4]
z: [[1], [2, 3], [3, 4]]

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
JOIN

Diagram: RDD x holds (A,1) and (B,2); RDD y holds (A,3), (A,4) and (B,5); RDD z pairs the values of each matching key: (A,(1,3)), (A,(1,4)), (B,(2,5)).

join(otherRDD, numPartitions=None)
Return a new RDD containing all pairs of elements having the same key in the original RDDs.

Python:
x = sc.parallelize([("a", 1), ("b", 2)])
y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
z = x.join(y)
print(z.collect())

x: [("a", 1), ("b", 2)]
y: [("a", 3), ("a", 4), ("b", 5)]
z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]

Scala:
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))
DISTINCT

Diagram: RDD x contains 1, 2, 3, 3, 4; RDD y keeps one copy of each value: 1, 2, 3, 4.

distinct(numPartitions=None)
Return a new RDD containing distinct items from the original RDD (omitting all duplicates).

Python:
x = sc.parallelize([1,2,3,3,4])
y = x.distinct()
print(y.collect())

x: [1, 2, 3, 3, 4]
y: [1, 2, 3, 4]

Scala:
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()
println(y.collect().mkString(", "))
COALESCE

Diagram: RDD x has three partitions (A, B, C); coalesce(2) reduces them to two partitions in RDD y (here [1] and [2, 3, 4, 5]) without a full shuffle.

coalesce(numPartitions, shuffle=False)
Return a new RDD which is reduced to a smaller number of partitions.

Python:
x = sc.parallelize([1, 2, 3, 4, 5], 3)
y = x.coalesce(2)
print(x.glom().collect())
print(y.glom().collect())

x: [[1], [2, 3], [4, 5]]
y: [[1], [2, 3, 4, 5]]

Scala:
val x = sc.parallelize(Array(1, 2, 3, 4, 5), 3)
val y = x.coalesce(2)
val xOut = x.glom().collect()
val yOut = y.glom().collect()
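repartition, listed earlier, goes the other way: it always shuffles, so it can increase the number of partitions as well as decrease it. A small sketch, not from the deck:

# Sketch: repartition(n) always shuffles and can grow the partition count,
# whereas coalesce(n) with shuffle=False can only shrink it.
x = sc.parallelize([1, 2, 3, 4, 5], 2)
y = x.repartition(4)
print(x.getNumPartitions())   # 2
print(y.getNumPartitions())   # 4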
KEYBY

Diagram: 4 items in RDD x ("John", "Fred", "Anna", "James"); for each item the user function emits its first letter, and RDD y pairs that key with the original value: (J,"John"), (F,"Fred"), (A,"Anna"), (J,"James").

keyBy(f)
Create a Pair RDD, forming one pair for each item in the original RDD. The pair's key is calculated from the value via a user-supplied function.

Python:
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.keyBy(lambda w: w[0])
print(y.collect())

x: ['John', 'Fred', 'Anna', 'James']
y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]

Scala:
val x = sc.parallelize(
  Array("John", "Fred", "Anna", "James"))
val y = x.keyBy(w => w.charAt(0))
println(y.collect().mkString(", "))
PARTITIONBY

Diagram: RDD x holds (J,"John"), (A,"Anna"), (F,"Fred"), (J,"James") spread over three partitions; each pair is moved to the partition chosen by the user function (0 for keys before 'H', 1 otherwise), giving RDD y with partitions [(A,"Anna"), (F,"Fred")] and [(J,"James"), (J,"John")].

partitionBy(numPartitions, partitionFunc=portable_hash)
Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user supplied function.

Python:
x = sc.parallelize([('J','James'),('F','Fred'),
    ('A','Anna'),('J','John')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print(x.glom().collect())
print(y.glom().collect())

x: [[('J', 'James')], [('F', 'Fred')],
    [('A', 'Anna'), ('J', 'John')]]
y: [[('A', 'Anna'), ('F', 'Fred')],
    [('J', 'James'), ('J', 'John')]]

Scala:
import org.apache.spark.Partitioner
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
    ('A',"Anna"),('J',"John")), 3)
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k:Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1
  }
})
val yOut = y.glom().collect()

Scala output:
Array(Array((A,Anna), (F,Fred)), Array((J,John), (J,James)))
Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))
ZIP

Diagram: RDD x holds 1, 2, 3 and RDD y holds their squares 1, 4, 9 in the same partitions; RDD z pairs them element by element: (1,1), (2,4), (3,9).

zip(otherRDD)
Return a new RDD containing pairs whose key is the item in the original RDD, and whose value is that item's corresponding element (same partition, same index) in a second RDD.

Python:
x = sc.parallelize([1, 2, 3])
y = x.map(lambda n: n*n)
z = x.zip(y)
print(z.collect())

x: [1, 2, 3]
y: [1, 4, 9]
z: [(1, 1), (2, 4), (3, 9)]

Scala:
val x = sc.parallelize(Array(1,2,3))
val y = x.map(n => n*n)
val z = x.zip(y)
println(z.collect().mkString(", "))
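The related zipWithIndex and zipWithUniqueID, listed earlier, need no second RDD; they pair each item with a generated number. A hedged sketch:

# Sketch: zipWithIndex assigns consecutive indices (it may trigger a job to
# count partition sizes); zipWithUniqueId assigns unique but not necessarily
# consecutive ids without that extra job.
x = sc.parallelize(["a", "b", "c"], 2)
print(x.zipWithIndex().collect())      # [('a', 0), ('b', 1), ('c', 2)]
print(x.zipWithUniqueId().collect())   # unique ids, not necessarily 0, 1, 2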
ACTIONS: Core Operations

distributed vs driver
• distributed: occurs across the cluster
• driver: result must fit in the driver JVM
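Because every action's result lands in the driver JVM, collect() on a large RDD can exhaust driver memory; take() and takeSample() are the usual bounded probes. A brief sketch, not in the original deck:

# Sketch: bounded alternatives to collect() when the RDD may be large.
x = sc.parallelize(range(1000000))
print(x.take(5))                         # only the first 5 items
print(x.takeSample(False, 5, seed=42))   # 5 random items, without replacement
print(x.count())                         # a single number is always driver-safe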
GETNUMPARTITIONS

Diagram: RDD x has partitions A and B; the value 2 is returned to the driver.

getNumPartitions()
Return the number of partitions in the RDD.

Python:
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

x: [[1], [2, 3]]
y: 2

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.partitions.size
val xOut = x.glom().collect()
println(y)
COLLECT

Diagram: RDD x has partitions A and B; all items are returned to the driver as a single list.

collect()
Return all items in the RDD to the driver in a single list.

Python:
x = sc.parallelize([1,2,3], 2)
y = x.collect()
print(x.glom().collect())
print(y)

x: [[1], [2, 3]]
y: [1, 2, 3]

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
println(y)
REDUCE

Diagram: the items 1, 2, 3, 4 are combined pairwise (1+2=3, 3+3=6, 6+4=10) and the final value 10 is returned to the driver.

reduce(f)
Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and return a result to the driver.

Python:
x = sc.parallelize([1,2,3,4])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)

x: [1, 2, 3, 4]
y: 10

Scala:
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
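fold, mentioned alongside reduce in the operations list, works the same way but starts each partition (and the final merge) from a user-supplied zero value, which matters when the RDD might be empty. A sketch:

# Sketch: fold is reduce with a zero value; reduce() on an empty RDD raises an
# error, while fold() just returns the zero value.
x = sc.parallelize([1, 2, 3, 4])
empty = x.filter(lambda n: n > 100)          # an empty RDD
print(x.fold(0, lambda a, b: a + b))         # 10
print(empty.fold(0, lambda a, b: a + b))     # 0, where reduce() would fail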
AGGREGATE

Diagram: the items 1, 2, 3, 4 sit in two partitions ([1, 2] and [3, 4]); starting from ([], 0), seqOp folds each item into a per-partition pair, giving ([1,2], 3) and ([3,4], 7), and combOp then merges the two partial results into ([1,2,3,4], 10), which is returned to the driver.

aggregate(zeroValue, seqOp, combOp)
Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-supplied objects,
- then combining those user-defined results via a second user function,
- and finally returning a result to the driver.

Python:
seqOp = lambda data, item: (data[0] + [item], data[1] + item)
combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])
x = sc.parallelize([1,2,3,4])
y = x.aggregate(([], 0), seqOp, combOp)
print(y)

x: [1, 2, 3, 4]
y: ([1, 2, 3, 4], 10)

Scala:
def seqOp = (data:(Array[Int], Int), item:Int) =>
  (data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) =>
  (d1._1.union(d2._1), d1._2 + d2._2)
val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)

x: [1, 2, 3, 4]
y: (Array(3, 1, 2, 4), 10)
MAX

Diagram: from the items 2, 4, 1, the maximum value 4 is returned to the driver.

max()
Return the maximum item in the RDD.

Python:
x = sc.parallelize([2,4,1])
y = x.max()
print(x.collect())
print(y)

x: [2, 4, 1]
y: 4

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
SUM

Diagram: from the items 2, 4, 1, the sum 7 is returned to the driver.

sum()
Return the sum of the items in the RDD.

Python:
x = sc.parallelize([2,4,1])
y = x.sum()
print(x.collect())
print(y)

x: [2, 4, 1]
y: 7

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
MEAN

Diagram: from the items 2, 4, 1, the mean 2.3333333 is returned to the driver.

mean()
Return the mean of the items in the RDD.

Python:
x = sc.parallelize([2,4,1])
y = x.mean()
print(x.collect())
print(y)

x: [2, 4, 1]
y: 2.3333333

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
STDEV

Diagram: from the items 2, 4, 1, the standard deviation 1.2472191 is returned to the driver.

stdev()
Return the standard deviation of the items in the RDD.

Python:
x = sc.parallelize([2,4,1])
y = x.stdev()
print(x.collect())
print(y)

x: [2, 4, 1]
y: 1.2472191

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)
COUNTBYKEY

Diagram: from the pairs (J,"John"), (A,"Anna"), (F,"Fred"), (J,"James"), the driver receives the map {'A': 1, 'J': 2, 'F': 1}.

countByKey()
Return a map of keys and counts of their occurrences in the RDD.

Python:
x = sc.parallelize([('J', 'James'), ('F','Fred'),
    ('A','Anna'), ('J','John')])
y = x.countByKey()
print(y)

x: [('J', 'James'), ('F','Fred'), ('A','Anna'), ('J','John')]
y: {'A': 1, 'J': 2, 'F': 1}

Scala:
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
    ('A',"Anna"),('J',"John")))
val y = x.countByKey()
println(y)
SAVEASTEXTFILE

saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path.

Python:
dbutils.fs.rm("/temp/demo", True)
x = sc.parallelize([2,4,1])
x.saveAsTextFile("/temp/demo")
y = sc.textFile("/temp/demo")
print(y.collect())

x: [2, 4, 1]
y: [u'2', u'4', u'1']

Scala:
dbutils.fs.rm("/temp/demo", true)
val x = sc.parallelize(Array(2,4,1))
x.saveAsTextFile("/temp/demo")
val y = sc.textFile("/temp/demo")
println(y.collect().mkString(", "))
LAB
Q&A


Editor's Notes

  • #2 To-do:
  • #4 Apache Spark is an open-source cluster computing framework originally developed in 2009 in the UC Berkeley AMPLab. Spark makes it easy to get value from big data. It can read from any data source (relational, NoSQL, file systems, etc) and offers one unified API for batch analytics, SQL queries, real time analysis, machine learning and graph processing. Developers no longer have to learn separate processing engines for different tasks. Spark is 10x-100x faster than older systems such as Hadoop MapReduce with proven scalability (the largest Spark cluster has over 8,000 nodes). Spark had over 450 contributors in 2014, making it the most active open source project in the Big Data space. The core Spark team is now at Databricks. With over 50 employees and $47 million in funding, Databricks is the main steward of the Spark project. As a Trainer for Databricks and Spark, you will work closely with this team to understand Spark’s internals and then teach it to engineers around the world.
  • #10 TODO: Sameer create 2 additional slides before this explaining Transformations and Actions
  • #11 TODO: Sameer create 2 additional slides before this explaining Transformations and Actions
  • #12 TODO: Sameer create 2 additional slides before this explaining Transformations and Actions
  • #14 Return a new RDD by applying a function to each element of this RDD.
  • #26 Idiomatic Scala might write this filter as _%2==1 using _ to stand for the input param
  • #39 Order in output is not guaranteed. Point out that most wide transformations support an optional argument with the number of partitions in the output RDD.
  • #41 Point out that results are not necessarily ordered, and that order is not guaranteed consistent between executions
  • #42 To-do: swap
  • #43 Order in output is not guaranteed. Point out that most wide transformations support an optional argument with the number of partitions in the output RDD.
  • #45 Let's look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey… While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
  • #46 Look at the diagram below to understand what happens with reduceByKey. Notice how pairs on the same machine with the same key are combined (by using the lambda function passed into reduceByKey) before the data is shuffled. Then the lambda function is called again to reduce all the values from each partition to produce one final result.
  • #47 On the other hand, when calling groupByKey, all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network. To determine which machine to shuffle a pair to, Spark calls a partitioning function on the key of the pair. Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes out the data to disk one key at a time - so if a single key has more key-value pairs than can fit in memory, an out of memory exception occurs. This will be more gracefully handled in a later release of Spark so the job can still proceed, but should still be avoided - when Spark needs to spill to disk, performance is severely impacted. You can imagine that for a much larger dataset size, the difference in the amount of data you are shuffling becomes more exaggerated between reduceByKey and groupByKey. Here are more functions to prefer over groupByKey: combineByKey can be used when you are combining elements but your return type differs from your input value type; foldByKey merges the values for each key using an associative function and a neutral "zero value".
  • #48 Point out that the result of the mapping is an iterator (with unspecified size), so m items in the original partition can map to n items in the resulting new partition. So the new partition can have more items, the same number of items, or fewer items than the original.
  • #49 Point out that the result of the mapping is an iterator (with unspecified size), so m items in the original partition can map to n items in the resulting new partition. So the new partition can have more items, the same number of items, or fewer items than the original.
  • #51 To-do: Is it correct to remove one of the items from Partition A in RDD x’s diagram here?
  • #52 To-do: Is it correct to remove one of the items from Partition A in RDD x’s diagram here?
  • #53 To-do: swap
  • #54 Important to point out that the sample size itself will vary, and also that the Python and Scala random number generators will produce different output from the same seed, so that, e.g., calling val y = x.sample(false, 0.4, 42) in Scala will not produce the same output shown here on the right (which is the Python output) This is a good one to demo, to show varying results
  • #56 Point Out: Duplicates are not removed. They can be removed with the more expensive .distinct call, presented shortly
  • #57 To-do: swap
  • #58 To-do: swap
  • #59 To-do: swap
  • #60 To-do: swap
  • #61 Point out that this is an inner join, but that duplicate keys in the respective source RDDs result in a Cartesian join of those key-value pairs.
  • #65 Point out that this is an inner join, but that duplicate keys in the respective source RDDs result in a Cartesian join of those key-value pairs.
  • #69 Point out… In some cases, running a shuffle is desirable if the coalesced partitions would be too unevenly distributed to get good parallelism To increase the number of partitions (and trigger a shuffle), call repartition In Scala API (but not Python API), you can call coalesce with a larger number of partitions, and shuffle=true, and achieve a repartition
  • #80 Point out… In some cases, running a shuffle is desirable if the coalesced partitions would be too unevenly distributed to get good parallelism To increase the number of partitions (and trigger a shuffle), call repartition In Scala API (but not Python API), you can call coalesce with a larger number of partitions, and shuffle=true, and achieve a repartition
  • #82 To-do: swap
  • #83 To-do: swap
  • #84 To-do: swap
  • #85 To-do: swap
  • #86 Point out that the two source RDDs should have the same partitions and the same counts within partitions. Typically, one is a mapped version of the other.
  • #96 Explain: function must be commutative and associative, and collection must be closed under the function, allowing maximal parallelism. Refer to fold and aggregate for non-commutative functions or functions whose result is a different type.
  • #104 TODO: Sameer to find out exact execution process – in particular, when does the final combine occur on the driver versus in the cluster. Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value." The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into a U and one operation for merging two Us.
  • #122 The state-of-the-art in Big Data is "simple things complex, complex things impossible." We think the future should be "simple things easy, and complex things possible."