TRANSFORMATIONS AND ACTIONS
A Visual Guide of the API
http://training.databricks.com/visualapi.pdf
Databricks would like to give special thanks to Jeff Thompson for
contributing 67 visual diagrams depicting the Spark API under the
MIT license to the Spark community.
Jeff’s original, creative work can be found here and you can read
more about Jeff’s project in his blog post.
After talking to Jeff, Databricks commissioned Adam Breindel to
further evolve Jeff’s work into the diagrams you see in this deck.
LinkedIn
Blog: data-frack
making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines –
from ETL to Exploration and Dashboards, to Advanced
Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~55 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
Key: RDD Elements
• original item
• transformed type
• object on driver
• RDD partition(s) (A, B)
• user functions
• user input
• input
• emitted value
Legend
• Randomized operation
• Set Theory / Relational operation
• Numeric calculation
Operations = TRANSFORMATIONS + ACTIONS
Essential Core & Intermediate Spark Operations

TRANSFORMATIONS
• General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
• Math / Statistical: sample, randomSplit
• Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
• Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
• General: reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce, foreachPartition, collectAsMap
• Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
• Set Theory / Relational: takeOrdered
• Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

A short sketch combining a few of these operations follows.
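A minimal sketch (not part of the original deck) chaining a few of the operations listed above; the tiny input data is invented for illustration:

x = sc.parallelize([1, 2, 3, 4, 5, 6], 2)   # create an RDD with 2 partitions
evens = x.filter(lambda n: n % 2 == 0)      # transformation (General)
doubled = evens.map(lambda n: n * 2)        # transformation (General)
print(doubled.collect())                    # action (General): [4, 8, 12]
print(doubled.sum())                        # action (Math / Statistical): 24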
Essential Core & Intermediate PairRDD Operations

TRANSFORMATIONS
• General: flatMapValues, groupByKey, reduceByKey, reduceByKeyLocally, foldByKey, aggregateByKey, sortByKey, combineByKey
• Math / Statistical: sampleByKey
• Set Theory / Relational: cogroup (=groupWith), join, subtractByKey, fullOuterJoin, leftOuterJoin, rightOuterJoin
• Data Structure: keys, values, partitionBy

ACTIONS
• Math / Statistical: countByKey, countByValue, countByValueApprox, countApproxDistinctByKey, countByKeyApprox, sampleByKeyExact

A short PairRDD sketch follows.
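Likewise, a small illustrative PairRDD sketch (not from the deck; the data is invented):

x = sc.parallelize([('A', 1), ('B', 2), ('A', 3)])
y = sc.parallelize([('A', 'apple'), ('B', 'banana')])
print(x.reduceByKey(lambda a, b: a + b).collect())  # [('A', 4), ('B', 2)] -- order may vary
print(x.join(y).collect())                          # [('A', (1, 'apple')), ('A', (3, 'apple')), ('B', (2, 'banana'))]
print(x.countByKey())                               # a defaultdict holding {'A': 2, 'B': 1}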
narrow vs. wide dependencies
• narrow: each partition of the parent RDD is used by at most one partition of the child RDD
• wide: multiple child RDD partitions may depend on a single parent RDD partition
“One of the challenges in providing RDDs as an abstraction is
choosing a representation for them that can track lineage across a
wide range of transformations.”
“The most interesting question in designing this interface is how
to represent dependencies between RDDs.”
“We found it both sufficient and useful to classify dependencies
into two types:
• narrow dependencies, where each partition of the parent RDD is
used by at most one partition of the child RDD
• wide dependencies, where multiple child partitions may depend
on it.”
Examples:
• narrow: map, filter, union, join w/ inputs co-partitioned
• wide: groupByKey, join w/ inputs not co-partitioned
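A small sketch (not part of the original deck) that makes the difference visible from the driver: toDebugString() prints an RDD's lineage, and a wide dependency shows up as an extra shuffle stage (the exact text varies by Spark version):

x = sc.parallelize([('A', 1), ('B', 2), ('A', 3)], 2)
narrow = x.map(lambda kv: (kv[0], kv[1] * 10)).filter(lambda kv: kv[1] > 10)
wide = x.groupByKey()
print(narrow.toDebugString())  # a single stage, no shuffle: only narrow dependencies
print(wide.toDebugString())    # typically includes a ShuffledRDD, i.e. a wide (shuffle) dependency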
TRANSFORMATIONS Core
Operations
MAP
RDD x (3 items) → RDD y: the user function is applied item by item, and after map() has been applied each original item has a corresponding transformed item in y.
MAP
RDD: x RDD: y
Return a new RDD by applying a function to each element
of this RDD.
MAP
x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda z: (z, 1))
print(x.collect())
print(y.collect())
['b', 'a', 'c']
[('b', 1), ('a', 1), ('c', 1)]
RDD:
x
RDD:
y
x:
y:
map(f, preservesPartitioning=False)
Return a new RDD by applying a function to each element
of this RDD
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z,1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
FILTER
RDD x (3 items) → RDD y: the user function is applied to each item and emits True or False; after filter() has been applied, y keeps only the items that emitted True.
FILTER
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print(x.collect())
print(y.collect())
[1, 2, 3]
[1, 3]
RDD:
x
RDD:
y
x:
y:
filter(f)
Return a new RDD containing only the elements that satisfy
a predicate
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
FLATMAP
RDD x (3 items) → RDD y: the user function is applied to each item and may emit several values; after flatMap() has been applied, all emitted values are flattened into y.
FLATMAP
RDD: x RDD: y
Return a new RDD by first applying a function to all elements of this RDD, and then
flattening the results
FLATMAP
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, x*100, 42))
print(x.collect())
print(y.collect())
[1, 2, 3]
[1, 100, 42, 2, 200, 42, 3, 300, 42]
x:
y:
RDD:
x
RDD:
y
flatMap(f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then
flattening the results
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
GROUPBY
RDD x holds 4 items: "John", "Fred", "Anna", "James". The user function emits each name's first letter as the key, and groupBy builds RDD y: J → ["John", "James"], F → ["Fred"], A → ["Anna"].
GROUPBY
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.groupBy(lambda w: w[0])
print([(k, list(v)) for (k, v) in y.collect()])
['John', 'Fred', 'Anna', 'James']
[('A',['Anna']),('J',['John','James']),('F',['Fred'])]
RDD:
x
RDD:
y
x:
y:
groupBy(f, numPartitions=None)
Group the data in the original RDD. Create pairs where the key is
the output of a user function, and the value is all items for which
the function yields this key.
val x = sc.parallelize(
Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
GROUPBYKEY
Pair RDD x holds 5 items: (B,5), (B,4), (A,3), (A,2), (A,1). groupByKey collects the values for each key into RDD y: A → [2, 3, 1], B → [5, 4].
GROUPBYKEY
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))
[('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]
[('A', [2, 3, 1]),('B',[5, 4])]
RDD:
x
RDD:
y
x:
y:
groupByKey(numPartitions=None)
Group the values for each key in the original RDD. Create a new
pair where the original key corresponds to this collected group of
values.
val x = sc.parallelize(
Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
MAPPARTITIONS
RDD x → RDD y: the user function is applied to each partition (A, B) as a whole, rather than item by item.
REDUCEBYKEY VS GROUPBYKEY
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
.reduceByKey(_ + _)
.collect()
val wordCountsWithGroup = wordPairsRDD
.groupByKey()
.map(t => (t._1, t._2.sum))
.collect()
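For reference, a rough PySpark equivalent of the Scala comparison above (a sketch, not part of the original deck):

words = ["one", "two", "two", "three", "three", "three"]
wordPairsRDD = sc.parallelize(words).map(lambda word: (word, 1))
wordCountsWithReduce = wordPairsRDD.reduceByKey(lambda a, b: a + b).collect()
wordCountsWithGroup = wordPairsRDD.groupByKey().map(lambda t: (t[0], sum(t[1]))).collect()
print(wordCountsWithReduce)  # [('one', 1), ('two', 2), ('three', 3)] -- order may vary
print(wordCountsWithGroup)   # same counts, but every raw pair crosses the shuffle first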
REDUCEBYKEY
Pairs are combined within each partition before the shuffle, so only per-partition partial results such as (a,1), (a,2), (a,3) and (b,1), (b,2), (b,2) cross the network before being merged into (a,6) and (b,5).
GROUPBYKEY
Every raw (a,1) and (b,1) pair is shuffled across the network unchanged and only then combined, yielding the same (a,6) and (b,5) totals while moving far more data than reduceByKey.
MAPPARTITIONS
x:
y:
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition
of this RDD
A
B
A
B
x = sc.parallelize([1,2,3], 2)
def f(iterator): yield sum(iterator); yield 42
y = x.mapPartitions(f)
# glom() flattens elements on the same partition
print(x.glom().collect())
print(y.glom().collect())
[[1], [2, 3]]
[[1, 42], [5, 42]]
MAPPARTITIONS
x:
y:
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition
of this RDD
A
B
A
B
Array(Array(1), Array(2, 3))
Array(Array(1, 42), Array(5, 42))
val x = sc.parallelize(Array(1,2,3), 2)
def f(i:Iterator[Int])={ (i.sum,42).productIterator }
val y = x.mapPartitions(f)
// glom() flattens elements on the same partition
val xOut = x.glom().collect()
val yOut = y.glom().collect()
MAPPARTITIONSWITHINDEX
RDD x → RDD y: the user function is applied to each partition (A, B) and also receives the index of that partition as input.
x:
y:
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of
this RDD, while tracking the index of the original partition
A
B
A
B
x = sc.parallelize([1,2,3], 2)
def f(partitionIndex, iterator): yield partitionIndex; yield sum(iterator)
y = x.mapPartitionsWithIndex(f)
# glom() flattens elements on the same partition
print(x.glom().collect())
print(y.glom().collect())
[[1], [2, 3]]
[[0, 1], [1, 5]]
MAPPARTITIONSWITHINDEX
partition
index
B A
x:
y:
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of
this RDD, while tracking the index of the original partition.
A
B
A
B
Array(Array(1), Array(2, 3))
Array(Array(0, 1), Array(1, 5))
MAPPARTITIONSWITHINDEX
partition
index
B A
val x = sc.parallelize(Array(1,2,3), 2)
def f(partitionIndex:Int, i:Iterator[Int]) = {
(partitionIndex, i.sum).productIterator
}
val y = x.mapPartitionsWithIndex(f)
// glom() flattens elements on the same partition
val xOut = x.glom().collect()
val yOut = y.glom().collect()
SAMPLE
RDD: x RDD: y
1
3
5
4
3
2
1
SAMPLE
x = sc.parallelize([1, 2, 3, 4, 5])
y = x.sample(False, 0.4, 42)
print(x.collect())
print(y.collect())
[1, 2, 3, 4, 5]
[1, 3]
RDD:
x
RDD:
y
x:
y:
sample(withReplacement, fraction, seed=None)
Return a new RDD containing a statistical sample of the
original RDD
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
UNION RDD: x RDD: y
4
3
3
2
1
A
B
C
4
3
3
2
1
A
B
C
RDD: z
UNION
x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.union(y)
print(z.glom().collect())
[1, 2, 3]
[3, 4]
[[1], [2, 3], [3, 4]]
x:
y:
union(otherRDD)
Return a new RDD containing all items from two original RDDs.
Duplicates are not culled.
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
z:
JOIN
RDD x holds (A,1) and (B,2); RDD y holds (A,3), (A,4) and (B,5). Pairs with matching keys are combined into RDD z: (A,(1,3)), (A,(1,4)), (B,(2,5)).
JOIN
x = sc.parallelize([("a", 1), ("b", 2)])
y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
z = x.join(y)
print(z.collect())
x: [("a", 1), ("b", 2)]
y: [("a", 3), ("a", 4), ("b", 5)]
z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
join(otherRDD, numPartitions=None)
Return a new RDD containing all pairs of elements having the same key in
the original RDDs
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))
z:
DISTINCT
4
3
3
2
1
RDD: x
DISTINCT
RDD: x
4
3
3
2
1
4
3
3
2
1
RDD: y
DISTINCT
RDD: x
4
3
3
2
1
4
3
2
1
RDD: y
DISTINCT
x = sc.parallelize([1,2,3,3,4])
y = x.distinct()
print(y.collect())
[1, 2, 3, 3, 4]
[1, 2, 3, 4]
x:
y:
distinct(numPartitions=None)
Return a new RDD containing distinct items from the original RDD (omitting
all duplicates)
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()
println(y.collect().mkString(", "))
COALESCE
RDD: x
A
B
C
COALESCE
RDD: x
B
C
RDD: y
A
AB
C
COALESCE
RDD: x
B
C
RDD: y
A
AB
COALESCE
x = sc.parallelize([1, 2, 3, 4, 5], 3)
y = x.coalesce(2)
print(x.glom().collect())
print(y.glom().collect())
[[1], [2, 3], [4, 5]]
[[1], [2, 3, 4, 5]]
x:
y:
coalesce(numPartitions, shuffle=False)
Return a new RDD which is reduced to a smaller number of
partitions
val x = sc.parallelize(Array(1, 2, 3, 4, 5), 3)
val y = x.coalesce(2)
val xOut = x.glom().collect()
val yOut = y.glom().collect()
KEYBY
RDD x holds "John", "Fred", "Anna", "James". The user function emits each name's first letter, and keyBy uses it as the key to build Pair RDD y: (J,"John"), (F,"Fred"), (A,"Anna"), (J,"James").
KEYBY
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.keyBy(lambda w: w[0])
print(y.collect())
['John', 'Fred', 'Anna', 'James']
[('J','John'),('F','Fred'),('A','Anna'),('J','James')]
RDD:
x
RDD:
y
x:
y:
keyBy(f)
Create a Pair RDD, forming one pair for each item in the
original RDD. The pair’s key is calculated from the value via
a user-supplied function.
val x = sc.parallelize(
Array("John", "Fred", "Anna", "James"))
val y = x.keyBy(w => w.charAt(0))
println(y.collect().mkString(", "))
PARTITIONBY
RDD x holds (J,"James"), (F,"Fred"), (A,"Anna"), (J,"John") spread over 3 partitions. partitionBy moves each pair into one of 2 partitions of RDD y according to the user-supplied function applied to its key: (A,"Anna") and (F,"Fred") land in partition 0, (J,"James") and (J,"John") in partition 1.
x = sc.parallelize([('J','James'),('F','Fred'),
('A','Anna'),('J','John')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print(x.glom().collect())
print(y.glom().collect())
[[('J', 'James')], [('F', 'Fred')],
[('A', 'Anna'), ('J', 'John')]]
[[('A', 'Anna'), ('F', 'Fred')],
[('J', 'James'), ('J', 'John')]]
x:
y:
partitionBy(numPartitions, partitioner=portable_hash)
Return a new RDD with the specified number of partitions,
placing original items into the partition returned by a user
supplied function
PARTITIONBY
Array(Array((A,Anna), (F,Fred)),
Array((J,John), (J,James)))
Array(Array((F,Fred), (A,Anna)),
Array((J,John), (J,James)))
x:
y:
partitionBy(numPartitions, partitioner=portable_hash)
Return a new RDD with the specified number of partitions,
placing original items into the partition returned by a user
supplied function.
import org.apache.spark.Partitioner
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
('A',"Anna"),('J',"John")), 3)
val y = x.partitionBy(new Partitioner() {
val numPartitions = 2
def getPartition(k:Any) = {
if (k.asInstanceOf[Char] < 'H') 0 else 1
}
})
val yOut = y.glom().collect()
ZIP
RDD x holds 1, 2, 3 and RDD y holds 1, 4, 9, laid out across the same partitions (A, B). zip pairs the elements by partition and position into RDD z: (1,1), (2,4), (3,9).
ZIP
x = sc.parallelize([1, 2, 3])
y = x.map(lambda n:n*n)
z = x.zip(y)
print(z.collect())
x: [1, 2, 3]
y: [1, 4, 9]
z: [(1, 1), (2, 4), (3, 9)]
zip(otherRDD)
Return a new RDD containing pairs whose key is the item in the original
RDD, and whose value is that item’s corresponding element (same
partition, same index) in a second RDD
val x = sc.parallelize(Array(1,2,3))
val y = x.map(n=>n*n)
val z = x.zip(y)
println(z.collect().mkString(", "))
z:
ACTIONS Core Operations
distributed vs. driver
• distributed: occurs across the cluster
• driver: the result must fit in the driver JVM
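A tiny illustrative sketch (not from the deck) of why results must fit in the driver JVM: prefer bounded actions such as take() or count() when the RDD is large, and reserve collect() for small results.

x = sc.parallelize(range(1000000))
print(x.take(5))   # only 5 elements are shipped to the driver: [0, 1, 2, 3, 4]
print(x.count())   # computed across the cluster; only a single number (1000000) returns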
GETNUMPARTITIONS
partition(s)
A
B
2
x:
y:
getNumPartitions()
Return the number of partitions in RDD
[[1], [2, 3]]
2
GETNUMPARTITIONS
A
B
2
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.partitions.size
val xOut = x.glom().collect()
println(y)
COLLECT
partition(s)
A
B
[ ]
x:
y:
collect()
Return all items in the RDD to the driver in a single list
[[1], [2, 3]]
[1, 2, 3]
A
B
x = sc.parallelize([1,2,3], 2)
y = x.collect()
print(x.glom().collect())
print(y)
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
println(y)
COLLECT [ ]
REDUCE
3
2
1
emits
4
3
4
REDUCE
3
2
1
emits
3
input
6
4
REDUCE
3
2
1
10
input
6
10
x:
y:
reduce(f)
Aggregate all the elements of the RDD by applying a user
function pairwise to elements and partial results, and returns
a result to the driver
[1, 2, 3, 4]
10
x = sc.parallelize([1,2,3,4])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
REDUCE
AGGREGATE
3
2
1
4
A
B
AGGREGATE
3
2
1
4
AGGREGATE
3
2
1
4
emits
([], 0)
([1], 1)
AGGREGATE
3
2
1
4
([], 0)
([2], 2)
([1], 1)
AGGREGATE
3
2
1
4
([2], 2)
([1], 1)([1,2], 3)
AGGREGATE
3
2
1
4
([2], 2)
([1], 1)([1,2], 3)
([3], 3)
([], 0)
AGGREGATE
3
2
1
4
([2], 2)
([1], 1)([1,2], 3)
([3], 3)
([], 0)
([4], 4)
AGGREGATE
3
2
1
4
([2], 2)
([1], 1)([1,2], 3)
([4], 4)
([3], 3)([3,4], 7)
AGGREGATE
3
2
1
4
([2], 2)
([1], 1)([1,2], 3)
([4], 4)
([3], 3)([3,4], 7)
AGGREGATE
3
2
1
4
([1,2], 3)
([3,4], 7)
AGGREGATE
3
2
1
4
([1,2], 3)
([3,4], 7)
([1,2,3,4], 10)
aggregate(identity, seqOp, combOp)
Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-
supplied objects,
- then combining those user-defined results via a second user
function,
- and finally returning a result to the driver.
seqOp = lambda data, item: (data[0] + [item], data[1] + item)
combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])
x = sc.parallelize([1,2,3,4])
y = x.aggregate(([], 0), seqOp, combOp)
print(y)
AGGREGATE
x:
y:
[1, 2, 3, 4]
([1, 2, 3, 4], 10)
def seqOp = (data:(Array[Int], Int), item:Int) =>
(data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) =>
(d1._1.union(d2._1), d1._2 + d2._2)
val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)
x:
y:
aggregate(identity, seqOp, combOp)
Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-
supplied objects,
- then combining those user-defined results via a second user
function,
- and finally returning a result to the driver.
[1, 2, 3, 4]
(Array(3, 1, 2, 4),10)
AGGREGATE
MAX
4
2
1
4
x:
y:
max()
Return the maximum item in the RDD
[2, 4, 1]
4
MAX
4
x = sc.parallelize([2,4,1])
y = x.max()
print(x.collect())
print(y)
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
2
1
4
max
SUM
7
2
1
4
x:
y:
sum()
Return the sum of the items in the RDD
[2, 4, 1]
7
SUM
7
x = sc.parallelize([2,4,1])
y = x.sum()
print(x.collect())
print(y)
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
2
1
4
Σ
MEAN
2.33333333
2
1
4
x:
y:
mean()
Return the mean of the items in the RDD
[2, 4, 1]
2.3333333
MEAN
2.3333333
x = sc.parallelize([2,4,1])
y = x.mean()
print(x.collect())
print(y)
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
2
1
4
x
STDEV
1.2472191
2
1
4
x:
y:
stdev()
Return the standard deviation of the items in the RDD
[2, 4, 1]
1.2472191
STDEV
1.2472191
x = sc.parallelize([2,4,1])
y = x.stdev()
print(x.collect())
print(y)
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)
2
1
4
σ
COUNTBYKEY
{'A': 1, 'J': 2, 'F': 1}
J “John”
“Anna”A
“Fred”F
J “James”
x:
y:
countByKey()
Return a map of keys and counts of their occurrences in the RDD
[('J', 'James'), ('F','Fred'),
('A','Anna'), ('J','John')]
{'A': 1, 'J': 2, 'F': 1}
COUNTBYKEY
{'A': 1, 'J': 2, 'F': 1}
x = sc.parallelize([('J', 'James'), ('F','Fred'),
('A','Anna'), ('J','John')])
y = x.countByKey()
print(y)
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
('A',"Anna"),('J',"John")))
val y = x.countByKey()
println(y)
SAVEASTEXTFILE
x:
y:
saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path
[2, 4, 1]
[u'2', u'4', u'1']
SAVEASTEXTFILE
dbutils.fs.rm("/temp/demo", True)
x = sc.parallelize([2,4,1])
x.saveAsTextFile("/temp/demo")
y = sc.textFile("/temp/demo")
print(y.collect())
dbutils.fs.rm("/temp/demo", true)
val x = sc.parallelize(Array(2,4,1))
x.saveAsTextFile("/temp/demo")
val y = sc.textFile("/temp/demo")
println(y.collect().mkString(", "))
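A small follow-on sketch (not from the deck; the path /temp/demo2 is hypothetical): saveAsTextFile writes a directory, typically with one part-NNNNN file per partition plus a _SUCCESS marker.

dbutils.fs.rm("/temp/demo2", True)           # hypothetical path; clean up any previous run
x = sc.parallelize([2, 4, 1], 2)             # 2 partitions -> 2 part files
x.saveAsTextFile("/temp/demo2")
print(sc.textFile("/temp/demo2").collect())  # e.g. ['2', '4', '1'] -- order may vary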
LAB
Q&A
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 

Transformations and Actions: A Visual Guide (training deck transcript)

  • 1. TRANSFORMATIONS AND ACTIONS: A Visual Guide of the API (http://training.databricks.com/visualapi.pdf)
  • 2. Databricks would like to give special thanks to Jeff Thompson for contributing 67 visual diagrams depicting the Spark API under the MIT license to the Spark community. Jeff’s original, creative work can be found here and you can read more about Jeff’s project in his blog post. After talking to Jeff, Databricks commissioned Adam Breindel to further evolve Jeff’s work into the diagrams you see in this deck. LinkedIn Blog: data-frack
  • 3. making big data simple Databricks Cloud: “A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.” • Founded in late 2013 • by the creators of Apache Spark • Original team from UC Berkeley AMPLab • Raised $47 Million in 2 rounds • ~55 employees • We’re hiring! • Level 2/3 support partnerships with • Hortonworks • MapR • DataStax (http://databricks.workable.com)
  • 5. Legend: Randomized operation, Set Theory / Relational operation, Numeric calculation
  • 7. • map • filter • flatMap • mapPartitions • mapPartitionsWithIndex • groupBy • sortBy = medium Essential Core & Intermediate Spark Operations TRANSFORMATIONS ACTIONS General • sample • randomSplit Math / Statistical = easy Set Theory / Relational • union • intersection • subtract • distinct • cartesian • zip • takeOrdered Data Structure / I/O • saveAsTextFile • saveAsSequenceFile • saveAsObjectFile • saveAsHadoopDataset • saveAsHadoopFile • saveAsNewAPIHadoopDataset • saveAsNewAPIHadoopFile • keyBy • zipWithIndex • zipWithUniqueId • zipPartitions • coalesce • repartition • repartitionAndSortWithinPartitions • pipe • count • takeSample • max • min • sum • histogram • mean • variance • stdev • sampleVariance • countApprox • countApproxDistinct • reduce • collect • aggregate • fold • first • take • foreach • top • treeAggregate • treeReduce • foreachPartition • collectAsMap
  • 8. = medium Essential Core & Intermediate PairRDD Operations TRANSFORMATIONS ACTIONS General • sampleByKey Math / Statistical = easy Set Theory / Relational Data Structure • keys • values • partitionBy • countByKey • countByValue • countByValueApprox • countApproxDistinctByKey • countByKeyApprox • sampleByKeyExact • cogroup (=groupWith) • join • subtractByKey • fullOuterJoin • leftOuterJoin • rightOuterJoin • flatMapValues • groupByKey • reduceByKey • reduceByKeyLocally • foldByKey • aggregateByKey • sortByKey • combineByKey
  • 9. narrow vs wide: narrow means each partition of the parent RDD is used by at most one partition of the child RDD; wide means multiple child RDD partitions may depend on a single parent RDD partition
  • 10. “One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.” “The most interesting question in designing this interface is how to represent dependencies between RDDs.” “We found it both sufficient and useful to classify dependencies into two types: • narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD • wide dependencies, where multiple child partitions may depend on it.”
  • 11. narrow vs wide: narrow means each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map, filter, union, join with co-partitioned inputs); wide means multiple child RDD partitions may depend on a single parent RDD partition (e.g. groupByKey, join with inputs not co-partitioned)
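Added annotation (not part of the original deck): a minimal PySpark sketch of how a wide dependency shows up in an RDD's lineage, assuming an active SparkContext named sc. The mapValues step is narrow, while groupByKey introduces a shuffle (a new stage), which toDebugString() makes visible.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    mapped = pairs.mapValues(lambda v: v * 10)   # narrow: stays within each parent partition
    grouped = mapped.groupByKey()                # wide: requires a shuffle across partitions
    print(grouped.toDebugString())               # lineage shows a ShuffledRDD / stage boundary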
  • 13. MAP 3 items in RDD RDD: x
  • 14. MAP User function applied item by item RDD: x RDD: y
  • 17. MAP RDD: x RDD: y After map() has been applied… before after
  • 18. MAP RDD: x RDD: y Return a new RDD by applying a function to each element of this RDD.
  • 19. MAP x = sc.parallelize(["b", "a", "c"]) y = x.map(lambda z: (z, 1)) print(x.collect()) print(y.collect()) ['b', 'a', 'c'] [('b', 1), ('a', 1), ('c', 1)] RDD: x RDD: y x: y: map(f, preservesPartitioning=False) Return a new RDD by applying a function to each element of this RDD val x = sc.parallelize(Array("b", "a", "c")) val y = x.map(z => (z,1)) println(x.collect().mkString(", ")) println(y.collect().mkString(", "))
  • 20. FILTER 3 items in RDD RDD: x
  • 21. FILTER Apply user function: keep item if function returns true RDD: x RDD: y emits True
  • 22. FILTER RDD: x RDD: y emits False
  • 23. FILTER RDD: x RDD: y emits True
  • 24. FILTER RDD: x RDD: y After filter() has been applied… before after
  • 25. FILTER x = sc.parallelize([1,2,3]) y = x.filter(lambda x: x%2 == 1) #keep odd values print(x.collect()) print(y.collect()) [1, 2, 3] [1, 3] RDD: x RDD: y x: y: filter(f) Return a new RDD containing only the elements that satisfy a predicate val x = sc.parallelize(Array(1,2,3)) val y = x.filter(n => n%2 == 1) println(x.collect().mkString(", ")) println(y.collect().mkString(", "))
  • 26. FLATMAP 3 items in RDD RDD: x
  • 30. FLATMAP RDD: x RDD: y After flatMap() has been applied… before after
  • 31. FLATMAP RDD: x RDD: y Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
  • 32. FLATMAP x = sc.parallelize([1,2,3]) y = x.flatMap(lambda x: (x, x*100, 42)) print(x.collect()) print(y.collect()) [1, 2, 3] [1, 100, 42, 2, 200, 42, 3, 300, 42] x: y: RDD: x RDD: y flatMap(f, preservesPartitioning=False) Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results val x = sc.parallelize(Array(1,2,3)) val y = x.flatMap(n => Array(n, n*100, 42)) println(x.collect().mkString(", ")) println(y.collect().mkString(", "))
  • 33. GROUPBY 4 items in RDD RDD: x James Anna Fred John
  • 35. F [ “Fred” ] GROUPBY RDD: x James Anna Fred emits ‘F’ J [ “John” ] John RDD: y
  • 36. [ “Fred” ] GROUPBY RDD: x James Anna emits ‘A’ J [ “John” ] A [ “Anna” ] Fred John F RDD: y
  • 37. [ “Fred” ] GROUPBY RDD: x James Anna emits ‘J’ J [ “John”, “James” ] [ “Anna” ] Fred John F A RDD: y
  • 38. GROUPBY x = sc.parallelize(['John', 'Fred', 'Anna', 'James']) y = x.groupBy(lambda w: w[0]) print([(k, list(v)) for (k, v) in y.collect()]) ['John', 'Fred', 'Anna', 'James'] [('A',['Anna']),('J',['John','James']),('F',['Fred'])] RDD: x RDD: y x: y: groupBy(f, numPartitions=None) Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key. val x = sc.parallelize( Array("John", "Fred", "Anna", "James")) val y = x.groupBy(w => w.charAt(0)) println(y.collect().mkString(", "))
  • 39. GROUPBYKEY 5 items in RDD Pair RDD: x B B A A A 5 4 3 2 1
  • 40. GROUPBYKEY Pair RDD: x 5 4 3 2 1 RDD: y A [ 2 , 3 , 1 ] B B A A A
  • 41. GROUPBYKEY Pair RDD: x RDD: y B [ 5 , 4 ] A [ 2 , 3 , 1 ] 5 4 3 2 1 B B A A A
  • 42. GROUPBYKEY x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)]) y = x.groupByKey() print(x.collect()) print(list((j[0], list(j[1])) for j in y.collect())) [('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)] [('A', [2, 3, 1]),('B',[5, 4])] RDD: x RDD: y x: y: groupByKey(numPartitions=None) Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values. val x = sc.parallelize( Array(('B',5),('B',4),('A',3),('A',2),('A',1))) val y = x.groupByKey() println(x.collect().mkString(", ")) println(y.collect().mkString(", "))
  • 43. MAPPARTITIONS RDD: x RDD: y partitions A B A B
  • 44. REDUCEBYKEY VS GROUPBYKEY val words = Array("one", "two", "two", "three", "three", "three") val wordPairsRDD = sc.parallelize(words).map(word => (word, 1)) val wordCountsWithReduce = wordPairsRDD .reduceByKey(_ + _) .collect() val wordCountsWithGroup = wordPairsRDD .groupByKey() .map(t => (t._1, t._2.sum)) .collect()
  • 45. REDUCEBYKEY (a, 1) (b, 1) (a, 1) (b, 1) (a, 1) (a, 1) (a, 2) (b, 2)(b, 1) (b, 1) (a, 1) (a, 1) (a, 3) (b, 2) (a, 1) (b, 1) (b, 1) (a, 1) (a, 2) (a, 6) (a, 3) (b, 1) (b, 2) (b, 5) (b, 2) a b
  • 46. GROUPBYKEY (a, 1) (b, 1) (a, 1) (a, 1) (b, 1) (b, 1) (a, 1) (a, 1) (a, 1) (b, 1) (b, 1) (a, 1) (a, 1) (a, 6) (a, 1) (b, 5) a b (a, 1) (a, 1) (a, 1) (b, 1) (b, 1) (b, 1) (b, 1) (b, 1)
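Added annotation (not part of the original deck): a PySpark version of the word-count comparison from the slide above, assuming an active SparkContext sc. reduceByKey combines values per partition before the shuffle; groupByKey ships every (word, 1) pair across the network first.
    words = sc.parallelize(["one", "two", "two", "three", "three", "three"])
    word_pairs = words.map(lambda w: (w, 1))
    counts_with_reduce = word_pairs.reduceByKey(lambda a, b: a + b).collect()
    counts_with_group = word_pairs.groupByKey().mapValues(lambda vs: sum(vs)).collect()
    # both yield [('one', 1), ('two', 2), ('three', 3)] (order may vary)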
  • 47. MAPPARTITIONS x: y: mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD A B A B x = sc.parallelize([1,2,3], 2) def f(iterator): yield sum(iterator); yield 42 y = x.mapPartitions(f) # glom() flattens elements on the same partition print(x.glom().collect()) print(y.glom().collect()) [[1], [2, 3]] [[1, 42], [5, 42]]
  • 48. MAPPARTITIONS x: y: mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD A B A B Array(Array(1), Array(2, 3)) Array(Array(1, 42), Array(5, 42)) val x = sc.parallelize(Array(1,2,3), 2) def f(i:Iterator[Int])={ (i.sum,42).productIterator } val y = x.mapPartitions(f) // glom() flattens elements on the same partition val xOut = x.glom().collect() val yOut = y.glom().collect()
  • 49. MAPPARTITIONSWITHINDEX RDD: x RDD: y partitions A B A B input partition index
  • 50. x: y: mapPartitionsWithIndex(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition A B A B x = sc.parallelize([1,2,3], 2) def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator)) y = x.mapPartitionsWithIndex(f) # glom() flattens elements on the same partition print(x.glom().collect()) print(y.glom().collect()) [[1], [2, 3]] [[0, 1], [1, 5]] MAPPARTITIONSWITHINDEX partition index B A
  • 51. x: y: mapPartitionsWithIndex(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. A B A B Array(Array(1), Array(2, 3)) Array(Array(0, 1), Array(1, 5)) MAPPARTITIONSWITHINDEX partition index B A val x = sc.parallelize(Array(1,2,3), 2) def f(partitionIndex:Int, i:Iterator[Int]) = { (partitionIndex, i.sum).productIterator } val y = x.mapPartitionsWithIndex(f) // glom() flattens elements on the same partition val xOut = x.glom().collect() val yOut = y.glom().collect()
  • 52. SAMPLE RDD: x RDD: y 1 3 5 4 3 2 1
  • 53. SAMPLE x = sc.parallelize([1, 2, 3, 4, 5]) y = x.sample(False, 0.4, 42) print(x.collect()) print(y.collect()) [1, 2, 3, 4, 5] [1, 3] RDD: x RDD: y x: y: sample(withReplacement, fraction, seed=None) Return a new RDD containing a statistical sample of the original RDD val x = sc.parallelize(Array(1, 2, 3, 4, 5)) val y = x.sample(false, 0.4) // omitting seed will yield different output println(y.collect().mkString(", "))
  • 54. UNION RDD: x RDD: y 4 3 3 2 1 A B C 4 3 3 2 1 A B C RDD: z
  • 55. UNION x = sc.parallelize([1,2,3], 2) y = sc.parallelize([3,4], 1) z = x.union(y) print(z.glom().collect()) [1, 2, 3] [3, 4] [[1], [2, 3], [3, 4]] x: y: union(otherRDD) Return a new RDD containing all items from two original RDDs. Duplicates are not culled. val x = sc.parallelize(Array(1,2,3), 2) val y = sc.parallelize(Array(3,4), 1) val z = x.union(y) val zOut = z.glom().collect() z:
  • 56. 5B JOIN RDD: x RDD: y 4 2B A 1A 3A
  • 57. JOIN RDD: x RDD: y RDD: z (1, 3)A 5B 4 2B A 1A 3A
  • 58. JOIN RDD: x RDD: y RDD: z (1, 4)A (1, 3)A 5B 4 2B A 1A 3A
  • 59. JOIN RDD: x RDD: y (2, 5) RDD: zB (1, 4)A (1, 3)A 5B 4 2B A 1A 3A
  • 60. JOIN x = sc.parallelize([("a", 1), ("b", 2)]) y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)]) z = x.join(y) print(z.collect()) [("a", 1), ("b", 2)] [("a", 3), ("a", 4), ("b", 5)] [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))] x: y: join(otherRDD, numPartitions=None) Return a new RDD containing all pairs of elements having the same key in the original RDDs val x = sc.parallelize(Array(("a", 1), ("b", 2))) val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5))) val z = x.join(y) println(z.collect().mkString(", ")) z:
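Added annotation (not part of the original deck): the operations list also mentions leftOuterJoin, rightOuterJoin and fullOuterJoin. A small illustrative PySpark sketch, assuming SparkContext sc, of how they treat unmatched keys by padding the missing side with None.
    x = sc.parallelize([("a", 1), ("b", 2)])
    y = sc.parallelize([("a", 3), ("c", 4)])
    print(x.leftOuterJoin(y).collect())   # e.g. [('a', (1, 3)), ('b', (2, None))]
    print(x.fullOuterJoin(y).collect())   # also includes ('c', (None, 4)); order may vary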
  • 64. DISTINCT x = sc.parallelize([1,2,3,3,4]) y = x.distinct() print(y.collect()) [1, 2, 3, 3, 4] [1, 2, 3, 4] x: y: distinct(numPartitions=None) Return a new RDD containing distinct items from the original RDD (omitting all duplicates) val x = sc.parallelize(Array(1,2,3,3,4)) val y = x.distinct() println(y.collect().mkString(", ")) * * *¤ ¤
  • 68. COALESCE x = sc.parallelize([1, 2, 3, 4, 5], 3) y = x.coalesce(2) print(x.glom().collect()) print(y.glom().collect()) [[1], [2, 3], [4, 5]] [[1], [2, 3, 4, 5]] x: y: coalesce(numPartitions, shuffle=False) Return a new RDD which is reduced to a smaller number of partitions val x = sc.parallelize(Array(1, 2, 3, 4, 5), 3) val y = x.coalesce(2) val xOut = x.glom().collect() val yOut = y.glom().collect() C B C A AB
  • 73. KEYBY x = sc.parallelize(['John', 'Fred', 'Anna', 'James']) y = x.keyBy(lambda w: w[0]) print(y.collect()) ['John', 'Fred', 'Anna', 'James'] [('J','John'),('F','Fred'),('A','Anna'),('J','James')] RDD: x RDD: y x: y: keyBy(f) Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function. val x = sc.parallelize( Array("John", "Fred", "Anna", "James")) val y = x.keyBy(w => w.charAt(0)) println(y.collect().mkString(", "))
  • 75. PARTITIONBY RDD: x J “John” “Anna”A “Fred”F J “James” RDD: y J “James” 1
  • 76. PARTITIONBY RDD: x J “John” “Anna”A “Fred”F RDD: y J “James” “Fred”F 0 J “James”
  • 77. PARTITIONBY RDD: x J “John” “Anna”A RDD: y J “James” “Anna”A “Fred”F 0 “Fred”F J “James”
  • 78. PARTITIONBY RDD: x J “John” RDD: y J “John” J “James” “Anna”A “Fred”F 1 “Anna”A “Fred”F J “James”
  • 79. PARTITIONBY x = sc.parallelize([('J','James'),('F','Fred'), ('A','Anna'),('J','John')], 3) y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1) print(x.glom().collect()) print(y.glom().collect()) [[('J', 'James')], [('F', 'Fred')], [('A', 'Anna'), ('J', 'John')]] [[('A', 'Anna'), ('F', 'Fred')], [('J', 'James'), ('J', 'John')]] x: y: partitionBy(numPartitions, partitioner=portable_hash) Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user-supplied function
  • 80. PARTITIONBY Array(Array((A,Anna), (F,Fred)), Array((J,John), (J,James))) Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James))) x: y: partitionBy(numPartitions, partitioner=portable_hash) Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user supplied function. import org.apache.spark.Partitioner val x = sc.parallelize(Array(('J',"James"),('F',"Fred"), ('A',"Anna"),('J',"John")), 3) val y = x.partitionBy(new Partitioner() { val numPartitions = 2 def getPartition(k:Any) = { if (k.asInstanceOf[Char] < 'H') 0 else 1 } }) val yOut = y.glom().collect()
  • 81. ZIP RDD: x RDD: y 3 2 1 A B 9 4 1 A B
  • 82. ZIP RDD: x RDD: y 3 2 1 A B 4 A RDD: z 9 4 1 A B 1 1
  • 83. ZIP RDD: x RDD: y 3 2 1 A B 4 A RDD: z 9 4 1 A B 2 1 4 1
  • 84. ZIP RDD: x RDD: y 3 2 1 A B 4 3 A B RDD: z 9 4 1 A B 2 1 9 4 1
  • 85. ZIP x = sc.parallelize([1, 2, 3]) y = x.map(lambda n:n*n) z = x.zip(y) print(z.collect()) [1, 2, 3] [1, 4, 9] [(1, 1), (2, 4), (3, 9)] x: y: zip(otherRDD) Return a new RDD containing pairs whose key is the item in the original RDD, and whose value is that item’s corresponding element (same partition, same index) in a second RDD val x = sc.parallelize(Array(1,2,3)) val y = x.map(n=>n*n) val z = x.zip(y) println(z.collect().mkString(", ")) z:
  • 89. x: y: getNumPartitions() Return the number of partitions in RDD [[1], [2, 3]] 2 GETNUMPARTITIONS A B 2 x = sc.parallelize([1,2,3], 2) y = x.getNumPartitions() print(x.glom().collect()) print(y) val x = sc.parallelize(Array(1,2,3), 2) val y = x.partitions.size val xOut = x.glom().collect() println(y)
  • 91. x: y: collect() Return all items in the RDD to the driver in a single list [[1], [2, 3]] [1, 2, 3] A B x = sc.parallelize([1,2,3], 2) y = x.collect() print(x.glom().collect()) print(y) val x = sc.parallelize(Array(1,2,3), 2) val y = x.collect() val xOut = x.glom().collect() println(y) COLLECT [ ]
  • 95. x: y: reduce(f) Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and returning a result to the driver [1, 2, 3, 4] 10 x = sc.parallelize([1,2,3,4]) y = x.reduce(lambda a,b: a+b) print(x.collect()) print(y) val x = sc.parallelize(Array(1,2,3,4)) val y = x.reduce((a,b) => a+b) println(x.collect.mkString(", ")) println(y) REDUCE ****** * ** ***
  • 102. AGGREGATE 3 2 1 4 ([2], 2) ([1], 1)([1,2], 3) ([3], 3) ([], 0) ([4], 4)
  • 103. AGGREGATE 3 2 1 4 ([2], 2) ([1], 1)([1,2], 3) ([4], 4) ([3], 3)([3,4], 7)
  • 104. AGGREGATE 3 2 1 4 ([2], 2) ([1], 1)([1,2], 3) ([4], 4) ([3], 3)([3,4], 7)
  • 107. aggregate(identity, seqOp, combOp) Aggregate all the elements of the RDD by: - applying a user function to combine elements with user-supplied objects, - then combining those user-defined results via a second user function, - and finally returning a result to the driver. seqOp = lambda data, item: (data[0] + [item], data[1] + item) combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1]) x = sc.parallelize([1,2,3,4]) y = x.aggregate(([], 0), seqOp, combOp) print(y) AGGREGATE *** **** ** *** [( ),#] x: y: [1, 2, 3, 4] ([1, 2, 3, 4], 10)
  • 108. def seqOp = (data:(Array[Int], Int), item:Int) => (data._1 :+ item, data._2 + item) def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) => (d1._1.union(d2._1), d1._2 + d2._2) val x = sc.parallelize(Array(1,2,3,4)) val y = x.aggregate((Array[Int](), 0))(seqOp, combOp) println(y) x: y: aggregate(identity, seqOp, combOp) Aggregate all the elements of the RDD by: - applying a user function to combine elements with user-supplied objects, - then combining those user-defined results via a second user function, - and finally returning a result to the driver. [1, 2, 3, 4] (Array(3, 1, 2, 4),10) AGGREGATE *** **** ** *** [( ),#]
  • 110. x: y: max() Return the maximum item in the RDD [2, 4, 1] 4 MAX 4 x = sc.parallelize([2,4,1]) y = x.max() print(x.collect()) print(y) val x = sc.parallelize(Array(2,4,1)) val y = x.max println(x.collect().mkString(", ")) println(y) 2 1 4 max
  • 112. x: y: sum() Return the sum of the items in the RDD [2, 4, 1] 7 SUM 7 x = sc.parallelize([2,4,1]) y = x.sum() print(x.collect()) print(y) val x = sc.parallelize(Array(2,4,1)) val y = x.sum println(x.collect().mkString(", ")) println(y) 2 1 4 Σ
  • 114. x: y: mean() Return the mean of the items in the RDD [2, 4, 1] 2.3333333 MEAN 2.3333333 x = sc.parallelize([2,4,1]) y = x.mean() print(x.collect()) print(y) val x = sc.parallelize(Array(2,4,1)) val y = x.mean println(x.collect().mkString(", ")) println(y) 2 1 4 x
  • 116. x: y: stdev() Return the standard deviation of the items in the RDD [2, 4, 1] 1.2472191 STDEV 1.2472191 x = sc.parallelize([2,4,1]) y = x.stdev() print(x.collect()) print(y) val x = sc.parallelize(Array(2,4,1)) val y = x.stdev println(x.collect().mkString(", ")) println(y) 2 1 4 σ
  • 117. COUNTBYKEY {'A': 1, 'J': 2, 'F': 1} J “John” “Anna”A “Fred”F J “James”
  • 118. x: y: countByKey() Return a map of keys and counts of their occurrences in the RDD [('J', 'James'), ('F','Fred'), ('A','Anna'), ('J','John')] {'A': 1, 'J': 2, 'F': 1} COUNTBYKEY {'A': 1, 'J': 2, 'F': 1} x = sc.parallelize([('J', 'James'), ('F','Fred'), ('A','Anna'), ('J','John')]) y = x.countByKey() print(y) val x = sc.parallelize(Array(('J',"James"),('F',"Fred"), ('A',"Anna"),('J',"John"))) val y = x.countByKey() println(y)
  • 120. x: y: saveAsTextFile(path, compressionCodecClass=None) Save the RDD to the filesystem indicated in the path [2, 4, 1] [u'2', u'4', u'1'] SAVEASTEXTFILE dbutils.fs.rm("/temp/demo", True) x = sc.parallelize([2,4,1]) x.saveAsTextFile("/temp/demo") y = sc.textFile("/temp/demo") print(y.collect()) dbutils.fs.rm("/temp/demo", true) val x = sc.parallelize(Array(2,4,1)) x.saveAsTextFile("/temp/demo") val y = sc.textFile("/temp/demo") println(y.collect().mkString(", "))
  • 121. LAB
  • 122. Q&A

Editor's Notes

  1. To-do:
  2. Apache Spark is an open-source cluster computing framework originally developed in 2009 in the UC Berkeley AMPLab. Spark makes it easy to get value from big data. It can read from any data source (relational, NoSQL, file systems, etc.) and offers one unified API for batch analytics, SQL queries, real-time analysis, machine learning and graph processing. Developers no longer have to learn separate processing engines for different tasks. Spark is 10x-100x faster than older systems such as Hadoop MapReduce with proven scalability (the largest Spark cluster has over 8,000 nodes). Spark had over 450 contributors in 2014, making it the most active open source project in the Big Data space. The core Spark team is now at Databricks. With over 50 employees and $47 million in funding, Databricks is the main steward of the Spark project. As a Trainer for Databricks and Spark, you will work closely with this team to understand Spark’s internals and then teach it to engineers around the world.
  3. TODO: Sameer create 2 additional slides before this explaining Transformations and Actions
  4. TODO: Sameer create 2 additional slides before this explaining Transformations and Actions
  5. TODO: Sameer create 2 additional slides before this explaining Transformations and Actions
  6. Return a new RDD by applying a function to each element of this RDD.
  7. Idiomatic Scala might write this filter as _%2==1 using _ to stand for the input param
  8. Order in output not guaranteed Point out that most wide transformations support an optional argument with the number of partitions in the output RDD
  9. Point out that results are not necessarily ordered, and that order is not guaranteed consistent between executions
  10. To-do: swap
  11. Order in output not guaranteed Point out that most wide transformations support an optional argument with the number of partitions in the output RDD
  12. Let's look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey… While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
  13. Look at the diagram below to understand what happens with reduceByKey. Notice how pairs on the same machine with the same key are combined (by using the lambda function passed into reduceByKey) before the data is shuffled. Then the lambda function is called again to reduce all the values from each partition to produce one final result.
  14. On the other hand, when calling groupByKey - all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network. To determine which machine to shuffle a pair to, Spark calls a partitioning function on the key of the pair. Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes out the data to disk one key at a time - so if a single key has more key-value pairs than can fit in memory, an out of memory exception occurs. This will be more gracefully handled in a later release of Spark so the job can still proceed, but should still be avoided - when Spark needs to spill to disk, performance is severely impacted. You can imagine that for a much larger dataset, the difference in the amount of data you are shuffling becomes even more exaggerated between reduceByKey and groupByKey. Here are more functions to prefer over groupByKey: - combineByKey can be used when you are combining elements but your return type differs from your input value type. - foldByKey merges the values for each key using an associative function and a neutral "zero value".
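Added annotation (not part of the original notes): an illustrative PySpark sketch of the two alternatives mentioned above, assuming an active SparkContext sc.
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    # foldByKey: like reduceByKey, but with an explicit zero value
    print(pairs.foldByKey(0, lambda a, b: a + b).collect())        # [('a', 3), ('b', 3)] (order may vary)
    # combineByKey: the combined type (sum, count) differs from the input value type
    sums_counts = pairs.combineByKey(lambda v: (v, 1),
                                     lambda acc, v: (acc[0] + v, acc[1] + 1),
                                     lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]))
    print(sums_counts.mapValues(lambda t: t[0] / t[1]).collect())  # per-key mean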
  15. Point out that the result of the mapping is an iterator (with unspecified size), so m items in the original partition can map to n items in the resulting new partition. So the new partition can have more items, the same number of items, or fewer items than the original.
  16. Point out that the result of the mapping is an iterator (with unspecified size), so m items in the original partition can map to n items in the resulting new partition. So the new partition can have more items, the same number of items, or fewer items than the original.
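Added annotation (not part of the original notes): a small PySpark sketch of that point, assuming SparkContext sc. The function sees a whole partition as an iterator and may emit fewer, the same, or more items than it received.
    x = sc.parallelize([1, 2, 3, 4], 2)            # partitions: [1, 2] and [3, 4]
    def keep_evens(it):
        return (v for v in it if v % 2 == 0)       # emits fewer items than it consumes
    print(x.mapPartitions(keep_evens).glom().collect())   # [[2], [4]]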
  17. To-do: Is it correct to remove one of the items from Partition A in RDD x’s diagram here?
  18. To-do: Is it correct to remove one of the items from Partition A in RDD x’s diagram here?
  19. To-do: swap
  20. Important to point out that the sample size itself will vary, and also that the Python and Scala random number generators will produce different output from the same seed, so that, e.g., calling val y = x.sample(false, 0.4, 42) in Scala will not produce the same output shown here on the right (which is the Python output). This is a good one to demo, to show varying results.
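Added annotation (not part of the original notes): a quick way to demo this, assuming SparkContext sc. With an explicit seed, repeated calls on the same RDD should return the same elements within one language/runtime, while omitting the seed gives varying samples.
    x = sc.parallelize(range(10))
    print(x.sample(False, 0.3, seed=7).collect())
    print(x.sample(False, 0.3, seed=7).collect())   # same elements as the previous line
    print(x.sample(False, 0.3).collect())           # no seed: likely different each run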
  21. Point out: Duplicates are not removed. They can be removed with the more expensive .distinct call, presented shortly.
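Added annotation (not part of the original notes): a minimal PySpark sketch of that point, assuming SparkContext sc.
    a = sc.parallelize([1, 2, 3])
    b = sc.parallelize([3, 4])
    print(a.union(b).collect())             # [1, 2, 3, 3, 4] -- duplicates kept
    print(a.union(b).distinct().collect())  # duplicates removed, order not guaranteed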
  22. To-do: swap
  23. To-do: swap
  24. To-do: swap
  25. To-do: swap
  26. Point out that this is an inner join, but that duplicate keys in the respective source RDDs result in a Cartesian join of those key-value pairs.
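Added annotation (not part of the original notes): an illustrative PySpark sketch of the duplicate-key case, assuming SparkContext sc. Each matching key produces every combination of its values from the two sides.
    x = sc.parallelize([("a", 1), ("a", 2)])
    y = sc.parallelize([("a", 10), ("a", 20)])
    print(x.join(y).collect())   # four pairs for 'a': (1,10), (1,20), (2,10), (2,20), in some order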
  27. Point out that this is an inner join, but that duplicate keys in the respective source RDDs result in a Cartesian join of those key-value pairs.
  28. Point out… In some cases, running a shuffle is desirable if the coalesced partitions would be too unevenly distributed to get good parallelism. To increase the number of partitions (and trigger a shuffle), call repartition. In the Scala API (but not the Python API), you can call coalesce with a larger number of partitions and shuffle=true to achieve a repartition.
  29. Point out… In some cases, running a shuffle is desirable if the coalesced partitions would be too unevenly distributed to get good parallelism. To increase the number of partitions (and trigger a shuffle), call repartition. In the Scala API (but not the Python API), you can call coalesce with a larger number of partitions and shuffle=true to achieve a repartition.
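Added annotation (not part of the original notes): a minimal PySpark sketch, assuming SparkContext sc, showing repartition increasing the partition count (with a shuffle) and coalesce reducing it (with no shuffle by default).
    x = sc.parallelize(range(10), 2)
    print(x.repartition(4).getNumPartitions())   # 4 -- shuffles the data
    print(x.coalesce(1).getNumPartitions())      # 1 -- merges partitions, no shuffle by default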
  30. To-do: swap
  31. To-do: swap
  32. To-do: swap
  33. To-do: swap
  34. Point out that the two source RDDs should have the same partitions and the same counts within partitions. Typically, one is a mapped version of the other.
  35. Explain: function must be commutative and associative, and collection must be closed under the function, allowing maximal parallelism. Refer to fold and aggregate for non-commutative functions or functions whose result is a different type.
  36. TODO: Sameer to find out exact execution process – in particular, when does the final combine occur on the driver, versus in the cluster. Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value.” The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into a U and one operation for merging two Us.
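Added annotation (not part of the original notes): an illustrative PySpark sketch, assuming SparkContext sc, of why aggregate is needed when the result type differs from the element type (here a (sum, count) pair used to compute a mean).
    nums = sc.parallelize([1, 2, 3, 4])
    total, count = nums.aggregate((0, 0),
                                  lambda acc, v: (acc[0] + v, acc[1] + 1),   # seqOp: fold an element into (sum, count)
                                  lambda a, b: (a[0] + b[0], a[1] + b[1]))   # combOp: merge per-partition results
    print(total / count)   # 2.5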
  37. The state-of-the-art in Big Data is "simple things complex, complex things impossible." We think the future should be "simple things easy, and complex things possible."