3. Spark Pair RDD
Pair RDDs are RDDs of key/value pairs.
Key/value RDDs are commonly used to perform aggregations.
Key/value RDDs expose new operations, such as counting reviews for each
product, grouping together data with the same key, and grouping together
two different RDDs.
4. Creating Pair RDD (Scala)
In Scala, for the functions on keyed data to be available, we also need to
return tuples.
Creating a pair RDD using the first word as the key in Scala:
val pairs = lines.map(x => (x.split(" ")(0), x))
5. Creating Pair RDD
Pair RDDs can be created by running a map() function that returns
key/value pairs.
The procedure to build key/value RDDs differs by language.
In Python, for the functions on keyed data to work, we need to
return an RDD composed of tuples.
Creating a pair RDD using the first word as the key in Python:
pairs = lines.map(lambda x: (x.split(" ")[0], x))
6. Creating Pair RDD (Java)
Java doesn’t have a built-in tuple type,
so users of Spark’s Java API create tuples using the scala.Tuple2 class.
Java users construct a new tuple by writing new Tuple2(elem1, elem2)
and access its elements with the _1() and _2() methods.
Java users also need to call special versions of Spark’s functions when
creating pair RDDs.
For instance, the mapToPair() function should be used in place of the
basic map() function.
7. Creating Pair RDD (Java)
Creating a pair RDD using the first word as the key in Java:
PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2<>(x.split(" ")[0], x);
    }
  };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
8. PairRDD Transformations
Most Spark operations work on RDDs containing any type of objects,
but a few special operations are only available on RDDs of key/value pairs.
The most common ones are distributed "shuffle" operations, such as
grouping or aggregating the elements by a key.
In Scala, these operations are automatically available on RDDs
containing Tuple2 objects (created simply by writing (a, b)).
The key/value pair operations are defined in the PairRDDFunctions
class, which automatically wraps around an RDD of tuples.
9. PairRDD Transformations (Aggregation)
When datasets are described in terms of key/value pairs, it is
common to want to aggregate statistics across all elements with the
same key.
Spark has a set of operations that combine values that have the
same key.
These operations return RDDs and thus are transformations rather
than actions, e.g. reduceByKey(), foldByKey(), combineByKey() (see the
sketch below).
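As a minimal sketch (assuming a SparkContext named sc and a small, hypothetical marks RDD), reduceByKey and foldByKey can be used like this:

// sum the marks per subject
val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66)))
val totals = marks.reduceByKey(_ + _)    // ("maths", 110), ("physics", 66)
// foldByKey does the same, but takes an explicit zero value
val totals2 = marks.foldByKey(0)(_ + _)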
10. PairRDD Transformations (Grouping)
With keyed data, a common use case is grouping our data sets
with respect to a predefined key:
for example, viewing all of a customer's orders together.
If our data is already keyed in the way we want,
groupByKey() will group our data using the key in our RDD.
On an RDD consisting of keys of type K and values of type V, we get
back an RDD of type [K, Iterable[V]] (sketched below).
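A minimal sketch of groupByKey, assuming a SparkContext named sc:

val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66)))
val grouped = marks.groupByKey()   // RDD[(String, Iterable[Int])]
grouped.collect()                  // e.g. ("maths", [50, 60]), ("physics", [66])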
11. PairRDD Transformations (Joins)
The most useful and effective operations we get with keyed data
come from using it together with other keyed data.
Joining datasets together is probably one of the most common types
of operations you will perform on a pair RDD.
Spark supports the following types of joins (sketched below):
Inner join
leftOuterJoin
rightOuterJoin
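A minimal sketch of the join variants, assuming a SparkContext named sc and two hypothetical pair RDDs:

val orders = sc.parallelize(Seq((1, "laptop"), (2, "phone")))
val names  = sc.parallelize(Seq((1, "Alice"), (3, "Bob")))
orders.join(names).collect()            // inner join: only key 1 appears in both
orders.leftOuterJoin(names).collect()   // keys 1 and 2; a missing right side becomes None
orders.rightOuterJoin(names).collect()  // keys 1 and 3; a missing left side becomes None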
12. PairRDD Transformations (Sort)
We can sort an RDD of key/value pairs if there is an ordering
defined on the keys.
Once we have sorted our data, any subsequent call on the
sorted data to collect() or save() will return an ordered dataset (sketched below).
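A minimal sketch of sortByKey, assuming a SparkContext named sc:

val marks = sc.parallelize(Seq(("physics", 66), ("maths", 50), ("english", 65)))
marks.sortByKey().collect()                    // sorted by subject name, ascending
marks.sortByKey(ascending = false).collect()   // descending order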
13. PairRDD Transformations
Transformation: Description

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending argument.
14. PairRDD Transformations
Transformation: Description

combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]:
1st argument: createCombiner is called when a key (in an RDD element) is found for the first time in a given partition. It creates the initial value of the accumulator for that key.
2nd argument: mergeValue is called when the key already has an accumulator in that partition. It merges the new value into the accumulator.
3rd argument: mergeCombiners is called when more than one partition has an accumulator for the same key. It merges the accumulators into one.

mapValues(func): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where func has been applied to each value, leaving the keys unchanged. The partitioning of the original RDD is retained.

keys / values: keys returns an RDD of just the keys of the pair RDD; values returns an RDD of just the values.
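A minimal sketch of combineByKey computing a per-key average, assuming a SparkContext named sc:

val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66), ("physics", 61)))
val sumCount = marks.combineByKey(
  (v: Int) => (v, 1),                                            // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: fold another value into the accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners: merge accumulators from different partitions
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }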
15. PairRDD Transformations
Transformation: Description

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. As in groupByKey, the number of reduce tasks is configurable through an optional second argument.

join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
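A minimal sketch of aggregateByKey and cogroup, assuming a SparkContext named sc:

val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66)))
// aggregateByKey: collect the marks per subject into a Set (output type differs from the Int input)
val perKeySets = marks.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,   // seqOp: add a value to the partition-local set
  (s1, s2) => s1 ++ s2)  // combOp: merge sets coming from different partitions
// cogroup: group two pair RDDs by key
val teachers = sc.parallelize(Seq(("maths", "Smith"), ("physics", "Jones")))
val grouped = marks.cogroup(teachers)   // RDD[(String, (Iterable[Int], Iterable[String]))]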
16. PairRDD Actions
countByKey() : Count the number of elements for each key.
collectAsMap() : Collect the result as a map to provide easy lookup.
lookup(key) : Return all values associated with the provided key (sketched below).
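A minimal sketch of these pair RDD actions, assuming a SparkContext named sc:

val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66)))
marks.countByKey()     // Map(maths -> 2, physics -> 1)
marks.collectAsMap()   // Map with one value per key (for duplicate keys only one value is kept)
marks.lookup("maths")  // Seq(50, 60)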
17. Demo of Pair RDD Transformations
• val rddnums = sc.parallelize(List("Hadoop Spark Scala Python", "DataScience Python C# Java", "Hadoop Scala Python", "Spark Scala"))
• val inputmarkrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65), ("physics", 66), ("physics", 61), ("physics", 87)))
18. Spark Create Pair RDD
Create a pair RDD from strings separated by spaces.
Use flatMap to split each line into words, then map each word to a
key/value pair (sketched below).
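A minimal sketch of this step (the deck does not reproduce the demo code), using the rddnums RDD defined on the demo slide and assuming each word is paired with a count of 1:

val words = rddnums.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))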
19. Spark Group By Key
Use groupByKey to output the count of each word (sketched below).
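Continuing the sketch above, a word count via groupByKey groups all the 1s for a word and then takes the size of each group:

val countsViaGroup = wordPairs.groupByKey().mapValues(ones => ones.size)
countsViaGroup.collect()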
20. Spark Reduce By Key
Use reduceByKey to output the count of each word (sketched below).
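Continuing the same sketch, a word count via reduceByKey sums the 1s per word, combining values on each partition before the shuffle:

val countsViaReduce = wordPairs.reduceByKey(_ + _)
countsViaReduce.collect()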
22. Diff between Group/Reduce By Key
reduceByKey combines output with a common key on each partition
before shuffling the data.
The reduce function is then called again to merge the values from each
partition into one result per key.
On the other hand, when calling groupByKey all the key/value pairs
are shuffled around. This is a lot of unnecessary data being
transferred over the network.
To determine which machine to shuffle a pair to, Spark calls a
partitioning function on the key of the pair. Spark spills data to disk
when more data is shuffled onto a single executor machine
than can fit in memory.
23. Diff between Group/Reduce By Key
However, it flushes the data to disk one key at a time, so if a
single key has more key/value pairs than can fit in memory, an
out-of-memory exception occurs.
This will be handled more gracefully in a later release of Spark so the
job can still proceed, but it should still be avoided: when Spark needs
to spill to disk, performance is severely impacted.
31. Spark Persistence
One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations.
When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that
dataset (or datasets derived from it).
This allows future actions to be much faster (often by more than 10x)
Caching is a key tool for iterative algorithms and fast interactive use
32. Spark Persistence
RDD.cache is also a lazy operation.
If you run textFile.count the first time, the file will be loaded, cached,
and counted.
If you call textFile.count a second time, the operation will use the
cache.
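A minimal sketch of the behaviour described above, assuming a SparkContext named sc and a hypothetical file data.txt:

val textFile = sc.textFile("data.txt")   // hypothetical path
textFile.cache()    // lazy: nothing is cached yet
textFile.count()    // first action: loads the file, caches it, and counts
textFile.count()    // second action: served from the cache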
33. Spark Persistence
You can mark an RDD to be persisted using the persist() or cache()
methods on it.
The first time it is computed in an action, it will be kept in memory on
the nodes.
Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it.
34. Memory vs Memory + Disk
Cache RDD in memory
Cache RDD in memory + disk
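A minimal sketch of persisting at these two levels, assuming a SparkContext named sc (an RDD can only be assigned one storage level, so the alternative is shown commented out):

import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY)        // equivalent to rdd.cache()
// rdd.persist(StorageLevel.MEMORY_AND_DISK) // would spill partitions that don't fit in memory to disk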
35. Spark Persistence
Each persisted RDD can be stored using a different storage level,
allowing you, for example, to persist the dataset on disk, persist it in
memory but as serialized Java objects (to save space), or replicate it
across nodes.
These levels are set by passing a StorageLevel object (Scala, Java,
Python) to persist().
The cache() method is a shorthand for using the default storage level,
which is StorageLevel.MEMORY_ONLY (store deserialized objects in
memory).
37. Spark Persistence
Storage Level: Explanation

MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit
in memory, some partitions will not be cached and will be recomputed
on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit
in memory, store the partitions that don't fit on disk, and read them
from there when they're needed.

MEMORY_ONLY_SER (Java and Scala): Store RDD as serialized Java objects (one byte array per partition). This is
generally more space-efficient than deserialized objects, especially
when using a fast serializer, but more CPU-intensive to read.
38. Spark Persistence
Storage Level: Explanation

MEMORY_AND_DISK_SER (Java and Scala): Similar to MEMORY_ONLY_SER, but spill partitions that
don't fit in memory to disk instead of recomputing them on
the fly each time they're needed.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on
two cluster nodes.
39. Spark Persistence
In Python, stored objects will always be serialized with the Pickle
library, so it does not matter whether you choose a serialized level.
The available storage levels in Python include MEMORY_ONLY,
MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2,
DISK_ONLY, and DISK_ONLY_2