Spark Pair RDD and Persistence
Spark and Scala
Spark Pair RDD and Persistence
 Spark Pair RDDs
 Creating Pair RDDs
 Pair RDD Reduce Transformations
 groupByKey
 reduceByKey
 aggregateByKey
 combineByKey
 Pair RDD Joins Transformations
 Pair RDD Sort by Key
 Pair RDD Actions
 countByKey
 collectAsMap
 lookup
 RDD Cache and RDD Persistence
 RDD Storage Levels
 Q & A
Spark Pair RDD
 Pair RDDs are RDDs of key/value pairs
 Key/value RDDs are commonly used to perform aggregations
 Key/value RDDs expose operations that are not available on ordinary RDDs
 Examples: counting reviews for each product, grouping together data with the same key, or joining two different RDDs
Creating Pair RDD (Scala)
 In Scala, for the functions on keyed data to be available, we simply need to return tuples from our map() function
 Creating a pair RDD using the first word of each line as the key in Scala:
 val pairs = lines.map(x => (x.split(" ")(0), x))
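A minimal sketch of what this produces, assuming a SparkContext named sc and a small made-up lines RDD (neither is defined on the slide):

val lines = sc.parallelize(Seq("spark is fast", "scala is concise", "spark runs on the jvm"))
val pairs = lines.map(x => (x.split(" ")(0), x))   // key each line by its first word
pairs.collect().foreach(println)
// (spark,spark is fast)
// (scala,scala is concise)
// (spark,spark runs on the jvm)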
Creating Pair RDD
 Pair RDDs can be created by running a map() function that returns key/value pairs
 The exact procedure for building key-value RDDs differs by language
 In Python, for the functions on keyed data to work, we need to return an RDD composed of tuples
 Creating a pair RDD using the first word of each line as the key in Python:
 pairs = lines.map(lambda x: (x.split(" ")[0], x))
Creating Pair RDD (Java)
 Java doesn’t have a built-in tuple type,
 so users of Spark’s Java API create tuples using the scala.Tuple2 class
 Java users can construct a new tuple by writing new Tuple2(elem1, elem2)
 and access its elements with the _1() and _2() methods
 Java users also need to call special versions of Spark’s functions when creating pair RDDs
 For instance, the mapToPair() function should be used in place of the basic map() function
Creating Pair RDD (Java)
 Creating a pair RDD using the first word of each line as the key in Java:
 PairFunction<String, String, String> keyData =
new PairFunction<String, String, String>() {
public Tuple2<String, String> call(String x) {
return new Tuple2<>(x.split(" ")[0], x);
}
};
 JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
PairRDD Transformations
 Most Spark operations work on RDDs containing any type of object, but a few special operations are only available on RDDs of key-value pairs
 The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key
 In Scala, these operations are automatically available on RDDs containing Tuple2 objects, which are created simply by writing (a, b)
 The key-value pair operations live in the PairRDDFunctions class, which automatically wraps around an RDD of tuples
PairRDD Transformations (Aggregation)
 When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements that share the same key
 Spark has a set of operations that combine values that share a key
 These operations return RDDs and thus are transformations rather than actions, e.g. reduceByKey(), foldByKey(), combineByKey() (see the sketch below)
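A minimal sketch of per-key aggregation with reduceByKey and foldByKey, using made-up data and assuming a SparkContext named sc:

val sales = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
val totals = sales.reduceByKey(_ + _)      // sum the values that share a key
totals.collect().foreach(println)
// (apple,5)
// (banana,1)
val totals2 = sales.foldByKey(0)(_ + _)    // same result, with an explicit zero value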
PairRDD Transformations (Grouping)
 With keyed data, a common use case is grouping a data set by a predefined key
 For example, viewing all of a customer’s orders together
 If our data is already keyed the way we want, groupByKey() will group the values of our RDD by key
 On an RDD consisting of keys of type K and values of type V, we get back an RDD of type (K, Iterable[V]) (see the sketch below)
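A minimal sketch of groupByKey on the same kind of made-up data:

val sales = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
val grouped = sales.groupByKey()                       // RDD[(String, Iterable[Int])]
grouped.mapValues(_.toList).collect().foreach(println)
// (apple,List(2, 3))
// (banana,List(1))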
PairRDD Transformations (Joins)
 Some of the most useful operations on keyed data come from using it together with other keyed data
 Joining datasets together is probably one of the most common operations on a pair RDD
 The following join types are supported (see the sketch below):
 join (inner join)
 leftOuterJoin
 rightOuterJoin
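A minimal sketch of the three join types on two made-up pair RDDs:

val stock  = sc.parallelize(Seq(("apple", 10), ("banana", 5)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("cherry", 2.0)))
stock.join(prices).collect()            // Array((apple,(10,0.5)))                             inner join
stock.leftOuterJoin(prices).collect()   // Array((apple,(10,Some(0.5))), (banana,(5,None)))    keep all left keys
stock.rightOuterJoin(prices).collect()  // Array((apple,(Some(10),0.5)), (cherry,(None,2.0)))  keep all right keys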
PairRDD Transformations (Sort)
 We can sort a pair RDD as long as there is an ordering defined on the keys
 Once we have sorted our data, any subsequent call to collect() or save() on the sorted RDD will return an ordered dataset (see the sketch below)
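A minimal sketch of sortByKey on made-up data:

val scores = sc.parallelize(Seq(("charlie", 3), ("alice", 1), ("bob", 2)))
scores.sortByKey().collect()                    // Array((alice,1), (bob,2), (charlie,3))
scores.sortByKey(ascending = false).collect()   // same pairs, descending by key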
PairRDD Transformations
groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
PairRDD Transformations
combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
1st argument: createCombiner is called when a key (in an RDD element) is found for the first time in a given partition. It creates the initial value of the accumulator for that key.
2nd argument: mergeValue is called when the key already has an accumulator in that partition.
3rd argument: mergeCombiners is called when more than one partition has an accumulator for the same key.
(A sketch of combineByKey follows this table.)

mapValues(func)
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where each value is passed through the function func while the keys are left unchanged.

keys / values
keys returns an RDD containing only the keys of the pair RDD, and values returns an RDD containing only the values.
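A minimal sketch of combineByKey, computing a per-key average over made-up marks data (similar to the demo RDD later in the deck):

val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66), ("physics", 61)))
val avg = marks.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge accumulators across partitions
).mapValues { case (sum, count) => sum.toDouble / count }
avg.collect().foreach(println)
// (maths,55.0)
// (physics,63.5)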
PairRDD Transformations
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. (A sketch follows this table.)

join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
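A minimal sketch of aggregateByKey, computing the maximum and the count of values per key; the result type (Int, Int) differs from the input value type Int, and the data is made up:

val marks = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("physics", 66)))
val maxAndCount = marks.aggregateByKey((0, 0))(
  (acc, v) => (math.max(acc._1, v), acc._2 + 1),   // seqOp: fold one value into the per-partition accumulator
  (a, b)   => (math.max(a._1, b._1), a._2 + b._2)  // combOp: merge accumulators across partitions
)
maxAndCount.collect().foreach(println)
// (maths,(60,2))
// (physics,(66,1))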
PairRDD Actions
 countByKey() : Count the number of elements for each key.
 collectAsMap() : Collect the result as a map at the driver to provide easy lookup.
 lookup(key) : Return all values associated with the provided key. (A sketch of all three follows.)
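A minimal sketch of the three actions on made-up data:

val sales = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
sales.countByKey()        // Map(apple -> 2, banana -> 1)   number of elements per key
sales.collectAsMap()      // Map(apple -> 3, banana -> 1)   duplicate keys keep only one value
sales.lookup("apple")     // Seq(2, 3)                      all values for the given key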
Demo of Pair RDD Transformations
• val rddnums = sc.parallelize(List("Hadoop Spark Scala Python", "DataScience Python C# Java", "Hadoop Scala Python", "Spark Scala"))
• val inputmarkrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65), ("physics", 66), ("physics", 61), ("physics", 87)))
Spark Create Pair RDD
 Create a pair RDD from strings whose words are separated by spaces
 Use flatMap to split the lines into words, then map each word to a key/value pair (see the sketch below)
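A minimal sketch, reusing the rddnums RDD from the demo slide (the original slide showed this step only as a screenshot):

val words = rddnums.flatMap(line => line.split(" "))   // split each line into words
val wordPairs = words.map(word => (word, 1))           // pair each word with a count of 1
wordPairs.take(5).foreach(println)
// (Hadoop,1)
// (Spark,1)
// (Scala,1) ...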
Spark Group By Key
 Use groupByKey to output the count of each word (see the sketch below)
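A minimal sketch of the word count with groupByKey, continuing from the wordPairs RDD above (the original slide showed a screenshot):

val countsViaGroup = wordPairs.groupByKey().mapValues(ones => ones.sum)
countsViaGroup.collect().foreach(println)
// (Scala,3)
// (Python,3)
// (Hadoop,2) ...
// note: every (word, 1) pair is shuffled before the sums are computed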
Spark Reduce By Key
 Use reduceByKey to output the count of each word (see the sketch below)
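A minimal sketch of the same word count with reduceByKey, again continuing from wordPairs (the original slide showed a screenshot):

val countsViaReduce = wordPairs.reduceByKey(_ + _)
countsViaReduce.collect().foreach(println)
// same result as groupByKey, but partial sums are computed on each partition
// before the shuffle, so far less data crosses the network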
Diff between Group/Reduce By Key
Diff between Group/Reduce By Key
 reduceByKey combines outputs that share a key on each partition before shuffling the data.
 The reduce function is then called again on each partition's partial results to produce one final result per key.
 When calling groupByKey, on the other hand, all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
 To determine which machine to shuffle a pair to, Spark calls a partitioning function on the key of the pair. Spark spills data to disk when more data is shuffled onto a single executor machine than can fit in memory.
Diff between Group/Reduce By Key
 However, Spark flushes data to disk one key at a time, so if a single key has more key-value pairs than can fit in memory, an out-of-memory exception occurs.
 This will be handled more gracefully in a later release of Spark so the job can still proceed, but it should still be avoided: when Spark needs to spill to disk, performance is severely impacted.
Aggregate By Key
Combine By Key
Spark Sort By Key
MapValues/Keys/Values
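The original slide showed this only as a screenshot; a minimal sketch of mapValues, keys and values on made-up data:

val marks = sc.parallelize(Seq(("maths", 50), ("physics", 66)))
marks.mapValues(_ + 5).collect()   // Array((maths,55), (physics,71))   values transformed, keys unchanged
marks.keys.collect()               // Array(maths, physics)
marks.values.collect()             // Array(50, 66)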
CountByKey (Action)
CollectAsMap (Action)
Spark Persistence
 One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations.
 When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that
dataset (or datasets derived from it).
 This allows future actions to be much faster (often by more than 10x)
 Caching is a key tool for iterative algorithms and fast interactive use
Spark Persistence
 RDD.cache() is itself a lazy operation: nothing is stored until an action runs.
 If you run textFile.count the first time, the file is loaded, cached, and counted.
 If you call textFile.count a second time, the operation reads from the cache (see the sketch below).
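A minimal sketch of this behaviour, assuming a SparkContext named sc and an illustrative file path:

val textFile = sc.textFile("hdfs:///data/sample.txt")   // path is made up for illustration
textFile.cache()    // lazy: only marks the RDD for caching, nothing is stored yet
textFile.count()    // first action: reads the file, caches the partitions, then counts
textFile.count()    // second action: served from the in-memory cache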
Spark Persistence
 You can mark an RDD to be persisted using the persist() or cache()
methods on it.
 The first time it is computed in an action, it will be kept in memory on
the nodes.
 Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it.
Memory vs Memory + Disk
 Cache RDD in memory
 Cache RDD in memory + disk
Spark Persistence
 Each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), or replicate it across nodes.
 These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist() (see the sketch below).
 The cache() method is shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
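A minimal sketch of choosing a storage level explicitly (the RDD and path are made up for illustration):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs.txt")
logs.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that don't fit in memory spill to disk
logs.count()                                 // first action materializes the persisted partitions
logs.unpersist()                             // later: release the cached storage when no longer needed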
RDD Cache
Spark Persistence
MEMORY_ONLY
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER (Java and Scala)
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
Spark Persistence
MEMORY_AND_DISK_SER (Java and Scala)
Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY
Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate each partition on two cluster nodes.
Spark Persistence
 In Python, stored objects will always be serialized with the Pickle
library, so it does not matter whether you choose a serialized level.
 The available storage levels in Python include MEMORY_ONLY,
MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2,
DISK_ONLY, and DISK_ONLY_2
Spark Persistence Space and CPU
Spark UnPersist
 RDD.unpersist()
Spark Cache
Spark Persist