Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming
Persistence (caching)
Advanced Spark Programming
Persistence (caching)
1. RDDs are lazily evaluated
2. An RDD and its dependencies are recomputed on each action
3. We may wish to use the same RDD multiple times
4. To avoid re-computing, we can persist the RDD (a minimal sketch follows)
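A minimal sketch of the idea (the RDD name and file path are illustrative, not from the slides): without persist(), every action re-runs the whole lineage; after persist(), later actions reuse the cached partitions.

val lines  = sc.textFile("hdfs:///some/logs")   // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))

errors.count()    // first action: computes the full lineage
errors.count()    // not persisted yet, so the lineage is recomputed

errors.persist()  // equivalent to cache(): StorageLevel.MEMORY_ONLY
errors.count()    // computes once more and caches the partitions
errors.take(10)   // served from the cache, no recomputation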
Advanced Spark Programming
Persistence (caching)
1. If a node that has data persisted on it fails, Spark will
recompute the lost partitions of the data when needed.
2. We can also replicate our data on multiple nodes if we want to be able to handle node failure without slowdown (a short sketch follows).
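A short sketch of requesting replication (the RDD here is hypothetical): the built-in *_2 storage levels keep each cached partition on two nodes, so a single node failure does not force recomputation.

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)       // hypothetical RDD
rdd.persist(StorageLevel.MEMORY_ONLY_2)   // replication factor 2
rdd.count()                               // caches each partition on two nodes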
Advanced Spark Programming
Persistence (caching) - Example
Advanced Spark Programming
Persistence (caching) - Example
var nums = sc.parallelize((1 to 100000), 50)

def mysum(itr: Iterator[Int]): Iterator[Int] = {
  Array(itr.sum).toIterator
}

var partitions = nums.mapPartitions(mysum)

def incrByOne(x: Int): Int = {
  x + 1
}

var partitions1 = partitions.map(incrByOne)
partitions1.collect()
// say, partitions1 is going to be used very frequently
Advanced Spark Programming
Persistence (caching) - Example
persist()
partitions1.persist()
res21: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:29
partitions1.getStorageLevel
res27: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
Advanced Spark Programming
Persistence (caching) - Example
persist()
// To unpersist an RDD - before re-persisting we need to unpersist
partitions1.unpersist()

import org.apache.spark.storage.StorageLevel

// This persists our RDD into memory and disk
partitions1.persist(StorageLevel.MEMORY_AND_DISK)

partitions1.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(true, true, false, true, 1)

StorageLevel.MEMORY_AND_DISK
res7: org.apache.spark.storage.StorageLevel = StorageLevel(true, true, false, true, 1)
Advanced Spark Programming
Persistence - Understanding Storage Level
rdd.persist(storageLevel)

// StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
var mysl = StorageLevel(true, true, false, true, 1)
partitions1.persist(mysl)
Advanced Spark Programming
Persistence - StorageLevel Shortcuts
partitions1.persist(StorageLevel.MEMORY_ONLY)
Advanced Spark Programming
Persistence (caching) - StorageLevel
partitions1.persist(StorageLevel.MEMORY_ONLY)
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
Advanced Spark Programming
Persistence (caching) - StorageLevel
partitions1.persist(StorageLevel.MEMORY_AND_DISK)
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
Advanced Spark Programming
Persistence (caching) - StorageLevel
partitions1.persist(StorageLevel.MEMORY_ONLY_SER)
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
Advanced Spark Programming
Persistence (caching) - StorageLevel
partitions1.persist(StorageLevel.MEMORY_AND_DISK_SER)
Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
Advanced Spark Programming
Persistence (caching) - StorageLevel
partitions1.persist(StorageLevel.DISK_ONLY)
Store the RDD partitions only on disk.
Advanced Spark Programming
Persistence (caching) - StorageLevel
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate each partition on two cluster nodes (replication = 2).
Advanced Spark Programming
Persistence (caching) - StorageLevel
OFF_HEAP (experimental)
Store RDD in serialized format in Tachyon.
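To see what each shortcut stands for, here is a quick sketch (the exact toString format of StorageLevel varies across Spark versions) that prints the preset flags in the same (useDisk, useMemory, useOffHeap, deserialized, replication) order used earlier:

import org.apache.spark.storage.StorageLevel

Seq(
  "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
  "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
  "MEMORY_ONLY_SER"     -> StorageLevel.MEMORY_ONLY_SER,
  "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
  "DISK_ONLY"           -> StorageLevel.DISK_ONLY,
  "MEMORY_ONLY_2"       -> StorageLevel.MEMORY_ONLY_2
).foreach { case (name, level) => println(name + " -> " + level) }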
Advanced Spark Programming
Which Storage Level to Choose?
● If the RDD fits comfortably in memory, use MEMORY_ONLY
● If not, try MEMORY_ONLY_SER
● Don't persist to disk unless the computation is really expensive
● For fast fault recovery, use a replicated storage level
Advanced Spark Programming
Data Partitioning
● Lay out data to minimize network transfer
● Not helpful if you need to scan the dataset only once
● Helpful when a dataset is reused multiple times in key-oriented operations
● Available for key-value (pair) RDDs
● Causes the system to group elements based on their key
● Spark does not give explicit control over which worker node gets which key
● It lets the program control/ensure which set of keys will appear together
○ based on a hash of the key, for example
○ or by range-sorting the keys
(A minimal sketch follows this list.)
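A minimal sketch of what partitioning a pair RDD looks like (the RDD and its contents are illustrative): partitionBy attaches a partitioner, which later key-oriented operations can exploit.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // hypothetical pair RDD
println(pairs.partitioner)                                      // None: no partitioner yet

val byKey = pairs.partitionBy(new HashPartitioner(4)).persist()
println(byKey.partitioner)                                      // Some(HashPartitioner): keys co-located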
Advanced Spark Programming
Data Partitioning - Example - Subscriptions
Objective:
Count how many users visited a link that was not one of their subscribed topics.
UserData (subscriptions): pairs of (UserID, UserSubscriptionTopicsInfo)
Events: pairs of (UserID, LinkInfo)
Advanced Spark Programming
Data Partitioning - Example
The code will run fine as is, but it will be inefficient: it causes a lot of network transfer.
Periodically, we combine UserData with the last five minutes of events (a much smaller dataset).
Advanced Spark Programming
Data Partitioning - Example
val sc = new SparkContext(...)
val userData = sc
  .sequenceFile[UserID, UserInfo]("hdfs://...")
  // .partitionBy(new HashPartitioner(100))
  .persist()

// Function called periodically to process a logfile of events in the past 5 minutes;
// we assume that this is a SequenceFile containing (UserID, LinkInfo) pairs.
def processNewLogs(logFileName: String) {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)  // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) => !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}
Advanced Spark Programming
Data Partitioning - Example
import org.apache.spark.HashPartitioner
val sc = new SparkContext(...)
val basedata = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
var userData = basedata.partitionBy(new HashPartitioner(8)).persist()
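With userData hash-partitioned and persisted, only the small events RDD is shuffled during the join. A sketch of the periodic function under that layout (reusing the UserID, UserInfo, and LinkInfo types assumed earlier):

def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  // userData already carries a HashPartitioner, so Spark ships only the
  // small events RDD to the nodes holding the matching userData partitions.
  val offTopicVisits = userData.join(events).filter {
    case (userId, (userInfo, linkInfo)) => !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}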
Advanced Spark Programming
Data Partitioning - Partitioners
1. Reorganize the keys of a PairRDD
2. Can be applied using partitionBy()
3. Examples
a. HashPartitioner
b. RangePartitioner
Advanced Spark Programming
● Implements hash-based partitioning using Java's Object.hashCode.
● Does not work if the key is an array
Data Partitioning - HashPartitioner
Advanced Spark Programming
● Partitions sortable records by range into roughly equal ranges
● The ranges are determined by sampling
Data Partitioning - RangePartitioner
Advanced Spark Programming
Data Re-partitioning - Example
Advanced Spark Programming
Data Re-partitioning - Example
import org.apache.spark.HashPartitioner
import org.apache.spark.RangePartitioner
var rdd = sc.parallelize(1 to 10).map((_, 1))
rdd.partitionBy(new HashPartitioner(2)).glom().collect()
rdd.partitionBy(new RangePartitioner(2, rdd)).glom().collect()
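A partitioner is just a mapping from key to partition index, so you can also probe it directly (an illustrative check, independent of any RDD):

import org.apache.spark.HashPartitioner

val hp = new HashPartitioner(2)
(1 to 10).foreach { k =>
  println("key " + k + " -> partition " + hp.getPartition(k))   // hash-based assignment
}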
Advanced Spark Programming
Operations That Benefit from Partitioning
● cogroup()
● groupWith(), groupByKey(), reduceByKey(),
● join(), leftOuterJoin(), rightOuterJoin()
● combineByKey(), and lookup().
Data Partitioning
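To illustrate the benefit, here is a minimal sketch (with hypothetical data): once a pair RDD is partitioned and persisted, reduceByKey can aggregate each key within its existing partitions, and the result keeps the same partitioner.

import org.apache.spark.HashPartitioner

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
val parted = pairs.partitionBy(new HashPartitioner(2)).persist()

val counts = parted.reduceByKey(_ + _)   // aggregates within existing partitions
println(counts.partitioner)              // the HashPartitioner is preserved
counts.collect().foreach(println)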
Advanced Spark Programming
Data Partitioning - Creating custom partitioner
Before partitioning (3 partitions):
Partition 1: [(sandeep,1)]
Partition 2: [(giri,1), (abhishek,1)]
Partition 3: [(sravani,1), (jude,1)]
After the custom partitioner (2 partitions):
Partition 1: [(giri,1), (abhishek,1), (jude,1)]
Partition 2: [(sravani,1), (sandeep,1)]
Advanced Spark Programming
Data Partitioning - Creating custom partitioner
import org.apache.spark.Partitioner

class TwoPartsPartitioner(override val numPartitions: Int) extends Partitioner {
  // Keys whose first letter comes after 'J' go to partition 1, the rest to partition 0
  def getPartition(key: Any): Int = key match {
    case s: String => if (s(0).toUpper > 'J') 1 else 0
  }
  override def equals(other: Any): Boolean = other.isInstanceOf[TwoPartsPartitioner]
  override def hashCode: Int = 0
}
https://gist.github.com/girisandeep/f90e456da6f2381f9c86e8e6bc4e8260
Advanced Spark Programming
Data Partitioning - Creating custom partitioner
https://gist.github.com/girisandeep/f90e456da6f2381f9c86e8e6bc4e8260
var x = sc.parallelize(Array(("sandeep",1),("giri",1),("abhishek",1),("sravani",1),("jude",1)), 3)
x.glom().collect()
//Array(Array((sandeep,1)), Array((giri,1), (abhishek,1)), Array((sravani,1), (jude,1)))
//[ [(sandeep,1)], [(giri,1), (abhishek,1)], [(sravani,1), (jude,1)] ]
var y = x.partitionBy(new TwoPartsPartitioner(2))
y.glom().collect()
//Array(Array((giri,1), (abhishek,1), (jude,1)), Array((sandeep,1), (sravani,1)))
//[ [(giri,1), (abhishek,1), (jude,1)], [(sandeep,1), (sravani,1)] ]