My Note on Spark RDD
Spark Intro:
Spark is an open-source big data processing framework built around speed, ease of use, and
sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMP Lab and open
sourced in 2010 as an Apache project.
Unlike MapReduce, Spark performs in-memory data processing. In-memory processing is much
faster because little time is spent moving data and intermediate results in and out of disk,
whereas MapReduce spends a lot of time on these input/output operations, which increases
latency. A primary difference between the two is that MapReduce persists intermediate data to
storage between stages, while Spark works with Resilient Distributed Datasets (RDDs).
What is an RDD?
The Resilient Distributed Dataset, or RDD, is a core component of the Apache Spark framework.
RDDs are a distributed memory abstraction that lets programmers perform in-memory
computations on large clusters in a fault-tolerant manner.
Let's understand an RDD by its name:
Resilient: fault-tolerant. With the help of its lineage graph, an RDD can recover from failures.
Distributed: data is spread across multiple nodes of a cluster.
Dataset: a collection of partitioned data.
An RDD is a collection of data with the following properties:
1) Immutable.
2) Lazily evaluated.
3) Type inferred.
4) Cacheable.
Let's dig a little deeper into each of these properties.
Immutability:
In object-oriented and functional programming, an immutable object (an unchangeable
object) is an object whose state cannot be modified after it is created.
Big Data is immutable almost by default: people generally append to large files, or overwrite
them wholesale, but rarely update them in place (no one wants to open a 1 TB file and edit its
1000th line).
So why does an RDD need immutability? Because with immutability we can achieve two things:
i) Parallel processing
ii) Caching
Parallel processing
The main problem when processing the same data in parallel is locking: while one operation is
reading the data and processing it, another operation may be updating it, which can produce
wrong results (a problem familiar from Java multi-threading, where locking is used to avoid
it). If the data is immutable, no locking is needed, because the underlying data never
changes.
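The idea can be illustrated with plain Scala collections, no Spark required (`original` and `doubled` are just illustrative names): a transformation on immutable data builds a new collection and leaves the source untouched, so concurrent readers can never observe a half-updated state.

```scala
// A transformation on an immutable collection builds a NEW collection;
// the source is never modified, so readers need no locks.
val original = List(1, 2, 3, 4)
val doubled  = original.map(_ * 2)

println(original) // still List(1, 2, 3, 4)
println(doubled)  // List(2, 4, 6, 8)
```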
Caching
If the data never changes, we can safely cache it (keep it in memory) and work on the cached copy.
Challenges of Immutability
Along with its advantages, immutability brings some notable challenges. Every change to
immutable data creates a new object, which can become a serious memory problem.
Sometimes we need to apply multiple transformations to an RDD; under strict immutability,
each transformation would create a separate copy of the same data, which is very expensive at
Big Data scale and can hurt performance badly.
These challenges are addressed by the next RDD property, lazy evaluation. Let's see how.
Lazy Evaluation
Lazy evaluation is a real advantage in the Big Data world. It means: do not compute a
transformation until its result is actually needed. The problem with immutability discussed
above is that applying multiple transformations to a single data set would create multiple
intermediate RDDs. Being lazy suppresses this: Spark does not execute a transformation when
it is requested; instead it records the list of transformations and executes them only when
the user asks for a result. This avoids materializing multiple copies of the data and solves
the challenges of immutability.
Note: lazy evaluation is only safe because the underlying data set is immutable.
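A plain-Scala view (an analogy, not Spark itself) shows the same deferred behaviour: the `map` below is only recorded, and nothing runs until a result is demanded, much like an RDD's transformations are recorded in its lineage and executed only by an action.

```scala
var evaluations = 0

// .view makes the pipeline lazy: the map is recorded, not run.
val pipeline = (1 to 5).view.map { n => evaluations += 1; n * n }
println(evaluations) // 0 -- nothing has been computed yet

// Demanding a result (like a Spark action) runs the whole pipeline.
val total = pipeline.sum
println(total)       // 55
println(evaluations) // 5
```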
Challenges of Lazy Evaluation
Everything has its pros and cons, and being lazy can also cause serious problems. One small
example: in MapReduce, after processing some fraction of the data, you can hit a failure such
as a key-class casting exception, wasting all the work done so far. With lazy evaluation this
is an even bigger concern, because a transformation is not executed at the moment it is
requested, so the error surfaces late.
In Spark with Scala, many such issues are avoided because type errors are caught at compile
time. This is where type inference comes in.
Type Inferred
Scala, the language Spark is written in, supports type inference. Let's look at it a little
deeper. In Java, to define a Map we use the syntax below:
Map<String, Double> map = new HashMap<>();
Here we explicitly specify the types of the keys and values stored in the Map; if data of an
unexpected type sneaks in (for example through raw types or casts), the failure appears only
at run time. Scala, by contrast, has a built-in type inference mechanism that allows the
programmer to omit many type annotations: it is often unnecessary to specify the type of a
variable, since the compiler can deduce it from the variable's initialization expression.
Ex:
1) val x = 1 + 2 * 3        // the type of x is Int
2) val y = "test comment"   // the type of y is String
This type inference helps with the challenges of lazy evaluation: mistakes are caught when
the code is compiled, not hours into a job.
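A small runnable sketch (`prices` is just an illustrative name): the compiler infers `Map[String, Double]` from the initializer, so a wrongly typed value would be rejected at compile time rather than discovered mid-job.

```scala
// No annotation needed: the compiler infers Map[String, Double].
val prices = Map("apple" -> 1.5, "pear" -> 2.0)

// Inferred as Double; writing `val total: Int = ...` here would not compile.
val total = prices.values.sum
println(total) // 3.5
```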
Cacheable
One of the most important capabilities in Spark is persisting (or caching) a dataset in
memory across operations. When you persist an RDD, each node stores the partitions it
computes in memory and reuses them in other actions on that dataset or on datasets derived
from it. This allows future actions to be much faster (often by more than 10x). Caching is a
key tool for iterative algorithms and fast interactive use.
When Spark creates an RDD, it also records the chain of transformations that produced it,
which is called its lineage. This makes the cache fault-tolerant: if cached data is lost,
Spark re-creates the same data from the lineage.
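Spark's `cache()`/`persist()` need a running cluster, but the compute-once, reuse-later behaviour can be mimicked in plain Scala with a `lazy val` (a rough analogy, not the Spark API; `expensive` is a hypothetical stand-in for a costly computation): the work runs on first access, and every later access reuses the stored result.

```scala
var computations = 0

def expensive(): List[Int] = {
  computations += 1            // count how often the real work happens
  (1 to 3).map(_ * 10).toList
}

// lazy val: computed on first use, reused afterwards -- like a persisted RDD.
lazy val cached = expensive()

println(cached)       // List(10, 20, 30)
println(cached)       // same stored result, no recomputation
println(computations) // 1
```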
In summary,
 Immutable data can be cached for long periods and processed in parallel safely, which
gives very high performance.
 Lazy transformations let Spark re-create lost cached data from the lineage, which is how
it achieves fault tolerance.
 Type inference makes RDD code robust by determining data types at compile time.
 Caching data improves performance.