In this session, we discussed the end-to-end working of Apache Spark, focusing mainly on the "why, what and how" factors. We covered RDDs and the high-level APIs such as DataFrame and Dataset, and also looked at the internals of Spark.
End-to-end working of Apache Spark
1. End-to-end working of Apache Spark
Presented By:
Sarfaraz Hussain, Software Consultant, Knoldus Inc
Divyansh Jain, Software Consultant, Knoldus Inc
2. KnolX Etiquettes
Lack of etiquette and manners is a huge turn off.
Punctuality: Respect the KnolX session timings; you are requested not to join sessions after a 5-minute threshold past the session start time.
Feedback: Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode: Keep your mobile devices in silent mode; feel free to move out of the session in case you need to attend an urgent call.
Avoid Disturbance: Avoid unwanted chit-chat during the session.
3. Agenda
01 Why and What is Spark?
02 Working of Spark
03 Operations in Spark
04 Task and Stages
05 DataFrame & DataSets
06 Demo
5. Why Spark? (Distributed Computing)
Traditional Enterprise Approach:
- Splitting the data across different systems
- Coding is required on every system
- No fault tolerance
- Aggregation of data
- System A is unaware of the data stored in System B and vice versa.
6. What is Spark?
The official definition of Apache Spark says that "Apache Spark™ is a unified analytics engine for large-scale data processing." It is an in-memory computation processing engine where the data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel.
8. Working of Spark
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
result RDD => values smaller than 10 from the number RDD
Master => Driver
Slave => Executor
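The snippet above assumes a SparkSession named spark is already in scope (as it is in spark-shell). A minimal sketch of building one in a standalone application, with an illustrative app name and a local master URL standing in for a real cluster:

import org.apache.spark.sql.SparkSession

// Build the SparkSession used by the driver program; the executors are
// started by the cluster manager (here, local threads stand in for them).
val spark = SparkSession.builder()
  .appName("end-to-end-spark-demo")  // illustrative application name
  .master("local[*]")                // local mode; a cluster URL would go here
  .getOrCreate()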
9. RDD
Spark RDD is a resilient, partitioned, distributed and immutable collection of
data.
We can create an RDD using two methods:
- Load some data from a source.
- Create an RDD by transforming another RDD.
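A minimal sketch of both creation methods, assuming the spark session from earlier and an illustrative file path:

// 1. Load data from a source (a file here, or an in-memory collection via parallelize).
val fromSource = spark.sparkContext.textFile("path_to_File1.txt")
val fromCollection = spark.sparkContext.parallelize(Seq(5, 6, 9, 2, 1, 5))

// 2. Create a new RDD by transforming an existing RDD.
val transformed = fromSource.flatMap(_.split("\t")).map(_.toInt)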
11. Working of Spark
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
B1 → B4 => 5,6
B2 → B5 => 9, 2
B3 → B6 => 1, 5
B4, B5, B6 => Result RDD
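One way to peek at how the elements land in the partitions B4, B5 and B6 above is glom(), which groups each partition's contents into an array (a sketch building on the result RDD from the snippet):

// Print the contents of each partition of the result RDD.
result.glom().collect().zipWithIndex.foreach { case (partition, index) =>
  println(s"Partition $index: ${partition.mkString(", ")}")
}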
12. Lineage
- When a new RDD is derived from an existing RDD, that new RDD contains a pointer to the parent RDD, and Spark keeps track of all the dependencies between these RDDs using a component called the Lineage.
- In case of data loss, this lineage is used to rebuild the data.
- The SparkContext (Driver) maintains the Lineage.
- It is also known as the Dependency Graph.
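The lineage can be inspected with toDebugString, which prints the chain of parent RDDs Spark would replay to rebuild lost partitions (a sketch using the same illustrative pipeline):

val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)

// Shows the dependency chain, e.g. MapPartitionsRDDs back to the HadoopRDD source.
println(result.toDebugString)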
13. Operations in Spark
- Transformation:
  - Creates a new RDD from an existing RDD.
- Action:
  - Produces a non-RDD result and returns it to the user, i.e. it creates results in some form of a Java collection.
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
result.collect() → Action
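collect() is only one of several actions; each call below also returns a plain value to the driver and triggers a job (a sketch building on the result RDD above):

result.count()       // Long: number of elements
result.first()       // Int: the first element
result.take(2)       // Array[Int]: the first two elements
result.reduce(_ + _) // Int: sum of all elements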
15. Working of Spark (Lazy Evaluation)
Until we hit an Action, none of the above operations (transformations) actually take place.
- RDDs are lazily evaluated.
- RDDs are immutable.
- The values of an action are stored to the driver or to the external storage system.
- An action brings the laziness of RDDs into motion.
- An action is one of the ways of sending data from the Executor to the Driver.
- An action kicks off a job to execute on the cluster.
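A sketch of laziness in action, with an illustrative file path: declaring the transformations triggers no work, and only the final action launches a job:

val lines   = spark.sparkContext.textFile("path_to_File1.txt", 3) // no job runs yet
val numbers = lines.flatMap(_.split("\t")).map(_.toInt)           // still no job
val small   = numbers.filter(_ < 10)                              // still no job
val values  = small.collect()                                     // the action finally kicks off a job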
20. Spark Execution Model
● Two kinds of operations:
1. Transformation
2. Action
● Dependencies are divided into two types:
1. Narrow Dependency
2. Wide Dependency
● Stages
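A sketch of how dependencies map to stages, using an illustrative word-count pipeline: the narrow dependencies stay within one stage, while the wide dependency introduced by reduceByKey forces a shuffle and starts a new stage:

val words = spark.sparkContext.textFile("path_to_File1.txt")
  .flatMap(_.split(" "))              // narrow dependency: no data movement
  .map(word => (word, 1))             // narrow dependency: same stage

val counts = words.reduceByKey(_ + _) // wide dependency: shuffle, new stage begins

counts.collect()                      // the action submits a job with two stages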
26. Problem with RDD
- Focus on "How To" rather than "What To".
- Not much optimized by Spark; optimizing an RDD is the developer's responsibility.
- RDDs are a low-level API.
- Inadvertent inefficiencies.
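A sketch of the "How To" vs "What To" contrast, with illustrative data and column names: the RDD version spells out every step of an average-per-name aggregation, while the DataFrame version only declares the result and leaves the plan to Spark:

import spark.implicits._

val people = Seq(("Jim", 20), ("Ann", 31), ("Jin", 30))

// RDD: we describe how to aggregate, and Spark cannot optimize much of it.
val rddAvg = spark.sparkContext.parallelize(people)
  .map { case (name, age) => (name, (age, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

// DataFrame: we declare what we want; the optimizer decides how to run it.
val dfAvg = people.toDF("name", "age").groupBy("name").avg("age")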
29. RDD vs DataFrames vs Datasets
                  SQL         DataFrames     Datasets
Syntax errors     Run time    Compile time   Compile time
Analysis errors   Run time    Run time       Compile time
Analysis errors are reported before a distributed job starts.
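A small sketch of where errors surface, using an illustrative Person case class: a field typo in the typed Dataset API fails at compile time, while the same typo in a SQL-style string expression only surfaces when the query is analyzed at run time:

import spark.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("Jim", 20), Person("Ann", 31)).toDS()

ds.filter(_.age > 25).show()     // typed API: writing _.agee here would not compile
// ds.toDF().filter("agee > 25") // string expression: the same typo only fails at analysis (run) time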
32. DataFrame = Easy to write... Believe it
Output:
name  age
Jim   20
Ann   31
Jin   30
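One way the table above could be produced (a sketch with the same illustrative data):

import spark.implicits._

val df = Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)).toDF("name", "age")
df.show()
// +----+---+
// |name|age|
// +----+---+
// | Jim| 20|
// | Ann| 31|
// | Jin| 30|
// +----+---+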
33. Catalyst Optimizer
Analysis: analyzing a logical plan to resolve references
Logical Optimization: optimizing the logical plan
Code Generation: compiling parts of the query to Java bytecode
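The phases above can be observed directly with explain(true), which prints the parsed, analyzed, optimized and physical plans Catalyst produces for a query (a sketch on the illustrative DataFrame from the previous slide):

import spark.implicits._

val df = Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)).toDF("name", "age")
df.filter($"age" < 30).select("name").explain(true)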
37. DataFrame and Dataset
Why:
- High-level APIs & DSL
- Strong type-safety
- Ease of use & readability
- What to Do
When:
- Structured data schema
- Code optimisations & performance
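A sketch of the two APIs side by side, with an illustrative Person case class: the DataFrame works on untyped rows, while as[Person] gives a typed Dataset whose fields are checked at compile time:

import spark.implicits._

case class Person(name: String, age: Int)

val df = Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)).toDF("name", "age") // DataFrame: untyped rows
val ds = df.as[Person]                                                  // Dataset[Person]: strongly typed
ds.map(p => p.name.toUpperCase).show()                                  // field access checked at compile time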