4. What is an RDD?
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
It is an immutable distributed collection of objects.
Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
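A minimal sketch of creating an RDD in Scala (assuming a SparkSession named spark; the values and partition count are just examples):

val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4) // 4 logical partitions
val doubled = rdd.map(_ * 2) // transformations return new, immutable RDDs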
5. What is a Lineage Graph?
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD.
It is built as a result of applying transformations to the RDD and forms a logical execution plan.
RDD Lineage is stored in the Driver's memory.
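You can inspect an RDD's lineage with toDebugString; a minimal sketch:

val base = spark.sparkContext.parallelize(1 to 100)
val derived = base.map(_ + 1).filter(_ % 2 == 0)
println(derived.toDebugString) // prints this RDD's chain of parent RDDs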
6. Problem? Driver Memory Spiking Up?
Analyze what's making Driver Memory spike up.
Analyze which objects are occupying most of the memory (jmap -histo <pid>).
If char arrays ([C) are taking up most of your memory, the Lineage Graph is probably the CULPRIT.
7. Driver Memory Spiking Up - Analysis
Still not able to figure it out?
Take a Heap Dump. (jmap -dump:live,format=b,file=heap.bin <pid>)
Ship the dump to your local machine and analyze the heap with memory analyzer tools like Eclipse MAT, VisualVM, etc.
10. Solution? RDD Checkpointing
More Transformations – Big Lineage Graph – More Driver Memory – How to Handle?
By checkpointing the RDD (rdd.checkpoint()).
It cuts down the Lineage Graph drastically, frees up Driver Memory, and yields a faster execution plan.
11. RDD Checkpointing - Prerequisites
Specify a checkpointing directory (typically on HDFS):
spark.sparkContext.setCheckpointDir(dfs.getHomeDirectory.toString + "/" +
spark.sparkContext.applicationId)
Perform an action on the checkpointed RDD (rdd.count); see the full sketch below.
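Putting the prerequisites together, a minimal sketch (dfs is a Hadoop FileSystem handle, and rdd is assumed to be an existing RDD; the directory layout mirrors the setCheckpointDir call above):

import org.apache.hadoop.fs.FileSystem

val dfs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
spark.sparkContext.setCheckpointDir(
  dfs.getHomeDirectory.toString + "/" + spark.sparkContext.applicationId)

rdd.checkpoint() // marks the RDD for checkpointing
rdd.count()      // the action materializes the checkpoint and truncates the lineage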
12. RDD Checkpointing - Cons
It writes the entire RDD's data in serialized format to HDFS.
13. Why write the complete RDD to HDFS?
Since the Lineage Graph is cut down in memory, Spark has no way to trace the RDD back to its source.
So whenever executors get lost/preempted, rather than recomputing the RDD from the start, Spark uses the checkpointed data (the entire RDD) from HDFS. Saves a ton of time. Voilà!
So, don't be SAD, cheer up!
15. Unions vs. Joins?
Typically, Union checks the data type and nullability of each column all the way down to the root of the Lineage Graph.
I'd prefer a Join over a Union (it depends on the use case, the length of the lineage graph, the number of transformations, etc.).
My job has lots of transformations, and the Unions never return at all.
If you are experiencing the same problem, try converting them to Left/Right/Full/Inner Joins, as sketched below.
You'll end up happy.
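A hedged sketch of the rewrite (df1, df2, and the "id" key are hypothetical; whether a join is a valid substitute depends entirely on your use case):

val unioned = df1.union(df2)                          // analysis checks schemas down the lineage
val joined  = df1.join(df2, Seq("id"), "full_outer") // one of the left/right/full/inner variants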
16. Analysis Exception – Not able to figure out?
Sometimes Spark goes crazy as well and doesn't throw a proper error, making the cause hard to figure out.
21. Analysis Exception – Bottom Line
Most of the time the error is legitimate, meaning one or more columns might be missing from the DataFrames used in the query.
If every column is there, it might be due to the nullability of a column.
If you are joining/unioning two DataFrames and two columns have the same name but their NULLABILITY flags differ, try to address that, as sketched below.
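One way to align the flags is to rebuild one side with an explicitly nullable schema before the union/join; a minimal sketch (df1 and df2 are hypothetical):

import org.apache.spark.sql.types.StructType

// Mark every field in df2 as nullable so its schema matches df1's.
val nullableSchema = StructType(df2.schema.map(_.copy(nullable = true)))
val df2Aligned = spark.createDataFrame(df2.rdd, nullableSchema)
val result = df1.union(df2Aligned)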
23. Null Pointer Exception
If you use Spark's ORC/Parquet API to read the data, the metadata is retrieved from the file footers.
When writing to HDFS in ORC/Parquet from Spark, by default Spark stores Metadata info in
the File Footers.
Metadata includes column names, data types, basic aggregate statistics, the compression codec, etc.
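You can see the footer-derived schema by reading the files directly; a minimal sketch (the path is hypothetical):

val df = spark.read.parquet("/data/payments/")
df.printSchema() // column names and types as recorded in the file footers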
24. Null Pointer Exception – contd.
If you read from a Hive table, Spark gets the metadata from the Hive Metastore.
Hive lets you declare any data type in the DDL irrespective of the actual data, so reads might fail.
25. Null Pointer Exception – contd.
In the above example, the "PaidAmount" datatype in the file footer is Decimal(38,6), while in the Hive Metastore it is Decimal(10,0) (Hive's default decimal precision).
26. Null Pointer Exception – Solution
Try reading directly from Directories and Files.
If you want to read from Hive tables directly, make sure the data types are in sync with the actual data (see the sketch below).
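Both options as a sketch (the path, table, and column names are hypothetical):

// Option 1: bypass the Metastore and take the schema from the file footers.
val df = spark.read.orc("/warehouse/payments/")

// Option 2: bring the Hive DDL back in sync with the actual data.
spark.sql("ALTER TABLE db.payments CHANGE COLUMN PaidAmount PaidAmount DECIMAL(38,6)")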
27. Useful Utilities in Spark
https://github.com/aguyyala/spark-utitlities/blob/master/Utility.scala