4. What is an RDD?
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
It is an immutable distributed collection of objects.
Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
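A minimal sketch of creating an RDD in Scala (assuming a SparkSession named spark; the values and partition count are just examples):

val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4) // 4 logical partitions
val doubled = rdd.map(_ * 2) // transformations return new, immutable RDDs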
5. What is a Lineage Graph?
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD.
It is built as a result of applying transformations to the RDD and forms a logical execution plan.
RDD Lineage is stored in the Driver's memory.
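You can inspect an RDD's lineage with toDebugString; a minimal sketch:

val base = spark.sparkContext.parallelize(1 to 100)
val derived = base.map(_ + 1).filter(_ % 2 == 0)
println(derived.toDebugString) // prints this RDD's chain of parent RDDs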
6. Problem? Driver Memory Spiking Up?
Analyze what's making Driver Memory spike up.
Analyze which objects are occupying most of the memory (jmap -histo <pid>).
If char arrays ([C) are taking up most of your memory, the Lineage Graph is probably the CULPRIT.
7. Driver Memory Spiking Up - Analysis
Still not able to figure it out?
Take a Heap Dump. (jmap -dump:live,format=b,file=heap.bin <pid>)
Ship the dump to your local machine and analyze the heap with memory analyzer tools like Eclipse MAT, VisualVM, etc.
10. Solution? RDD Checkpointing
More Transformations – Big Lineage Graph – More Driver Memory – How to Handle?
By checkpointing the RDD (rdd.checkpoint()).
It cuts down the Lineage Graph drastically, frees up Driver Memory, and yields a faster execution plan.
11. RDD Checkpointing - Prerequisites
Specify a checkpointing directory (typically on HDFS):
spark.sparkContext.setCheckpointDir(dfs.getHomeDirectory.toString + "/" +
spark.sparkContext.applicationId)
Perform an action on the checkpointed RDD (rdd.count); see the full sketch below.
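Putting the prerequisites together, a minimal sketch (dfs is a Hadoop FileSystem handle, and rdd is assumed to be an existing RDD; the directory layout mirrors the setCheckpointDir call above):

import org.apache.hadoop.fs.FileSystem

val dfs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
spark.sparkContext.setCheckpointDir(
  dfs.getHomeDirectory.toString + "/" + spark.sparkContext.applicationId)

rdd.checkpoint() // marks the RDD for checkpointing
rdd.count()      // the action materializes the checkpoint and truncates the lineage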
12. RDD Checkpointing - Cons
It writes the entire RDD's data in serialized format to HDFS.
13. Why write the complete RDD to HDFS?
Since the Lineage Graph is cut down in memory, Spark has no way to trace the RDD back to its source.
So whenever executors get lost/preempted, rather than recomputing the RDD from the start, Spark uses the checkpointed data (the entire RDD) from HDFS. Saves a ton of time. Voilà!
So, don't be SAD, cheer up!
15. Unions vs. Joins?
Typically, Union checks the data type and nullability of each column all the way down to the root of the Lineage Graph.
I'd prefer a Join over a Union (it depends on the use case, the length of the lineage graph, the number of transformations, etc.).
My job has lots of transformations, and the Unions never return at all.
If you are experiencing the same problem, try converting them to Left/Right/Full/Inner Joins, as sketched below.
You'll end up happy.
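A hedged sketch of the rewrite (df1, df2, and the "id" key are hypothetical; whether a join is a valid substitute depends entirely on your use case):

val unioned = df1.union(df2)                          // analysis checks schemas down the lineage
val joined  = df1.join(df2, Seq("id"), "full_outer") // one of the left/right/full/inner variants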
16. Analysis Exception – Not able to figure out?
Sometimes Spark goes crazy as well and doesn't throw a proper error, making the cause hard to figure out.
21. Analysis Exception – Bottom Line
Most of the time the error is legitimate, meaning one or more columns might be missing from the DataFrames used in the query.
If every column is there, it might be due to the nullability of a column.
If you are joining/unioning two DataFrames and two columns have the same name but their NULLABILITY flags differ, try to address that, as sketched below.
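One way to align the flags is to rebuild one side with an explicitly nullable schema before the union/join; a minimal sketch (df1 and df2 are hypothetical):

import org.apache.spark.sql.types.StructType

// Mark every field in df2 as nullable so its schema matches df1's.
val nullableSchema = StructType(df2.schema.map(_.copy(nullable = true)))
val df2Aligned = spark.createDataFrame(df2.rdd, nullableSchema)
val result = df1.union(df2Aligned)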
23. Null Pointer Exception
If you use Spark's ORC/Parquet API to read the data, the metadata is retrieved from the file footers.
When writing to HDFS in ORC/Parquet from Spark, by default Spark stores Metadata info in
the File Footers.
Metadata includes column names, data types, basic aggregate statistics, the compression codec, etc.
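You can see the footer-derived schema by reading the files directly; a minimal sketch (the path is hypothetical):

val df = spark.read.parquet("/data/payments/")
df.printSchema() // column names and types as recorded in the file footers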
24. Null Pointer Exception – contd.
If you read from a Hive table, Spark gets the metadata from the Hive Metastore.
Hive lets you declare any data type in the DDL irrespective of the actual data, so reads might fail.
25. Null Pointer Exception – contd.
In the above example, the "PaidAmount" datatype in the file footer is Decimal(38,6), while in the Hive Metastore it is Decimal(10,0) (Hive's default decimal precision).
26. Null Pointer Exception – Solution
Try reading directly from Directories and Files.
If you want to read from Hive tables directly, make sure the data types are in sync with the actual data (see the sketch below).
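Both options as a sketch (the path, table, and column names are hypothetical):

// Option 1: bypass the Metastore and take the schema from the file footers.
val df = spark.read.orc("/warehouse/payments/")

// Option 2: bring the Hive DDL back in sync with the actual data.
spark.sql("ALTER TABLE db.payments CHANGE COLUMN PaidAmount PaidAmount DECIMAL(38,6)")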
27. Useful Utilities in Spark
https://github.com/aguyyala/spark-utitlities/blob/master/Utility.scala