Spark Learning
Spark Architecture
Spark – Overview
What is an RDD?
• A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
• It is an immutable, distributed collection of objects.
• Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
What is a Lineage Graph?
• RDD lineage (also called the RDD operator graph or RDD dependency graph) is the graph of all the parent RDDs of an RDD.
• It is built as transformations are applied to the RDD, and it forms the logical execution plan.
• The RDD lineage is stored in the driver's memory.
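A minimal sketch of how the lineage of an RDD can be inspected with toDebugString. The object name, the local master setting, and the sample transformations are assumptions made for illustration; they are not from the slides.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LineageSketch {
  // Builds a small RDD through three transformations; each one adds a parent to the lineage.
  def derived(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 100, numSlices = 4)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .map(_ + 1)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; on a cluster the master comes from spark-submit.
    val spark = SparkSession.builder().master("local[2]").appName("lineage-sketch").getOrCreate()

    val rdd = derived(spark.sparkContext)
    println(rdd.toDebugString)                         // prints the chain of parent RDDs
    println(s"partitions: ${rdd.getNumPartitions}")    // logical partitions of the RDD

    spark.stop()
  }
}
```

Nothing is computed until an action runs; the driver only holds the description of the chain, which is why a very long chain costs driver memory.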
Problem? Driver Memory Spiking Up?
• Find out what is making driver memory spike.
• Check which objects occupy most of the memory (jmap -histo <pid>).
• If character arrays ([C) take up most of the memory, the lineage graph is probably the culprit.
Driver Memory Spiking Up - Analysis
• Still not able to figure it out?
• Take a heap dump (jmap -dump:live,format=b,file=heap.bin <pid>).
• Copy the dump to your local machine and analyze the heap with a memory analyzer tool such as Eclipse MAT or Java VisualVM.
Analyzing dump with Eclipse MAT
Solution? RDD Checkpointing :)
• More transformations mean a bigger lineage graph and more driver memory. How to handle it?
• Apply checkpointing to the RDD (rdd.checkpoint()).
• This cuts the lineage graph down drastically, frees up driver memory, and gives a faster execution plan :)
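RDD Checkpointing - Prerequisites
• Specify a checkpoint directory (typically on HDFS), where dfs is presumably a Hadoop FileSystem handle:
  spark.sparkContext.setCheckpointDir(dfs.getHomeDirectory.toString + "/" + spark.sparkContext.applicationId)
• Call rdd.checkpoint() before any action has run on the RDD, then perform an action on the checkpointed RDD (rdd.count) so that the checkpoint is actually written.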
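The prerequisites above can be sketched end to end as follows. This is an illustration under stated assumptions: the object and helper names are hypothetical, and a local temporary directory stands in for the HDFS checkpoint directory so the sketch can run without a cluster.

```scala
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  // Chains `steps` map transformations so that the lineage grows long.
  def longLineage(start: RDD[Int], steps: Int): RDD[Int] =
    (1 to steps).foldLeft(start)((rdd, _) => rdd.map(_ + 1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Assumption: a local temp directory replaces the HDFS path used on a real cluster.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    val rdd = longLineage(sc.parallelize(1 to 10, 2), steps = 50)
    val before = rdd.toDebugString.split("\n").length

    rdd.cache()       // avoids computing the RDD twice (once for the action, once for the checkpoint write)
    rdd.checkpoint()  // only marks the RDD; it must be called before any action on this RDD
    rdd.count()       // the action triggers the actual checkpoint write

    val after = rdd.toDebugString.split("\n").length
    println(s"lineage lines before: $before, after: $after, checkpointed: ${rdd.isCheckpointed}")

    spark.stop()
  }
}
```

After the action, the debug string should be much shorter, because the RDD now points at the checkpoint files instead of its chain of parents.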
RDD Checkpointing - Cons :(
• It writes the entire RDD's data in serialized format to HDFS.
Why write the complete RDD to HDFS?
• Once the lineage graph is cut in memory, Spark has no way to trace back how the RDD was computed.
• So whenever executors are lost or preempted, Spark reads the checkpointed data (the entire RDD) from HDFS rather than recomputing the RDD from the start. This saves a ton of time. Voilà :)
• So don't be sad, cheer up :)
End of RDD Checkpointing
Unions vs. Joins?
• Union typically checks the data type and nullability of each column all the way down to the root of the lineage graph.
• I'd prefer a join over a union (depending on the use case, the length of the lineage graph, the number of transformations, etc.).
• My job has lots of transformations, and its unions never return at all :(
• If you are experiencing the same problem, try converting them to left/right/full/inner joins.
• You'll end up :)
Analysis Exception – Not able to figure it out?
• Sometimes Spark goes crazy too and doesn't throw an error clear enough to figure out the cause.
Analysis Exception – Example
Analysis Exception – Debug
Analysis Exception – Bottom Line
• Most of the time the error is legitimate: one or more columns are missing from the DataFrames used in the query.
• If every column is present, the cause may be the nullability of a column.
• If you are joining or unioning two DataFrames and two columns have the same name but different nullability flags, try aligning the flags.
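One way to align nullability flags is to rebuild a DataFrame on a relaxed copy of its schema. The sketch below is an illustration, not the author's method: the object and helper names are hypothetical, it only touches top-level fields, and going through df.rdd has a cost on large data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object NullabilitySketch {
  // Returns a copy of `schema` with every top-level field marked nullable.
  def asNullable(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(nullable = true)))

  // Rebuilds the DataFrame on the relaxed schema; the rows themselves are unchanged.
  def relaxNullability(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, asNullable(df.schema))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("nullability-sketch").getOrCreate()
    import spark.implicits._

    val left = Seq((1, "a")).toDF("id", "name")           // id comes from a Scala Int: nullable = false
    val right = Seq((Option(2), "b")).toDF("id", "name")  // id comes from an Option: nullable = true

    left.printSchema()
    right.printSchema()

    // Align the flags before combining the two DataFrames.
    val aligned = relaxNullability(spark, left)
    aligned.printSchema()
    aligned.union(right).show()

    spark.stop()
  }
}
```

Comparing the printSchema output of both sides before a join or union is usually the quickest way to spot a differing flag.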
End of Analysis Exception
Null Pointer Exception
• If you read the data with Spark's ORC/Parquet API, the metadata is retrieved from the file footers.
• When writing ORC/Parquet to HDFS, Spark by default stores the metadata in the file footers.
• The metadata includes column names, data types, basic aggregations, the compression codec, etc.
Null Pointer Exception – contd.
• If you read from a Hive table, Spark gets the metadata from Hive's metastore.
• Hive lets you declare any data type regardless of the actual data, so the read may fail.
Null Pointer Exception – contd.
• In the above example, the "PaidAmount" data type is Decimal(38, 6) in the file footer and Decimal(10,0) in Hive's metastore (Hive's default decimal precision).
Null Pointer Exception – Solution
• Try reading directly from the directories and files.
• If you want to read from Hive tables directly, make sure the data types are in sync with the actual data.
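To check whether a Hive table is in sync with its files, the schema read from the file footers can be compared with the schema from the metastore. This is a sketch under assumptions: the object name, the helper, and the two command-line arguments (a Parquet directory and a table name) are hypothetical, and the session needs Hive support on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object SchemaSyncSketch {
  // Lists columns whose data type differs between the file schema and the table schema.
  // Column names are matched case-insensitively.
  def typeMismatches(fileSchema: StructType, tableSchema: StructType): Seq[String] = {
    val tableTypes = tableSchema.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
    fileSchema.fields.toSeq.flatMap { f =>
      tableTypes.get(f.name.toLowerCase).collect {
        case t if t != f.dataType =>
          s"${f.name}: files=${f.dataType.simpleString}, table=${t.simpleString}"
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: the directory holding the Parquet files and the Hive table over it.
    val Array(path, table) = args
    val spark = SparkSession.builder().appName("schema-sync-sketch").enableHiveSupport().getOrCreate()

    val fromFiles = spark.read.parquet(path).schema  // schema taken from the file footers
    val fromHive = spark.table(table).schema         // schema taken from the Hive metastore

    typeMismatches(fromFiles, fromHive).foreach(println)
    spark.stop()
  }
}
```

For the PaidAmount case described above, the helper would report the Decimal(38, 6) versus Decimal(10,0) difference before the read fails at runtime.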
Useful Utilities in Spark
• https://github.com/aguyyala/spark-utitlities/blob/master/Utility.scala
Questions?
END

Editor's Notes

  1. Source: Databricks