SlideShare a Scribd company logo
Anatomy of in-memory
processing in Spark
A deep-dive into custom memory management in Spark
https://github.com/shashankgowdal/introduction_to_dataset
● Shashank L
● Big data consultant and trainer at
datamantra.io
● www.shashankgowda.com
Agenda
● Era of in-memory processing
● Big data frameworks on JVM
● JVM memory model
● Custom memory management
● Allocation
● Serialization
● Processing
● Benefits of these on Spark
Era of in-memory processing
● After Spark, in memory has become a defacto
standard for big data workloads
● Advancement in hardware is pushing more
frameworks in that direction
● Memory management is coupled with runtime of the
framework
● Using memory efficiently in a big data workload is a
challenging task
● Memory management depends upon runtime of the
framework
Why JVM is a prominent runtime for
big data workloads
● Managed runtime
● Portable
● Hadoop was on JVM
● Rich eco system
Big data frameworks on JVM
● Many frameworks today runs on JVM today
○ Spark
○ Flink
○ Hadoop
○ etc
● Organising data in memory
○ In-memory processing
○ In-memory caching of intermediate results
● Memory management influences
○ Resource efficiency
○ Performance
Straight-forward approach
● JVM memory model approach
● Store collection of objects and perform any processing on
the collection.
● Advantages
○ Eases development cycle
○ Built in safety checks before modifying any of the memory
○ Reduces complexity
○ JVM built-in GC
JVM memory model - Disadvantages
● Predicting memory consumption is hard
○ If it fails, OutOfMemory error kills the JVM
● High garbage collection overhead
○ Easily 50% of the time spent in GC
● Objects have space overhead
○ JVM objects doesn’t take the same amount of memory as
we think.
Java object overhead
● Consider a string “abcd” as a JVM object. By looking at it, it
should take up 4 bytes (one per character) of memory.
Garbage collection challenges
● Many big data workloads create objects in a way that are
unfriendly to regular Java GC.
● Young generation garbage collection is frequent.
● Objects created in big data workloads tend to live in Young
generation itself because they are used few times.
Generality has a cost, so
semantics and schema should
be used to exploit specificity
instead
Custom memory management
● Allocation - Allocate fixed number of segments upfront.
● Serialization - Data objects are serialized into memory
segments
● Processing - Implement algorithms on binary
representation.
Allocation
Managing memory on our own
● sun.misc.Unsafe
● Directly manipulating memory without safety checks.
(hence, its unsafe)
● This API is used to build data structures off heap in
Spark.
sun.misc.Unsafe
● Unsafe is one of the gateway to low level
programming in Java.
● Exposes C-style memory access
● explicit allocation, deallocation, pointer arithmetics
● Unsafe methods are intrinsic
Hands-on
● DirectIntArray
● MemoryCorruption
Custom memory management in Spark
On heap
● Stores data inside an array of type Long
● Capable of storing 16GB at once
● Bytes are encoded in long and stored here
Off heap
● Allocates memory in the memory assigned to JVM other than
heap
● Uses Unsafe API
● Stores bytes directly
Encoding memory addresses
● Off heap: Addresses are raw memory pointers.
● On heap: Addresses are base object + offset pairs
● Spark uses its own page table abstraction to enable more
compact encoding of on-heap addresses.
Serialization
Data Structures prominently used in Big
data
● Sequence
● Key-Value pair
Java object-based row notation
● 3 fields of type (int, string, string)
with value (123, “data”,”mantra”)
➔ 5+ objects
➔ high space overhead
➔ expensive hashCode()
Tungsten’s unsafe row format
● Bitset for tracking null values
● Every column appears in the fixed-length value region
○ Fixed length variables are inclined
○ For variable length values, we store a relative offset
into the variable length data section
● Rows are always 8 byte aligned
● Equality comparison can be done on raw bytes.
Example of unsafe row
null tracking bitmap
(123, “data”,”mantra”)
Hands-on
● UnsafeRowCreator
java.util.HashMap
...
array
● Huge object overheads
● Poor memory locality
● Size estimation is hard
Tungsten BytesToBytesMap
...
● Low overheads
● Good memory locality, especially for scans
Processing
Many big data workloads are now
compute bound
● Network optimizations can only reduce job completion time by a median of
at most 2%.
● Optimizing or eliminating disk accesses can only reduce job completion
time by a median of at most 19%
● [1]
Hardware trends
Why is CPU the new bottleneck
● Hardware has improved
○ 1Gbps to 10Gbps link in networks
○ High B/W SSDs or Stripped HDD arrays
● Spark IO has been optimized
○ Many workloads now avoid significant disk IO by pruning
data that is not needed in a given job
○ New shuffle and network layer implementations
● Data formats have improved
○ Parquet, binary data formats
● Serialization and Hashing are CPU-bound bottlenecks
Code generation
● Generic evaluation of expression logic is very expensive
on the JVM
○ Virtual function calls
○ Branches based on expression type
○ Object creation due to primitive boxing
○ Memory consumption by boxed primitive objects
● Generating the code which directly applies the expression
logic on serialized data
Which Spark API can be benefited
Spark dataframes
SparkSQL
RDD
Why only Dataframes are benefited?
Python
DF
Java/Scala
DF
R
DF
Logical
Plan
Physical
execution
Catalyst
optimizer
Spark
SQL
Physical
execution
RDD
API
Runtime bytecode generation
Dataframe code
Catalyst expressions
Low level bytecode
Code generation
Aggregation optimization in DataFrame
Aggregation optimization in DataFrame
Input row Grouping keyUnsafeRow
BytesToBytesMapIterate
Update in place
Probe
ProjectConvert
Scan
Performance results with optimizations
(Run time)
Performance results with optimizations
(GC Time)
References
● https://www.eecs.berkeley.edu/~keo/publications/nsdi15-
final147.pdf
● https://databricks.com/blog/2015/04/28/project-tungsten-
bringing-spark-closer-to-bare-metal.html
● https://spark-summit.org/2015/events/deep-dive-into-
project-tungsten-bringing-spark-closer-to-bare-metal/
● https://gist.github.com/raphw/7935844
● http://www.bigsynapse.com/addressing-big-data-
performance

More Related Content

What's hot

Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
Spark Summit
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
Databricks
 
Javantura v4 - Getting started with Apache Spark - Dinko Srkoč
Javantura v4 - Getting started with Apache Spark - Dinko SrkočJavantura v4 - Getting started with Apache Spark - Dinko Srkoč
Javantura v4 - Getting started with Apache Spark - Dinko Srkoč
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
 
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Databricks
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Databricks
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
Jen Aman
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Databricks
 

What's hot (20)

Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Javantura v4 - Getting started with Apache Spark - Dinko Srkoč
Javantura v4 - Getting started with Apache Spark - Dinko SrkočJavantura v4 - Getting started with Apache Spark - Dinko Srkoč
Javantura v4 - Getting started with Apache Spark - Dinko Srkoč
 
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
 

Similar to Anatomy of in memory processing in Spark

Introduction to Memoria
Introduction to MemoriaIntroduction to Memoria
Introduction to MemoriaVictor Smirnov
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Distributed caching with java JCache
Distributed caching with java JCacheDistributed caching with java JCache
Distributed caching with java JCache
Kasun Gajasinghe
 
Introduce_non-volatile_generic_object_programming_model_for_In-Memory_Computing
Introduce_non-volatile_generic_object_programming_model_for_In-Memory_ComputingIntroduce_non-volatile_generic_object_programming_model_for_In-Memory_Computing
Introduce_non-volatile_generic_object_programming_model_for_In-Memory_ComputingYanpingWang
 
IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...
IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...
IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...
In-Memory Computing Summit
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Recent advancements in cache technology
Recent advancements in cache technologyRecent advancements in cache technology
Recent advancements in cache technology
Paras Nath Chaudhary
 
Caching principles-solutions
Caching principles-solutionsCaching principles-solutions
Caching principles-solutionspmanvi
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Ahsan Javed Awan
 
MSE
MSEMSE
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
PostgreSQL Experts, Inc.
 
Silverstripe at scale - design & architecture for silverstripe applications
Silverstripe at scale - design & architecture for silverstripe applicationsSilverstripe at scale - design & architecture for silverstripe applications
Silverstripe at scale - design & architecture for silverstripe applications
BrettTasker
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
Gagan Bajpai
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Grids
jlorenzocima
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
Shuyi Chen
 

Similar to Anatomy of in memory processing in Spark (20)

Introduction to Memoria
Introduction to MemoriaIntroduction to Memoria
Introduction to Memoria
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Distributed caching with java JCache
Distributed caching with java JCacheDistributed caching with java JCache
Distributed caching with java JCache
 
Introduce_non-volatile_generic_object_programming_model_for_In-Memory_Computing
Introduce_non-volatile_generic_object_programming_model_for_In-Memory_ComputingIntroduce_non-volatile_generic_object_programming_model_for_In-Memory_Computing
Introduce_non-volatile_generic_object_programming_model_for_In-Memory_Computing
 
IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...
IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...
IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Recent advancements in cache technology
Recent advancements in cache technologyRecent advancements in cache technology
Recent advancements in cache technology
 
Caching principles-solutions
Caching principles-solutionsCaching principles-solutions
Caching principles-solutions
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
 
MSE
MSEMSE
MSE
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Silverstripe at scale - design & architecture for silverstripe applications
Silverstripe at scale - design & architecture for silverstripe applicationsSilverstripe at scale - design & architecture for silverstripe applications
Silverstripe at scale - design & architecture for silverstripe applications
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Grids
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 

More from datamantra

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
datamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
datamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
datamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 

More from datamantra (20)

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 

Recently uploaded

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 

Recently uploaded (20)

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 

Anatomy of in memory processing in Spark

  • 1. Anatomy of in-memory processing in Spark A deep-dive into custom memory management in Spark https://github.com/shashankgowdal/introduction_to_dataset
  • 2. ● Shashank L ● Big data consultant and trainer at datamantra.io ● www.shashankgowda.com
  • 3. Agenda ● Era of in-memory processing ● Big data frameworks on JVM ● JVM memory model ● Custom memory management ● Allocation ● Serialization ● Processing ● Benefits of these on Spark
  • 4. Era of in-memory processing ● After Spark, in memory has become a defacto standard for big data workloads ● Advancement in hardware is pushing more frameworks in that direction ● Memory management is coupled with runtime of the framework ● Using memory efficiently in a big data workload is a challenging task ● Memory management depends upon runtime of the framework
  • 5. Why JVM is a prominent runtime for big data workloads ● Managed runtime ● Portable ● Hadoop was on JVM ● Rich eco system
  • 6. Big data frameworks on JVM ● Many frameworks today runs on JVM today ○ Spark ○ Flink ○ Hadoop ○ etc ● Organising data in memory ○ In-memory processing ○ In-memory caching of intermediate results ● Memory management influences ○ Resource efficiency ○ Performance
  • 7. Straight-forward approach ● JVM memory model approach ● Store collection of objects and perform any processing on the collection. ● Advantages ○ Eases development cycle ○ Built in safety checks before modifying any of the memory ○ Reduces complexity ○ JVM built-in GC
  • 8. JVM memory model - Disadvantages ● Predicting memory consumption is hard ○ If it fails, OutOfMemory error kills the JVM ● High garbage collection overhead ○ Easily 50% of the time spent in GC ● Objects have space overhead ○ JVM objects doesn’t take the same amount of memory as we think.
  • 9. Java object overhead ● Consider a string “abcd” as a JVM object. By looking at it, it should take up 4 bytes (one per character) of memory.
  • 10. Garbage collection challenges ● Many big data workloads create objects in a way that are unfriendly to regular Java GC. ● Young generation garbage collection is frequent. ● Objects created in big data workloads tend to live in Young generation itself because they are used few times.
  • 11. Generality has a cost, so semantics and schema should be used to exploit specificity instead
  • 12. Custom memory management ● Allocation - Allocate fixed number of segments upfront. ● Serialization - Data objects are serialized into memory segments ● Processing - Implement algorithms on binary representation.
  • 14. Managing memory on our own ● sun.misc.Unsafe ● Directly manipulating memory without safety checks. (hence, its unsafe) ● This API is used to build data structures off heap in Spark.
  • 15. sun.misc.Unsafe ● Unsafe is one of the gateway to low level programming in Java. ● Exposes C-style memory access ● explicit allocation, deallocation, pointer arithmetics ● Unsafe methods are intrinsic
  • 16.
  • 18. Custom memory management in Spark On heap ● Stores data inside an array of type Long ● Capable of storing 16GB at once ● Bytes are encoded in long and stored here Off heap ● Allocates memory in the memory assigned to JVM other than heap ● Uses Unsafe API ● Stores bytes directly
  • 19. Encoding memory addresses ● Off heap: Addresses are raw memory pointers. ● On heap: Addresses are base object + offset pairs ● Spark uses its own page table abstraction to enable more compact encoding of on-heap addresses.
  • 21. Data Structures prominently used in Big data ● Sequence ● Key-Value pair
  • 22. Java object-based row notation ● 3 fields of type (int, string, string) with value (123, “data”,”mantra”) ➔ 5+ objects ➔ high space overhead ➔ expensive hashCode()
  • 23. Tungsten’s unsafe row format ● Bitset for tracking null values ● Every column appears in the fixed-length value region ○ Fixed length variables are inclined ○ For variable length values, we store a relative offset into the variable length data section ● Rows are always 8 byte aligned ● Equality comparison can be done on raw bytes.
  • 24. Example of unsafe row null tracking bitmap (123, “data”,”mantra”)
  • 26. java.util.HashMap ... array ● Huge object overheads ● Poor memory locality ● Size estimation is hard
  • 27. Tungsten BytesToBytesMap ... ● Low overheads ● Good memory locality, especially for scans
  • 29. Many big data workloads are now compute bound ● Network optimizations can only reduce job completion time by a median of at most 2%. ● Optimizing or eliminating disk accesses can only reduce job completion time by a median of at most 19% ● [1]
  • 31. Why is CPU the new bottleneck ● Hardware has improved ○ 1Gbps to 10Gbps link in networks ○ High B/W SSDs or Stripped HDD arrays ● Spark IO has been optimized ○ Many workloads now avoid significant disk IO by pruning data that is not needed in a given job ○ New shuffle and network layer implementations ● Data formats have improved ○ Parquet, binary data formats ● Serialization and Hashing are CPU-bound bottlenecks
  • 32. Code generation ● Generic evaluation of expression logic is very expensive on the JVM ○ Virtual function calls ○ Branches based on expression type ○ Object creation due to primitive boxing ○ Memory consumption by boxed primitive objects ● Generating the code which directly applies the expression logic on serialized data
  • 33. Which Spark API can be benefited Spark dataframes SparkSQL RDD
  • 34. Why only Dataframes are benefited? Python DF Java/Scala DF R DF Logical Plan Physical execution Catalyst optimizer Spark SQL Physical execution RDD API
  • 35. Runtime bytecode generation Dataframe code Catalyst expressions Low level bytecode
  • 38. Aggregation optimization in DataFrame Input row Grouping keyUnsafeRow BytesToBytesMapIterate Update in place Probe ProjectConvert Scan
  • 39. Performance results with optimizations (Run time)
  • 40. Performance results with optimizations (GC Time)
  • 41. References ● https://www.eecs.berkeley.edu/~keo/publications/nsdi15- final147.pdf ● https://databricks.com/blog/2015/04/28/project-tungsten- bringing-spark-closer-to-bare-metal.html ● https://spark-summit.org/2015/events/deep-dive-into- project-tungsten-bringing-spark-closer-to-bare-metal/ ● https://gist.github.com/raphw/7935844 ● http://www.bigsynapse.com/addressing-big-data- performance