Apache Spark
High Level Overview
by
Karan Alang
Agenda
• What is Apache Spark ?
• Spark Ecosystem
• High Level Architecture
• Key Terminologies
• Spark-submit and Deploy modes
• RDD, DataFrames, Datasets, Spark SQL
.. and a few other concepts
Apache Spark – brief history
• Apache Spark was initially started by Matei Zaharia at UC Berkeley's
AMPLab in 2009, and open sourced in 2010 under a BSD license.
• In 2013, the project was donated to the Apache Software Foundation
and switched its license to Apache 2.0.
• In February 2014, Spark became a Top-Level Apache Project.
What is Apache Spark ?
• Apache Spark is a unified analytics engine for big data processing,
with built-in modules for
• Batch & streaming applications
• SQL
• Machine learning
• Graph processing
Essentially, it is an In-memory Analytics engine for large scale data processing in
distributed systems
Apache Spark Ecosystem
Apache Spark – High Level Architecture
Driver
- The process running the main() function of the application and creating the Spark Context
Worker Node
- Any node that can run application code in the cluster
Executor
- A process launched for an application on a worker node, which runs ‘Tasks’
- Each application has its own set of executors
Cluster Manager
- An external service for acquiring resources on the cluster (e.g. Standalone manager, YARN, Mesos, Kubernetes)
Spark Context
- Entry gateway to the Spark Cluster, created by the Spark Driver
- Allows the Spark application to access the Spark cluster with the help of the Cluster Manager
- Requires a SparkConf to create the SparkContext
- In Spark 2.x, a SparkSession is created, which contains the SparkContext
Spark Conf
– Contains the cluster-level configuration passed on to the SparkContext
- SparkConf can also be set at the application level
Key Components & Terminologies
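A minimal sketch in PySpark (app name and local master are illustrative) showing the 1.x SparkConf/SparkContext entry point and the 2.x SparkSession that wraps it:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # Spark 1.x style: a SparkConf is required to create the SparkContext
    conf = SparkConf().setAppName("demo").setMaster("local[2]")
    sc = SparkContext(conf=conf)
    sc.stop()

    # Spark 2.x style: the SparkSession wraps (and exposes) the SparkContext
    spark = (SparkSession.builder
             .appName("demo")
             .master("local[2]")
             .getOrCreate())
    print(spark.sparkContext.appName)   # SparkSession contains the SparkContext
    spark.stop()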
Spark deployment – Client vs Cluster mode
Cluster mode :
- Spark driver runs inside an application master process
which is managed by YARN on the cluster
- the client can go away after initiating the application
Client mode :
- driver runs in the client process, and the application
master is only used for requesting resources from YARN
spark-submit : script used to submit a Spark application in client or cluster mode
https://spark.apache.org/docs/latest/submitting-applications.html
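For illustration, hedged example invocations on YARN (the application file and resource sizes are hypothetical):

    # Cluster mode: the driver runs inside the YARN ApplicationMaster
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 4g \
      my_app.py

    # Client mode: the driver runs in the local client process
    spark-submit --master yarn --deploy-mode client my_app.py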
Spark Web UI – used to monitor the status and resource consumption of the Spark cluster
Resilient Distributed Datasets (RDD)
- fundamental data structure of Spark
- Immutable, Distributed collection of objects partitioned across nodes in the Spark cluster
- Each RDD has multiple partitions; the greater the number of partitions, the greater the parallelism
- Exposes a low-level API based on Transformations and Actions
DataFrame
- Immutable distributed collection of objects
- Data is organized into Named columns, like a table in a Relational database
- Untyped API i.e. of type Dataset[Row]
Datasets
- Typed API i.e. Dataset[T]
- Available in Scala & Java
Spark SQL
- provides the ability to write SQL statements, to process Structured data
- The DataFrame/Dataset/Spark SQL APIs are optimized and leverage Apache Spark performance features such as the
Catalyst Optimizer and Tungsten off-heap memory management.
Spark API - RDD, Data Frames, Datasets, Spark SQL
Spark API - RDD, Data Frames, Datasets, SQL
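A short PySpark sketch contrasting the three APIs on the same data (column names and values are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-demo").master("local[2]").getOrCreate()
    sc = spark.sparkContext

    # RDD: low-level API with transformations and actions
    rdd = sc.parallelize([("alice", 34), ("bob", 29)])
    print(rdd.map(lambda kv: kv[1]).sum())

    # DataFrame: named columns, optimized by Catalyst/Tungsten
    df = spark.createDataFrame(rdd, ["name", "age"])
    df.filter(df.age > 30).show()

    # Spark SQL: the same optimized plan, expressed as SQL text
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()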
RDD Operations – Transformations, Actions
Transformations :
- Apply a function on an RDD to create a new RDD (RDDs are immutable)
- Transformations are lazy in nature
- Spark maintains the record of operations using a DAG
- ‘Narrow’ Transformations do not cause a data shuffle, e.g. map(), filter()
- ‘Wide’ Transformations cause a data shuffle, e.g. groupByKey()
Actions :
- Execution happens only when an ‘Action’ is invoked, e.g. count(), saveAsTextFile(), reduce()
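A small sketch of this laziness in PySpark: the transformations only build up the DAG, and the action triggers execution (values are illustrative):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.parallelize(range(10))
    doubled = rdd.map(lambda x: x * 2)             # narrow transformation, lazy
    evens = doubled.filter(lambda x: x % 4 == 0)   # still lazy, nothing runs yet
    print(evens.count())                           # action: triggers the whole DAG

    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    print(pairs.groupByKey().mapValues(list).collect())  # wide: causes a shuffle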
Apache Spark – support for SQL windowing
function, Joins
• Spark SQL/Dataframe/Dataset API
• Supports 3 types of windowing functions
• Ranking functions
• rank
• dense_rank
• percent_rank
• row_number
• Analytic functions
• cume_dist
• lag
• lead
• Aggregate functions
• sum
• avg
• min
• max
• count
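A hedged PySpark example of ranking and analytic functions over a window (the department/salary data is made up):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("sales", "ann", 100), ("sales", "bob", 200), ("hr", "cat", 150)],
        ["dept", "name", "salary"])

    # Window: one partition per department, ordered by salary descending
    w = Window.partitionBy("dept").orderBy(F.desc("salary"))
    df.select("dept", "name",
              F.rank().over(w).alias("rank"),
              F.row_number().over(w).alias("row_number"),
              F.lag("salary", 1).over(w).alias("prev_salary")).show()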
Joins supported in Apache Spark
• Inner-Join
• Left-Join
• Right-Join
• Outer-Join
• Cross-Join
• Left-Semi-Join
• Left-Anti-Join
• Broadcast join (map-side join)
• Stream-Stream joins
Broadcast Join (or Broadcast hash join)
• Used to optimize join queries when the size of the smaller table is below
property – spark.sql.autoBroadcastJoinThreshold
• Similar to map-side join in Hadoop
• Smaller table is put in memory, and the join avoids sending all data of the larger
table across the network
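A sketch of forcing a broadcast hash join with the broadcast() hint (tiny inline tables stand in for a real large/small pair):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    large = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
    small = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "code"])

    # broadcast() forces a broadcast hash join regardless of
    # spark.sql.autoBroadcastJoinThreshold (10 MB by default)
    joined = large.join(F.broadcast(small), "id")
    joined.explain()   # physical plan should show BroadcastHashJoin
    joined.show()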
Data Shuffle in Apache Spark
• What is Shuffle ?
• Process of data transfer between
Stages
• Redistributes data across spark
partitions (aka re-partitioning)
• Data moves across JVM processes, or even across the wire (between executors on different machines)
• Shuffle is expensive and should be avoided where possible
• Involves disk I/O, data
serialization and network I/O
Data Shuffle in Apache Spark
• Operations that cause shuffle include
• Repartition operations like repartition & coalesce
• ByKey operations like groupByKey, reduceByKey
• Join operations like cogroup, join
• To avoid/reduce shuffle
• Use shared variables (Broadcast variables, accumulators)
• Filter input earlier in the program rather than later.
• Use reduceByKey or aggregateByKey instead of groupByKey
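To make the groupByKey vs reduceByKey point concrete, a small PySpark sketch (toy data):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    # groupByKey shuffles every value across the network before summing
    sums_gbk = pairs.groupByKey().mapValues(sum).collect()

    # reduceByKey combines values map-side first, so less data is shuffled
    sums_rbk = pairs.reduceByKey(lambda a, b: a + b).collect()

    print(sorted(sums_gbk) == sorted(sums_rbk))   # True: same result, less shuffle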
Shared variables – broadcast variables
Broadcast variable
• Allows users to keep a ‘Read-only’ variable cached on each worker node, rather than
shipping a copy of it with tasks
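A minimal broadcast-variable sketch in PySpark (the lookup table is illustrative):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    country_names = {"US": "United States", "IN": "India"}   # read-only lookup
    bcast = sc.broadcast(country_names)   # shipped once per executor, not per task

    rdd = sc.parallelize(["US", "IN", "US"])
    print(rdd.map(lambda c: bcast.value.get(c, "unknown")).collect())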
Shared variables - accumulators
Accumulators
• Accumulators are variables that
are only “added” to through an
associative and commutative
operation and can therefore be
efficiently supported in parallel.
• They can be used to implement
counters (as in MapReduce) or
sums.
• Spark also attempts to distribute
broadcast variables using efficient
broadcast algorithms to reduce
communication cost.
• Accumulators are shipped to worker nodes
• Worker nodes can add to an accumulator, but cannot read its value
• Only the driver program can read
accumulated value
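A sketch of counting bad records with an accumulator (the parse logic and data are illustrative):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext
    bad_records = sc.accumulator(0)       # workers add to it, only the driver reads it

    def parse(line):
        try:
            return int(line)
        except ValueError:
            bad_records.add(1)            # .add() avoids rebinding the closure variable
            return 0

    sc.parallelize(["1", "2", "oops", "4"]).map(parse).collect()
    print(bad_records.value)              # 1, readable only on the driver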
Dynamic Allocation
• Allows Spark to dynamically scale the cluster resources allocated to your
application based on the workload.
• When dynamic allocation is enabled and a Spark application has a backlog
of pending tasks, it can request executors.
• Disabled by default
• To enable, set the property ‘spark.dynamicAllocation.enabled’ to true
• Other properties to set :
• spark.dynamicAllocation.initialExecutors
(defaults to spark.dynamicAllocation.minExecutors)
• spark.dynamicAllocation.maxExecutors
• spark.dynamicAllocation.minExecutors
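A sketch of setting these properties at session build time (executor counts are illustrative; note that on YARN dynamic allocation also needs an external shuffle service or, in Spark 3.x, spark.dynamicAllocation.shuffleTracking.enabled=true):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dyn-alloc-demo")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "1")
             .config("spark.dynamicAllocation.initialExecutors", "2")
             .config("spark.dynamicAllocation.maxExecutors", "10")
             .getOrCreate())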
Spark Storage levels
• Spark RDD and DataFrames - provide the capability to specify the
storage level when we persist RDD/DataFrame
• Storage levels provide trade-offs between memory usage and CPU
efficiency
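A short sketch of persisting with an explicit storage level (dataset sizes are illustrative):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1_000_000)
    df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if memory is short
    df.count()                                 # first action materializes the cache
    df.unpersist()

    rdd = spark.sparkContext.parallelize(range(1000))
    rdd.persist(StorageLevel.MEMORY_ONLY_2)    # replicate each partition on 2 nodes
    rdd.count()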
Spark Streaming
• Spark Streaming
• Uses the DStream API
• Powered by the Spark RDD APIs
• The DStream API divides source data into micro-batches and, after processing, sends results to the destination
• Not ‘true’ streaming
• Structured Streaming
• Released in Spark 2.x
• Leverages Spark SQL API to process data
• Each row of the data stream is processed
and the result is updated into the
unbounded result table
• ‘True’ streaming
• Ability to handle late-arriving data (using watermarks)
• User can control the frequency of data processing using triggers
• Write-Ahead Logs are used to track processed data and ensure end-to-end exactly-once semantics and fault tolerance. WALs are stored in checkpoint locations (e.g. in HDFS)
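A hedged Structured Streaming sketch: a socket word count written to the console (host/port, checkpoint path, and trigger interval are illustrative; run 'nc -lk 9999' to feed it):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Each input line is appended to the unbounded input table
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    counts = (lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
              .groupBy("word").count())

    query = (counts.writeStream
             .outputMode("complete")                               # full result table each batch
             .format("console")
             .option("checkpointLocation", "/tmp/wc-checkpoint")   # WAL/state stored here
             .trigger(processingTime="10 seconds")                 # micro-batch cadence
             .start())
    query.awaitTermination()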
Structured Streaming
Structured Streaming – mapping of events to
tumbling windows
Structured Streaming: data gets appended to the input table at the specified trigger interval
Output modes
1. Complete Mode
2. Append Mode
3. Update Mode (available since Spark
2.1.1; only updated rows are written to the sink)
Structured Streaming –
overlapping windows
Watermarking in Structured Streaming is a way to
limit state in all stateful streaming operations by
specifying how much late data to consider.
Watermark set as (max event time - '10 mins')
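A hedged sketch of watermarking with 10-minute tumbling windows on a hypothetical Kafka source (broker, topic, and column names are assumptions; pass a slide duration as the third argument to F.window() for overlapping windows):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical Kafka source (requires the spark-sql-kafka package)
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events").load()
              .selectExpr("CAST(value AS STRING) AS word",
                          "timestamp AS event_time"))

    windowed = (events
                .withWatermark("event_time", "10 minutes")   # drop state > 10 min late
                .groupBy(F.window("event_time", "10 minutes"), "word")
                .count())

    query = (windowed.writeStream
             .outputMode("update")   # only updated rows go to the sink
             .format("console")
             .start())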
Machine Learning using Apache Spark
• MLlib - Spark’s machine learning library
• The DataFrame-based API is the primary API for ML using Apache Spark
• Provides tools for
• ML Algorithms: common algorithms such as classification, regression,
clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality reduction,
and selection
• Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
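A small MLlib pipeline sketch tying featurization, an algorithm, and persistence together (feature columns and the toy training rows are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [(0.0, 1.0, 2.0, 0.0), (1.0, 0.0, 0.5, 1.0), (0.5, 1.5, 1.0, 0.0)],
        ["f1", "f2", "f3", "label"])

    # Featurization and algorithm chained into a Pipeline
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(maxIter=10)
    model = Pipeline(stages=[assembler, lr]).fit(train)

    model.transform(train).select("label", "prediction").show()
    model.write().overwrite().save("/tmp/lr-model")   # persistence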
Graph processing using Apache Spark
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Key features
• Seamlessly work with both graphs and collections.
• Comparable performance to the fastest specialized graph processing
systems.
• Libraries available include
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
Delta Lake
• Delta Lake is an open source project hosted by the Linux Foundation.
• Key features :
• Provides ACID Transactions functionality in Data
lakes
• Delta Lake provides DML APIs to merge, update
and delete datasets.
• Schema enforcement
• Time Travel (Snapshots/Versioning)
• Schema Evolution
• Audit History
• 100% compatible with the Apache Spark APIs
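A hedged Delta Lake sketch (assumes the delta-spark package is on the classpath, e.g. via --packages io.delta:delta-core_2.12:2.4.0; the table path is illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # ACID writes through the standard Spark DataFrame API
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
    spark.range(5, 10).write.format("delta").mode("append").save("/tmp/delta-table")

    # Time travel: read an earlier snapshot by version number
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
    v0.show()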
Appendix
Catalyst Optimizer
- Leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a
novel way to build an extensible query optimizer
Apache Spark 2.x – leverages Catalyst optimizer
to optimize the Spark execution engine
Project Tungsten
- Aims to improve Spark execution by optimizing Spark jobs for CPU and memory
efficiency (as opposed to network and disk I/O, which are considered fast enough)
Optimization features include
- Off-Heap Memory Management using binary in-memory data representation
aka Tungsten row format and managing memory explicitly
- Cache Locality which is about cache-aware computations with cache-aware
layout for high cache hit rates
- Whole-Stage Code Generation (aka CodeGen)
Apache Spark 2.x – leverages Tungsten Execution
to optimize Spark Execution engine
How to determine number of Executors,
Cores, Memory for spark application?
• With Spark on YARN, there are daemons that run in the background, e.g. NameNode,
Secondary NameNode, DataNode, TaskTracker, JobTracker.
• While specifying num-executors, we need to make sure that we leave aside enough
cores (~1 core per node) for these daemons to run smoothly.
• We need to budget for the resources that the ApplicationMaster (AM) would need (~1 executor, 1024 MB
memory)
• HDFS Throughput
• Is maximized with ~5 cores/executor
• Full memory requested to YARN per executor = spark-executor-memory +
memoryOverhead (i.e. 1.07 * spark-executor-memory)
Tiny Executor (1 Executor/Core)
• Cluster Config -> 10 Nodes cluster, 16 cores/Node, 64 GB RAM/node
Configuration Options
• Tiny Executors
• 1 Executor/Core
• --num-executors = 16 * 10 = 160 executors (i.e. 16 Executors/node)
• --executor-cores (cores/executor) = 1
• --executor-memory = 64 GB/16 = 4GB/executor
• Analysis :
• Unable to take advantage of parallelism (i.e. not running multiple tasks in the same JVM)
• Also, shared/cached variables like broadcast variables and accumulators will be replicated in each of
the 16 executors per node, i.e. 16 times
• Also, we are not leaving enough memory overhead for Hadoop/YARN daemon processes, and we
are not accounting for the ApplicationMaster.
• Not Good
Fat Executor (1 Executor/Node)
• Cluster Config -> 10 Nodes cluster, 16 cores/Node, 64 GB RAM/node
Configuration Options
• Fat Executors
• 1 Executor/Node
• --num-executors = 1 * 10 = 10 executors (i.e. 1 Executors/node)
• --executor-cores (cores/executor) = 16
• --executor-memory = 64 GB/1 = 64GB/executor
• Analysis :
• With all 16 cores taken by a single executor, nothing is left for the AM and daemon processes
• HDFS throughput will suffer, and the huge heap will result in excessive garbage collection
• Not Good
Balance between Fat and Thin Executor
• Cluster Config -> 10 Nodes cluster, 16 cores/Node, 64 GB RAM/node
Configuration Options
• --executor-cores (cores/executor) = 5 (recommended for max HDFS Throughput)
• Leave 1 core for Hadoop/Yarn daemons
• Num cores available per node = 16 - 1 = 15
• Total available cores = 15 * 10 = 150
• Number of available executors (total cores / cores per executor) = 150/5 = 30
• Leaving 1 executor for YARN AM -> --num-executors = 29
• Number of executors/Node = 30/10 = 3
• Memory per executor (--executor-memory) = 64GB/3 = 21 GB
• Subtracting off-heap overhead (~7% of executor memory, rounded up to ~3 GB here), actual --executor-memory = 21 - 3 = 18GB
• Analysis :
• Recommended -> 29 executors, 18GB memory, and 5 cores each
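Putting the recommendation into a submit command (a sketch; the application file is hypothetical):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 29 \
      --executor-cores 5 \
      --executor-memory 18g \
      my_app.py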