Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. Some key components of Apache Spark include Resilient Distributed Datasets (RDDs), DataFrames, Datasets, and Spark SQL for structured data processing. Spark also supports streaming, machine learning via MLlib, and graph processing with GraphX.
An engine to process big data in a faster (than MapReduce), easier and extremely scalable way. An open-source, parallel, in-memory, cluster computing framework. A solution for loading, processing and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command-line interface.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, which includes interactive queries and stream processing.
Spark was developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
This document shares some basic knowledge about Apache Spark.
An introduction into Spark ML plus how to go beyond when you get stuck (Data Con LA)
Abstract:-
This talk will introduce Spark's new machine learning framework (Spark ML) and how to train basic models with it. A companion Jupyter notebook will be provided for people to follow along. Once we've got the basics down, we'll look at what to do when we need more than the tools available in Spark ML (and I'll try to convince people to contribute to my latest side project -- Sparkling ML).
Bio:-
Holden Karau is a transgender Canadian, Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo. Outside of computers she enjoys scootering and playing with fire.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
How Apache Spark fits into the Big Data landscape (Paco Nathan)
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for streaming ETL that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address fast data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
End-to-end Data Governance with Apache Avro and Atlas (DataWorks Summit)
Aeolus is Comcast’s new internal Big Data system for providing access to an integrated view of a wide variety of high-quality, near-real-time and batch data. Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There typically are few standards on naming conventions and data representation, and spotty documentation at best. The old rule of thumb often applies: 70% of the analysts’ time goes into data wrangling, while only 30% goes toward the actual analyses and simulations. The goal of the Athene Data Governance Platform within Aeolus is to invert this ratio. This talk will explain how Comcast is using Apache Avro and Atlas for end-to-end data governance, the challenges faced, and methods used to address these challenges.
Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published for community consumption must have an associated avro schema in Atlas. Every step in its journey through Aeolus, in flight or at rest, is captured in Atlas. Atlas’ extensibility has allowed us to add or update various entity types (e.g., avro schemas, kafka topics, object store pseudo-directories) and lineage types (e.g., storing streaming data in object storage; embellishing and re-publishing streaming data; performing aggregations and other transformations on data at rest; and evolution of schemas with compatibility flags). Transformation services notify Atlas of lineage links via custom asynchronous kafka messaging.
Atlas provides self-service data discovery and lineage browsing and querying, via full-text search, DSL query language, or gremlin graph query language. Example queries: “Where is data from kafka topic X stored?” “Display the journey of data currently stored in pseudo-directory X since it entered the Aeolus system”. “Show me all earlier versions of schema S, and whether they are forward/backward compatible with each other.”
Running Non-MapReduce Big Data Applications on Apache Hadoop (hitesh1892)
Apache Hadoop has become popular from its specialization in the execution of MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for various other processing paradigms such as real-time streaming, graph processing and message-passing. That was true until the introduction of Apache Hadoop YARN in Apache Hadoop 2.0. YARN supports running arbitrary processing paradigms on the same Hadoop cluster. This allows for development of newer frameworks as well as more efficient implementations of existing frameworks that can all run on and share the resources of a single multi-tenant YARN cluster. This talk gives a brief introduction to YARN. We will illustrate how to create applications and how to best make use of YARN. We will show examples of different applications such as Apache Tez and Apache Samza that can leverage YARN and present best practices/guidelines on building applications on top of Apache Hadoop YARN.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 (Databricks)
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
This presentation covers the basics of Apache Spark, with details about its machine learning module. At the end, a demo shows the machine learning pipeline with Spark, how to install a standalone cluster on a local machine, and how to deploy an application on the Spark cluster.
Big Data Processing with Apache Spark 2014 (mahchiev)
Apache Spark™ is a fast and general engine for large-scale data processing. It has gained enormous popularity recently with its speed and ease of use and is currently replacing traditional Hadoop MapReduce. We'll talk about:
1. What is Big Data?
2. The Map-Reduce paradigm
3. What does Apache Spark do?
4. Finally, we'll make a quick demo
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily; it supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is a recording of the parts of the screen a user clicks on while web browsing.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talk is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS (Amazon Web Services)
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges.
In this webinar, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures and best practices to quickly create Spark clusters using Amazon Elastic MapReduce (EMR), and ways to use Spark with Amazon Redshift, Amazon DynamoDB, Amazon Kinesis, and other big data applications in the Apache Hadoop ecosystem.
Learning Objectives:
Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing
How to deploy and tune scalable clusters running Spark on Amazon EMR
How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3
Common architectures to leverage Spark with DynamoDB, Redshift, Kinesis, and more
Extending Spark Streaming to Support Complex Event Processing (Oh Chan Kwon)
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms like PageRank often operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, reduces duplicate computation and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce iteration time and iteration count, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Agenda
• What is Apache Spark ?
• Spark Ecosystem
• High Level Architecture
• Key Terminologies
• Spark-submit and Deploy modes
• RDD, DataFrames, Datasets, Spark SQL
.. and a few other concepts
3. Apache Spark – brief history
• Apache Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.
• In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0.
• In February 2014, Spark became a Top-Level Apache Project.
4. What is Apache Spark ?
• Apache Spark is a unified analytics engine for big data processing, with built-in modules for
• Batch & streaming applications
• SQL
• Machine learning
• Graph processing
Essentially, it is an in-memory analytics engine for large-scale data processing in distributed systems.
7. Key Components & Terminologies
Driver
- The process running the main() function of the application and creating the SparkContext
Worker Node
- Any node that can run application code in the cluster
Executor
- A process launched for an application on a worker node, which runs 'Tasks'
- Each application has its own set of executors
Cluster Manager
- An external service for acquiring resources on the cluster (e.g. Standalone manager, YARN, Mesos, Kubernetes)
Spark Context
- Entry gateway to the Spark cluster, created by the Spark Driver
- Allows the Spark application to access the cluster with the help of the Cluster Manager
- Requires a SparkConf to be created
- In 2.x versions, a SparkSession is created, which contains the SparkContext
Spark Conf
- Contains the cluster-level configuration passed on to the Spark Context
- Can also be set at the application level
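Below is a minimal sketch (Spark 2.x, Scala; the app name and master URL are illustrative) of how SparkConf, SparkSession, SparkContext and the driver relate:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object TerminologyDemo {
      // The JVM process running main() is the Driver
      def main(args: Array[String]): Unit = {
        // Application-level configuration, handed to the context/session
        val conf = new SparkConf()
          .setAppName("terminology-demo")   // illustrative name
          .setMaster("local[*]")            // cluster manager URL; local[*] runs in one JVM

        // In Spark 2.x, SparkSession wraps the SparkContext
        val spark = SparkSession.builder().config(conf).getOrCreate()
        val sc = spark.sparkContext         // the driver's gateway to the cluster

        println(s"Running ${sc.appName} on ${sc.master}")
        spark.stop()
      }
    }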
8. Spark deployment – Client vs Cluster mode
Cluster mode :
- The Spark driver runs inside an application master process which is managed by YARN on the cluster
- The client can go away after initiating the application
Client mode :
- The driver runs in the client process, and the application master is only used for requesting resources from YARN
9. spark-submit : the script used to submit a Spark application in client or cluster mode
https://spark.apache.org/docs/latest/submitting-applications.html
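A typical invocation might look like the following (the class name, resource sizes and jar path are all illustrative; the flags are standard spark-submit options):

    # Cluster mode on YARN: the driver runs inside the YARN application master
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-cores 5 \
      --executor-memory 4G \
      myapp.jar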
10. Spark Web UI – used to monitor the status and resource consumption of the Spark cluster
11. Spark API – RDD, DataFrames, Datasets, Spark SQL
Resilient Distributed Datasets (RDD)
- The fundamental data structure of Spark
- An immutable, distributed collection of objects, partitioned across the nodes of the Spark cluster
- Each RDD has multiple partitions; the more partitions, the greater the parallelism
- A low-level API based on transformations and actions
DataFrame
- An immutable distributed collection of objects
- Data is organized into named columns, like a table in a relational database
- Untyped API, i.e. of type Dataset[Row]
Datasets
- Typed API, i.e. Dataset[T]
- Available in Scala & Java
Spark SQL
- Provides the ability to write SQL statements to process structured data
- The DataFrame/Dataset/Spark SQL APIs are optimized and leverage Apache Spark performance optimizations such as the Catalyst optimizer and Tungsten off-heap memory management
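A short REPL-style sketch showing the same data touched through each API (continuing the SparkSession above; the Person case class is illustrative):

    import spark.implicits._
    case class Person(name: String, age: Int)

    // RDD: low-level distributed collection of objects
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 41)))

    // DataFrame: named columns, i.e. Dataset[Row] (untyped)
    val df = rdd.toDF()

    // Dataset: typed API, Dataset[Person] (Scala/Java only)
    val ds = df.as[Person]

    // Spark SQL: register a view and query it with SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 35").show()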
13. RDD Operations – Transformations, Actions
Transformations :
- Apply a function on an RDD to create a new RDD (RDDs are immutable)
- Transformations are lazy in nature
- Spark maintains the record of operations using a DAG
- 'Narrow' transformations do not cause a data shuffle, e.g. map, filter
- 'Wide' transformations cause a data shuffle, e.g. groupByKey()
Actions :
- Execution happens only when an 'Action' is invoked, e.g. count(), saveAsTextFile(), reduce()
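A small sketch of lazy transformations followed by an action (continuing the sc from the sketch above):

    val nums = sc.parallelize(1 to 10)

    // Narrow transformations: no shuffle, and nothing executes yet
    val doubled = nums.map(_ * 2)
    val evens   = doubled.filter(_ % 4 == 0)

    // The action triggers execution of the recorded DAG
    println(evens.count())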
14. Apache Spark – support for SQL windowing functions, joins
• The Spark SQL/DataFrame/Dataset APIs support 3 types of windowing functions:
• Ranking functions: rank, dense_rank, percent_rank, row_number
• Analytic functions: cume_dist, lag, lead
• Aggregate functions: sum, avg, min, max, count
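A sketch of a ranking window function (assumes an illustrative employees DataFrame with dept and salary columns):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val byDept = Window.partitionBy("dept").orderBy(col("salary").desc)

    // Top 3 earners per department
    employees.withColumn("rnk", rank().over(byDept))
      .filter(col("rnk") <= 3)
      .show()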
16. Broadcast Join (or Broadcast hash join)
• Used to optimize join queries when the size of the smaller table is below the property spark.sql.autoBroadcastJoinThreshold
• Similar to a map-side join in Hadoop
• The smaller table is put in memory, and the join avoids sending all data of the larger table across the network
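A sketch of hinting a broadcast join explicitly (events and countries are illustrative DataFrames):

    import org.apache.spark.sql.functions.broadcast

    // Ship the small `countries` table to every executor, so the large
    // `events` table is joined without being shuffled across the network
    val joined = events.join(broadcast(countries), Seq("country_code"))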
17. Data Shuffle in Apache Spark
• What is shuffle ?
• The process of data transfer between stages
• Redistributes data across Spark partitions (aka re-partitioning)
• Data moves across JVM processes, or even across the wire (between executors on different machines)
• Shuffle is expensive and should be avoided where possible
• It involves disk I/O, data serialization and network I/O
18. Data Shuffle in Apache Spark
• Operations that cause shuffle include
• Repartition operations like repartition & coalesce
• ByKey operations like groupByKey, reduceByKey
• Join operations like cogroup, join
• To avoid/reduce shuffle
• Use shared variables (broadcast variables, accumulators)
• Filter input earlier in the program rather than later
• Use reduceByKey or aggregateByKey instead of groupByKey
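A word-count sketch contrasting the two (continuing sc from above): groupByKey shuffles every raw pair, while reduceByKey pre-aggregates within each partition first:

    val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

    // Shuffles every (word, 1) pair, then sums on the reducer side
    val viaGroup  = words.groupByKey().mapValues(_.sum)

    // Combines locally before the shuffle, moving far less data
    val viaReduce = words.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)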
19. Shared variables – broadcast variables
Broadcast variable
• Allows users to keep a 'read-only' variable cached on each worker node, rather than shipping a copy of it with tasks
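A sketch of a broadcast lookup table (the map contents are illustrative):

    // Cached once per executor instead of shipped with every task
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(c => countryNames.value.getOrElse(c, "unknown"))
    named.collect().foreach(println)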
20. Shared variables - accumulators
Accumulators
• Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel.
• They can be used to implement counters (as in MapReduce) or sums.
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
• Accumulators are shipped to worker nodes
• Worker nodes can add to an accumulator, but cannot read its content
• Only the driver program can read the accumulated value
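A sketch using a built-in long accumulator as an error counter:

    val badRecords = sc.longAccumulator("bad-records")

    val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
      try Some(s.toInt)
      catch { case _: NumberFormatException => badRecords.add(1); None }
    }

    parsed.count()              // the action runs the tasks
    println(badRecords.value)   // only the driver reads the total
    // Caveat: updates made inside transformations may be re-applied if tasks are retried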
21. Dynamic Allocation
• Allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
• When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request more executors.
• Set to 'false' by default
• To enable, set the property 'spark.dynamicAllocation.enabled' to true
• Other properties to set :
• spark.dynamicAllocation.initialExecutors (default value: spark.dynamicAllocation.minExecutors)
• spark.dynamicAllocation.maxExecutors
• spark.dynamicAllocation.minExecutors
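A sketch of the relevant settings via SparkConf (the executor counts are illustrative; on YARN, dynamic allocation typically also needs the external shuffle service enabled):

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")   // usually required on YARN
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.initialExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")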
22. Spark Storage levels
• Spark RDDs and DataFrames provide the capability to specify the storage level when we persist an RDD/DataFrame
• Storage levels provide trade-offs between memory usage and CPU efficiency
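A sketch of persisting with an explicit storage level (df is any illustrative DataFrame):

    import org.apache.spark.storage.StorageLevel

    // Keep what fits in memory, spill the remainder to local disk
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()        // the first action materializes the cache
    df.unpersist()    // release it when no longer needed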
23. Spark Streaming
• Spark Streaming
• Uses the DStream API
• Powered by the Spark RDD APIs
• The DStream API divides source data into micro-batches and, after processing, sends them to the destination
• Not 'true' streaming
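A minimal DStream word count (socket source; the host and port are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))   // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()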
24. Structured Streaming
• Released in Spark 2.x
• Leverages the Spark SQL API to process data
• Each row of the data stream is processed and the result is updated into the unbounded result table
• 'True' streaming
• Ability to handle late-arriving data (using watermarks)
• The user can determine the frequency of data processing using triggers
• Write-ahead logs (WAL) are used to identify processed data and ensure end-to-end exactly-once semantics and fault tolerance; WALs are stored in checkpoint locations (e.g. in HDFS)
26. Structured Streaming : Output modes
Data gets appended to the input table at the specified trigger interval.
1. Complete Mode
2. Append Mode
3. Update Mode (available since Spark 2.1.1 – only updated rows are written to the sink)
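The same word count as a Structured Streaming query, with an explicit output mode (socket source again; host/port illustrative):

    import org.apache.spark.sql.functions._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines
      .select(explode(split(col("value"), " ")).as("word"))
      .groupBy("word").count()

    val query = counts.writeStream
      .outputMode("complete")   // complete | append | update
      .format("console")
      .start()

    query.awaitTermination()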
28. Watermarking in Structured Streaming is a way to limit state in all stateful streaming operations by specifying how much late data to consider.
Watermark set as (max event time - '10 mins')
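A sketch of a 10-minute watermark on a windowed aggregation (assumes an illustrative streaming DataFrame events with an eventTime timestamp column):

    import org.apache.spark.sql.functions._

    val windowed = events
      .withWatermark("eventTime", "10 minutes")   // state for older data can be dropped
      .groupBy(window(col("eventTime"), "5 minutes"), col("userId"))
      .count()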
29. Machine Learning using Apache Spark
• MLlib - Spark's machine learning library
• The DataFrame-based API is the primary API for ML on Apache Spark
• Provides tools for
• ML algorithms: common algorithms such as classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
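A compact Pipeline sketch (the feature columns and toy data are illustrative; assumes spark.implicits._ is in scope):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    val training = Seq((0.5, 1.0, 1.0), (2.1, 0.0, 0.0), (0.3, 1.2, 1.0))
      .toDF("f1", "f2", "label")

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // Featurization and model fitting chained as one reusable pipeline
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    model.transform(training).select("label", "prediction").show()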
30. Graph processing using Apache Spark
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Key features
• Seamlessly work with both graphs and collections.
• Comparable performance to the fastest specialized graph processing
systems.
• Libraries available include
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
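A PageRank sketch over a tiny hand-built graph (continuing sc; the vertex and edge data are illustrative):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

    val graph = Graph(vertices, edges)
    // Iterate until the ranks converge within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach(println)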
31. Delta Lake
• Delta Lake is an open source project with the Linux Foundation.
• Key features :
• Provides ACID transaction functionality in data lakes
• Delta Lake provides DML APIs to merge, update and delete datasets.
• Schema enforcement
• Time Travel (snapshots/versioning)
• Schema Evolution
• Audit History
• 100% compatible with Apache Spark APIs
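A sketch of writing a Delta table and reading an earlier version back via time travel (the path is illustrative; assumes the delta-core library is on the classpath):

    val path = "/tmp/events_delta"

    df.write.format("delta").mode("overwrite").save(path)

    // Time travel: read the table as of version 0
    val v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)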
33. Catalyst Optimizer
Apache Spark 2.x leverages the Catalyst optimizer to optimize the Spark execution engine. Catalyst leverages advanced programming-language features (e.g. Scala's pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
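To see Catalyst's work on a query, explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan (employees is the illustrative DataFrame from earlier):

    // Inspect how Catalyst rewrites a filter + aggregate
    employees.filter(col("salary") > 50000)
      .groupBy("dept").count()
      .explain(true)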
34. Project Tungsten
Apache Spark 2.x leverages Tungsten execution to optimize the Spark execution engine. Tungsten's goal is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O, which are considered fast enough).
Optimization features include
- Off-heap memory management using a binary in-memory data representation (the Tungsten row format) and managing memory explicitly
- Cache locality, i.e. cache-aware computations with cache-aware layouts for high cache hit rates
- Whole-stage code generation (aka CodeGen)
35. How to determine the number of executors, cores, and memory for a Spark application?
• With Spark on YARN, there are daemons that run in the background, e.g. the NameNode, Secondary NameNode, DataNode, and the YARN NodeManager/ResourceManager.
• While specifying num-executors, we need to leave aside enough cores (~1 core per node) for these daemons to run smoothly.
• We also need to budget in the resources the YARN ApplicationMaster would need (~1 executor, 1024 MB memory).
• HDFS throughput is maximized with ~5 cores/executor.
• Full memory requested from YARN per executor = spark-executor-memory + memoryOverhead (i.e. ~1.07 * spark-executor-memory, with the default 7% overhead).
36. Tiny Executors (1 core per executor)
• Cluster config -> 10-node cluster, 16 cores/node, 64 GB RAM/node
Configuration Options
• Tiny executors, i.e. 1 core per executor
• --num-executors = 16 * 10 = 160 executors (i.e. 16 executors/node)
• --executor-cores (cores/executor) = 1
• --executor-memory = 64 GB/16 = 4 GB/executor
• Analysis :
• Unable to take advantage of parallelism within a JVM (i.e. not running multiple tasks per executor)
• Shared/cached variables such as broadcast variables and accumulators are replicated in every executor, i.e. 16 times per node
• We are not leaving memory overhead for Hadoop/YARN daemon processes, and we are not counting in the ApplicationManager
• Not good
37. Fat Executors (1 executor per node)
• Cluster config -> 10-node cluster, 16 cores/node, 64 GB RAM/node
Configuration Options
• Fat executors, i.e. 1 executor per node
• --num-executors = 1 * 10 = 10 executors (i.e. 1 executor/node)
• --executor-cores (cores/executor) = 16
• --executor-memory = 64 GB/1 = 64 GB/executor
• Analysis :
• With all 16 cores claimed per executor, nothing is left over for the AM and daemon processes
• HDFS throughput will suffer, and such large heaps result in massive garbage collection pauses
• Not good
38. Balance between Fat and Tiny Executors
• Cluster config -> 10-node cluster, 16 cores/node, 64 GB RAM/node
Configuration Options
• --executor-cores (cores/executor) = 5 (recommended for max HDFS throughput)
• Leave 1 core per node for Hadoop/YARN daemons: cores available per node = 16 - 1 = 15
• Total available cores = 15 * 10 = 150
• Number of available executors (total cores / cores per executor) = 150/5 = 30
• Leaving 1 executor for the YARN AM -> --num-executors = 29
• Number of executors per node = 30/10 = 3
• Memory per executor = 64 GB/3 ≈ 21 GB
• Subtracting off-heap overhead of 7%: 0.07 * 21 GB ≈ 1.5 GB, so actual --executor-memory = 21 - 1.5 ≈ 19 GB
• Analysis :
• Recommended -> 29 executors, 19 GB memory, and 5 cores each
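Expressed as a spark-submit invocation matching the worked example above (the class name and jar path are illustrative):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 29 \
      --executor-cores 5 \
      --executor-memory 19G \
      --class com.example.MyApp \
      myapp.jar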