Apache Spark
Fundamentals
Training
Eren Avşaroğulları
@Workday
Dublin – May 10, 2018
Agenda
 What is Apache Spark?
 Spark Ecosystem & Terminology
 How to create RDDs
 Operation Types (Transformations & Actions)
 Job Lifecycle
 RDD Evolution (DataFrames and DataSets)
 Persistency
 Clustering / Spark on YARN
 Job Scheduling
(Icon legend: marks slides that show code samples)
Bio
 B.Sc. & M.Sc. in Electronics & Control Engineering
 Sr. Software Engineer @ Workday
 Currently working on Data Analytics:
Data Transformations & Cleaning
erenavsarogullari
What is Apache Spark?
 Distributed Compute Engine
 Project started in 2009 at UC Berkeley
 First version (v0.5) released in June 2012
 Moved to Apache Software Foundation in 2013
 + 1200 contributors / +15K forks on Github
 Supported Languages: Java, Scala, Python and R
 spark-packages.org => ~405 Extensions
 Apache Bahir => http://bahir.apache.org/
 Community vs Enterprise Editions =>
https://databricks.com/product/comparing-databricks-to-apache-spark
Spark Ecosystem
Libraries: Spark SQL | Spark Streaming (Classical & Structured) | MLlib | GraphX
Engine: Spark Core Engine
Deployment: Local Mode | Cluster Mode (Standalone, YARN, Mesos, Kubernetes)
Terminology
 RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
 Application: An instance of SparkContext / SparkSession. Single per JVM.
 Job: Triggered by an action operator; a unit of computation.
 DAG: Directed Acyclic Graph. The execution plan of a job (a.k.a. RDD dependency graph).
 Driver: The program/process running the job over the Spark engine.
 Executor: The process executing tasks.
 Worker: The node running executors.
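A minimal bootstrap sketch for the terms above, assuming Local Mode and an illustrative application name (not code from the training applications):

import org.apache.spark.sql.SparkSession

// One SparkSession (and its underlying SparkContext) per JVM = the Application.
val sparkSession = SparkSession.builder()
  .appName("spark-fundamentals-training")   // illustrative name
  .master("local[*]")                       // Local Mode; a cluster manager URL is used in Cluster Mode
  .getOrCreate()

val sc = sparkSession.sparkContext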
How to create an RDD?
 Collection parallelize
 By loading a file
 Transformations
 Let's see the sample => Application-1
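Not the actual Application-1 code, but a sketch of the three creation paths above, reusing the sc from the previous snippet (collection values are illustrative; the text file path reuses the word-count sample later in the deck):

// 1) Collection parallelize
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2) By loading a file (one element per line)
val linesRDD = sc.textFile("src/main/resources/vivaldi_life.txt")

// 3) Transformations: every transformation returns a new RDD
val wordsRDD = linesRDD.flatMap(_.split(" "))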
RDD Operation Types
Two types of Spark operations on RDDs:
 Transformations: lazily evaluated (not computed immediately)
 Actions: trigger the computation and return a value
Data => Transformations => RDD => Actions => Value
High-Level Spark Data Processing Pipeline: Source => Transformation Operators => Sink
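A small sketch of the two operation types, continuing from the previous snippets (values are illustrative):

// Transformation: lazily evaluated, only extends the lineage, nothing is computed yet.
val upperCased = wordsRDD.map(_.toUpperCase)

// Action: triggers the computation on the executors and returns a value to the driver.
val totalWords = upperCased.count()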
Transformations
 map(func)
 flatMap(func)
 filter(func)
 union(dataset)
 join(dataset, usingColumns: Seq[String])
 intersect(dataset)
 coalesce(numPartitions)
 repartition(numPartitions)
Full List: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
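A sketch exercising some of the operators above with illustrative data (note that on plain RDDs the set operator is named intersection(), and join() works on key/value pair RDDs):

val evens  = sc.parallelize(1 to 10).filter(_ % 2 == 0)
val smalls = sc.parallelize(1 to 4)

val combined   = evens.union(smalls)          // all elements of both RDDs
val common     = evens.intersection(smalls)   // elements present in both (RDD API name)
val fewerParts = combined.coalesce(2)         // shrink partition count (narrow)
val moreParts  = combined.repartition(8)      // full shuffle into 8 partitions (wide)

// join() on key/value pair RDDs
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val towns  = sc.parallelize(Seq(("alice", "Dublin")))
val joined = ages.join(towns)                 // => ("alice", (30, "Dublin"))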
Transformations’ Classifications
 Narrow Transformations (1 x 1): each partition of the parent RDD feeds exactly one partition of the child RDD.
 Wide Transformations (1 x n, Shuffles): a child partition may depend on many parent partitions.
Shuffles require:
- Disk I/O
- Ser/De
- Network I/O
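A short illustrative sketch of the two classes; the comments describe the partition dependency, not the training applications:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow (1 x 1): each child partition depends on exactly one parent partition.
val incremented = pairs.mapValues(_ + 1)

// Wide (1 x n): all values of a key must be brought together, so reduceByKey
// shuffles data across partitions, paying disk I/O, ser/de and network I/O.
val sums = pairs.reduceByKey(_ + _)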
Actions
 first()
 take(n)
 collect()
 count()
 saveAsTextFile(path)
Full List: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
Let's see the sample => Application-2
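Not the actual Application-2 code; a sketch of the actions listed above (the output path is illustrative):

val words = sc.textFile("src/main/resources/vivaldi_life.txt").flatMap(_.split(" "))

val firstWord = words.first()            // first element
val someWords = words.take(5)            // first 5 elements as an Array
val wordCount = words.count()            // number of elements
val allWords  = words.collect()          // ALL elements to the driver - only for small RDDs
words.saveAsTextFile("target/words-out") // writes one part file per partition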
RDD Dependencies (Lineage)
[Diagram: an example job DAG over RDDs 1-7 split into Stages 1-3. Narrow transformations (map, union) stay inside a stage; wide transformations (sort, join) introduce shuffles and therefore stage boundaries.]
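The lineage of a job can be inspected with RDD.toDebugString; a small sketch (indentation in the output changes at shuffle boundaries, which is where Spark cuts the job into stages):

val wordCounts = sc.textFile("src/main/resources/vivaldi_life.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))      // narrow transformations stay in the same stage
  .reduceByKey(_ + _)          // wide transformation => shuffle => new stage

println(wordCounts.toDebugString)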
Job Lifecycle
Execution Tiers
=> The main program is executed on the Spark Driver
=> Transformations are executed on the Spark Workers
=> Actions return the results from the workers to the driver
val wordCountTuples: Array[(String, Int)] = sparkSession.sparkContext
.textFile("src/main/resources/vivaldi_life.txt")
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.collect()
wordCountTuples.foreach(println)
RDD Evolution
RDD (v1.0, 2011) => DataFrame (v1.3, 2013) => DataSet (v1.6, 2015)
 RDD: low-level data structure of Java objects; suited to unstructured data.
 DataFrame: untyped API; schema-based, tabular; SQL support; suited to semi-structured (csv, json) and structured (jdbc) data.
 DataSet: typed API [T]; tabular; SQL support.
 DataFrame & DataSet benefit from three tiers of optimization: Catalyst Optimizer, Cost Optimizer and Project Tungsten.
How to create the DataFrame & DataSet?
 By loading a file (spark.read.format("csv").load())
 SparkSession.createDataFrame(RDD, schema)
 SparkSession.createDataset(collection or RDD)
 Let's see the code => Application-3 and Application-4-1/4-2
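Not the actual Application-3/4 code; a sketch of the three creation paths above with a hypothetical people.csv file and Person schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import sparkSession.implicits._

// 1) By loading a file (path and options are illustrative)
val csvDF = sparkSession.read.format("csv").option("header", "true").load("src/main/resources/people.csv")

// 2) createDataFrame(RDD[Row], schema)
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val rowRDD = sparkSession.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val peopleDF = sparkSession.createDataFrame(rowRDD, schema)

// 3) createDataset(collection) with the typed API [T]
case class Person(name: String, age: Int)
val peopleDS = sparkSession.createDataset(Seq(Person("Alice", 30), Person("Bob", 25)))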
Persistency
Storage Mode => Details
MEMORY_ONLY => Store RDD as deserialized Java objects in the JVM.
MEMORY_AND_DISK => Store RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are stored on disk.
MEMORY_ONLY_SER => Store RDD as serialized Java objects (Kryo serialization can be used).
MEMORY_AND_DISK_SER => Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
DISK_ONLY => Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2 => Same as the levels above, but replicate each partition on two cluster nodes.
 RDD / DF.persist(newStorageLevel: StorageLevel)
 RDD.unpersist() => unpersists the RDD from memory and disk
Long-running applications should call unpersist() explicitly to use executor memory efficiently.
Note: when cached data exceeds storage memory, Spark evicts partitions using a Least Recently Used (LRU) policy by default.
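A sketch of persisting and unpersisting an RDD (the storage level is chosen for illustration):

import org.apache.spark.storage.StorageLevel

val wordPairs = sc.textFile("src/main/resources/vivaldi_life.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))

wordPairs.persist(StorageLevel.MEMORY_AND_DISK)  // cache() is shorthand for MEMORY_ONLY

wordPairs.count()      // first action materializes and caches the partitions
wordPairs.count()      // second action is served from the cache

wordPairs.unpersist()  // release executor memory once the RDD is no longer needed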
Clustering / Spark onYARN (client mode)
[Diagram: YARN client-mode architecture]
Job Scheduling
 Single Application
 FIFO
 FAIR
 Across Applications
 Static Allocation
 Dynamic Allocation (Auto / Elastic Scaling)
 spark.dynamicAllocation.enabled
 spark.dynamicAllocation.executorIdleTimeout
 spark.dynamicAllocation.initialExecutors
 spark.dynamicAllocation.minExecutors
 spark.dynamicAllocation.maxExecutors
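A sketch of enabling dynamic allocation through the session builder; the same keys can be passed to spark-submit with --conf. Values are illustrative, and the external shuffle service line reflects an assumption about the target YARN cluster:

import org.apache.spark.sql.SparkSession

val elasticSession = SparkSession.builder()
  .appName("dynamic-allocation-sample")                        // illustrative name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")             // typically required on YARN
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()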
Q & A
Thanks
References
 https://spark.apache.org/docs/latest/
 https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
 https://jaceklaskowski.gitbooks.io/mastering-apache-spark
 https://stackoverflow.com/questions/36215672/spark-yarn-architecture
 High Performance Spark by Holden Karau & Rachel Warren

Editor's Notes

  • #6 Spark SQL: brings semi-structured / structured data support on top of RDDs. Spark Streaming: targets streaming use cases with the DStream data structure, basically a sequence of RDDs; incoming data is split into mini RDDs according to the window size (time or count). MLlib: Spark offers two ML libraries, MLlib and ML; MLlib is the older one and is in maintenance mode, new features are merged into ML. GraphX: targets distributed graph processing. Cluster managers currently supported: Standalone, YARN and Mesos.
  • #10 Repartition creates new partitions, so it can increase the partition count.
  • #11 Repartition creates new partitions, so it can increase the partition count.
  • #16 Project Tungsten aims to use memory and CPU efficiently. Instead of storing Java objects, it creates a binary representation (the Tungsten row format), so it uses less memory and decreases GC overhead. 1 million numbers take around 4 MB as an RDD; the same collection takes about 1 MB as a DataFrame.
  • #18 Spark executor memory is split into the following parts: Execution Memory: 25%, Storage Memory: 50%, User Memory: 25% (metadata and safeguarding against OOM), Reserved Memory: 300 MB. We use unpersist() to unpersist an RDD. When the cached data exceeds the storage memory capacity, Spark automatically evicts the oldest partitions (they are recalculated when needed); this is called the Least Recently Used (LRU) cache policy.