Apache Spark
Fundamentals
Training
Eren Avşaroğulları
@Workday
Dublin – May 10, 2018
Agenda
 What is Apache Spark?
 Spark Ecosystem & Terminology
 How to create RDDs
 Operation Types (Transformations & Actions)
 Job Lifecycle
 RDD Evolution (DataFrames and DataSets)
 Persistency
 Clustering / Spark on YARN
 Job Scheduling
(Icon legend: marks slides that show code samples)
Bio
 B.Sc. & M.Sc. in Electronics & Control Engineering
 Sr. Software Engineer @ Workday
 Currently working on Data Analytics:
Data Transformations & Cleaning
erenavsarogullari
What is Apache Spark?
 Distributed Compute Engine
 Project started in 2009 at UC Berkeley
 First version (v0.5) released in June 2012
 Moved to Apache Software Foundation in 2013
 + 1200 contributors / +15K forks on Github
 Supported Languages: Java, Scala, Python and R
 spark-packages.org => ~405 Extensions
 Apache Bahir => http://bahir.apache.org/
 Community vs Enterprise Editions =>
https://databricks.com/product/comparing-databricks-to-apache-spark
Spark Ecosystem
Libraries: Spark SQL | Spark Streaming (Classical & Structured) | MLlib | GraphX
Engine: Spark Core Engine
Deployment: Local Mode | Cluster Mode (Standalone, YARN, Mesos, Kubernetes)
Terminology
 RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
 Application: An instance of SparkContext / SparkSession. Single per JVM.
 Job: Triggered by an action operator; a unit of computation.
 DAG: Directed Acyclic Graph. The execution plan of a job (a.k.a. RDD dependency graph).
 Driver: The program/process running the job over the Spark engine.
 Executor: The process executing tasks.
 Worker: The node running executors.
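A minimal bootstrap sketch for the terms above, assuming Local Mode and an illustrative application name (not code from the training applications):

import org.apache.spark.sql.SparkSession

// One SparkSession (and its underlying SparkContext) per JVM = the Application.
val sparkSession = SparkSession.builder()
  .appName("spark-fundamentals-training")   // illustrative name
  .master("local[*]")                       // Local Mode; a cluster manager URL is used in Cluster Mode
  .getOrCreate()

val sc = sparkSession.sparkContext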
How to create an RDD?
 Collection parallelize
 By loading a file
 Transformations
 Let's see the sample => Application-1
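Not the actual Application-1 code, but a sketch of the three creation paths above, reusing the sc from the previous snippet (collection values are illustrative; the text file path reuses the word-count sample later in the deck):

// 1) Collection parallelize
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2) By loading a file (one element per line)
val linesRDD = sc.textFile("src/main/resources/vivaldi_life.txt")

// 3) Transformations: every transformation returns a new RDD
val wordsRDD = linesRDD.flatMap(_.split(" "))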
RDD Operation Types
Two types of Spark operations on RDDs:
 Transformations: lazily evaluated (not computed immediately)
 Actions: trigger the computation and return a value
Data => Transformations => RDD => Actions => Value
High-Level Spark Data Processing Pipeline: Source => Transformation Operators => Sink
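A small sketch of the two operation types, continuing from the previous snippets (values are illustrative):

// Transformation: lazily evaluated, only extends the lineage, nothing is computed yet.
val upperCased = wordsRDD.map(_.toUpperCase)

// Action: triggers the computation on the executors and returns a value to the driver.
val totalWords = upperCased.count()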
Transformations
 map(func)
 flatMap(func)
 filter(func)
 union(dataset)
 join(dataset, usingColumns: Seq[String])
 intersect(dataset)
 coalesce(numPartitions)
 repartition(numPartitions)
Full List: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
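A sketch exercising some of the operators above with illustrative data (note that on plain RDDs the set operator is named intersection(), and join() works on key/value pair RDDs):

val evens  = sc.parallelize(1 to 10).filter(_ % 2 == 0)
val smalls = sc.parallelize(1 to 4)

val combined   = evens.union(smalls)          // all elements of both RDDs
val common     = evens.intersection(smalls)   // elements present in both (RDD API name)
val fewerParts = combined.coalesce(2)         // shrink partition count (narrow)
val moreParts  = combined.repartition(8)      // full shuffle into 8 partitions (wide)

// join() on key/value pair RDDs
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val towns  = sc.parallelize(Seq(("alice", "Dublin")))
val joined = ages.join(towns)                 // => ("alice", (30, "Dublin"))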
Transformations’ Classifications
 Narrow Transformations (1 x 1): each partition of the parent RDD feeds exactly one partition of the child RDD.
 Wide Transformations (1 x n, Shuffles): a child partition may depend on many parent partitions.
Shuffles require:
- Disk I/O
- Ser/De
- Network I/O
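A short illustrative sketch of the two classes; the comments describe the partition dependency, not the training applications:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow (1 x 1): each child partition depends on exactly one parent partition.
val incremented = pairs.mapValues(_ + 1)

// Wide (1 x n): all values of a key must be brought together, so reduceByKey
// shuffles data across partitions, paying disk I/O, ser/de and network I/O.
val sums = pairs.reduceByKey(_ + _)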
Actions
 first()
 take(n)
 collect()
 count()
 saveAsTextFile(path)
Full List: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
Let's see the sample => Application-2
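Not the actual Application-2 code; a sketch of the actions listed above (the output path is illustrative):

val words = sc.textFile("src/main/resources/vivaldi_life.txt").flatMap(_.split(" "))

val firstWord = words.first()            // first element
val someWords = words.take(5)            // first 5 elements as an Array
val wordCount = words.count()            // number of elements
val allWords  = words.collect()          // ALL elements to the driver - only for small RDDs
words.saveAsTextFile("target/words-out") // writes one part file per partition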
RDD Dependencies (Lineage)
[Diagram: an example job DAG over RDDs 1-7 split into Stages 1-3. Narrow transformations (map, union) stay inside a stage; wide transformations (sort, join) introduce shuffles and therefore stage boundaries.]
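The lineage of a job can be inspected with RDD.toDebugString; a small sketch (indentation in the output changes at shuffle boundaries, which is where Spark cuts the job into stages):

val wordCounts = sc.textFile("src/main/resources/vivaldi_life.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))      // narrow transformations stay in the same stage
  .reduceByKey(_ + _)          // wide transformation => shuffle => new stage

println(wordCounts.toDebugString)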
Job Lifecycle
Execution Tiers
=> The main program is executed on the Spark Driver
=> Transformations are executed on the Spark Workers
=> Actions return the results from the workers to the driver
val wordCountTuples: Array[(String, Int)] = sparkSession.sparkContext
.textFile("src/main/resources/vivaldi_life.txt")
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.collect()
wordCountTuples.foreach(println)
RDD Evolution
RDD (v1.0, 2011) => DataFrame (v1.3, 2013) => DataSet (v1.6, 2015)
 RDD: low-level data structure of Java objects; suited to unstructured data.
 DataFrame: untyped API; schema-based, tabular; SQL support; suited to semi-structured (csv, json) and structured (jdbc) data.
 DataSet: typed API [T]; tabular; SQL support.
 DataFrame & DataSet benefit from three tiers of optimization: Catalyst Optimizer, Cost Optimizer and Project Tungsten.
How to create the DataFrame & DataSet?
 By loading a file (spark.read.format("csv").load())
 SparkSession.createDataFrame(RDD, schema)
 SparkSession.createDataset(collection or RDD)
 Let's see the code => Application-3 and Application-4-1/4-2
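Not the actual Application-3/4 code; a sketch of the three creation paths above with a hypothetical people.csv file and Person schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import sparkSession.implicits._

// 1) By loading a file (path and options are illustrative)
val csvDF = sparkSession.read.format("csv").option("header", "true").load("src/main/resources/people.csv")

// 2) createDataFrame(RDD[Row], schema)
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val rowRDD = sparkSession.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val peopleDF = sparkSession.createDataFrame(rowRDD, schema)

// 3) createDataset(collection) with the typed API [T]
case class Person(name: String, age: Int)
val peopleDS = sparkSession.createDataset(Seq(Person("Alice", 30), Person("Bob", 25)))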
Persistency
Storage Mode => Details
MEMORY_ONLY => Store RDD as deserialized Java objects in the JVM.
MEMORY_AND_DISK => Store RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are stored on disk.
MEMORY_ONLY_SER => Store RDD as serialized Java objects (Kryo serialization can be used).
MEMORY_AND_DISK_SER => Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
DISK_ONLY => Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2 => Same as the levels above, but replicate each partition on two cluster nodes.
 RDD / DF.persist(newStorageLevel: StorageLevel)
 RDD.unpersist() => unpersists the RDD from memory and disk
Long-running applications should call unpersist() explicitly to use executor memory efficiently.
Note: when cached data exceeds storage memory, Spark evicts partitions using a Least Recently Used (LRU) policy by default.
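A sketch of persisting and unpersisting an RDD (the storage level is chosen for illustration):

import org.apache.spark.storage.StorageLevel

val wordPairs = sc.textFile("src/main/resources/vivaldi_life.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))

wordPairs.persist(StorageLevel.MEMORY_AND_DISK)  // cache() is shorthand for MEMORY_ONLY

wordPairs.count()      // first action materializes and caches the partitions
wordPairs.count()      // second action is served from the cache

wordPairs.unpersist()  // release executor memory once the RDD is no longer needed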
Clustering / Spark onYARN (client mode)
[Diagram: YARN client-mode architecture]
Job Scheduling
 Single Application
 FIFO
 FAIR
 Across Applications
 Static Allocation
 Dynamic Allocation (Auto / Elastic Scaling)
 spark.dynamicAllocation.enabled
 spark.dynamicAllocation.executorIdleTimeout
 spark.dynamicAllocation.initialExecutors
 spark.dynamicAllocation.minExecutors
 spark.dynamicAllocation.maxExecutors
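A sketch of enabling dynamic allocation through the session builder; the same keys can be passed to spark-submit with --conf. Values are illustrative, and the external shuffle service line reflects an assumption about the target YARN cluster:

import org.apache.spark.sql.SparkSession

val elasticSession = SparkSession.builder()
  .appName("dynamic-allocation-sample")                        // illustrative name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")             // typically required on YARN
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()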
Q & A
Thanks
References
 https://spark.apache.org/docs/latest/
 https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
 https://jaceklaskowski.gitbooks.io/mastering-apache-spark
 https://stackoverflow.com/questions/36215672/spark-yarn-architecture
 High Performance Spark by Holden Karau & Rachel Warren

Editor's Notes

  • #6 Spark SQL: brings semi-structured / structured data support on top of RDDs. Spark Streaming: targets streaming use cases with the DStream data structure, basically a sequence of RDDs; incoming data is split into mini RDDs according to the window size (time or count). MLlib: Spark offers two ML libraries, MLlib and ML; MLlib is the older one and is in maintenance mode, new features are merged into ML. GraphX: targets distributed graph processing. Cluster managers currently supported: Standalone, YARN and Mesos.
  • #10 Repartition creates new partitions, so it can increase the partition count.
  • #11 Repartition creates new partitions, so it can increase the partition count.
  • #16 Project Tungsten aims to use memory and CPU efficiently. Instead of storing Java objects, it creates a binary representation (the Tungsten row format), so it uses less memory and decreases GC overhead. 1 million numbers take around 4 MB as an RDD; the same collection takes about 1 MB as a DataFrame.
  • #18 Spark executor memory is split into the following parts: Execution Memory: 25%, Storage Memory: 50%, User Memory: 25% (metadata and safeguarding against OOM), Reserved Memory: 300 MB. We use unpersist() to unpersist an RDD. When the cached data exceeds the storage memory capacity, Spark automatically evicts the oldest partitions (they are recalculated when needed); this is called the Least Recently Used (LRU) cache policy.