Apache Spark 101
June 2016
Abdullah Cetin CAVDAR
@accavdar
#AnkaraSparkDay
Apache Spark's Goal
Apache Spark
is a fast and general engine for
large-scale data processing
Most Active Project in Big Data
Spark Survey 2015
Top 10 Industries Using Spark
Many Types of Products
Spark Engine
unified engine across diverse workloads &
environments
Programming Languages
Open Source Spark Ecosystem
Most Important Aspects
Spark
Programming
Model
Challenge?
Fast data sharing across parallel
jobs
Data Sharing in MapReduce
Data Sharing in Apache Spark
Components
Cluster Managers
Initializing Apache Spark
SparkConf and SparkContext
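A minimal initialization sketch in Scala (Spark 1.x API; the app name and master URL are placeholder values):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure the application; "local[*]" runs Spark locally with all cores
val conf = new SparkConf()
  .setAppName("Spark101")   // name shown in the Spark UI
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```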
Apache Spark Shell
Python and Scala
RDD (Resilient Distributed Dataset)
An RDD is a read-only collection of objects
partitioned across a set of machines that
can be rebuilt if a partition is lost
RDD
Read-Only = Immutable
Parallelism
Caching
RDD
Partitioned = Distributed
More partitions = More parallelism
RDD
Rebuilt = Resilient
Recover lost data partitions
By replaying data lineage
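One way to see the lineage Spark would replay, sketched with toDebugString (the path is a placeholder; sc is the usual SparkContext):

```scala
// Each transformation records its parent, forming the lineage graph
val upper = sc.textFile("hdfs:///data/input.txt")
  .map(_.toUpperCase)
  .filter(_.nonEmpty)

// Prints the chain of parent RDDs used to rebuild lost partitions
println(upper.toDebugString)
```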
RDD Operations
Partitions
logical division of data / basic unit of
parallelism
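A quick sketch of inspecting and changing the partition count (the path and numbers are placeholders):

```scala
// Ask for at least 8 input partitions when reading
val data = sc.textFile("hdfs:///data/input.txt", minPartitions = 8)
println(data.partitions.length)  // current number of partitions

// repartition() performs a full shuffle to the new partition count
val wider = data.repartition(16)
```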
RDD Lineage
Lazy Evaluation
DAG (Directed Acyclic Graph)
Transformation & Action
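A sketch of the lazy model: transformations only build the DAG, an action triggers execution.

```scala
val nums = sc.parallelize(1 to 1000)      // create an RDD
val squares = nums.map(n => n * n)        // transformation: nothing runs yet
val evens = squares.filter(_ % 2 == 0)    // still lazy
val total = evens.count()                 // action: the DAG is executed now
```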
RDD Creation
Parallelizing a collection
held in driver application memory
suitable for prototyping and testing only
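For example (the collection and partition count are arbitrary):

```scala
// Distribute a local collection across 4 partitions
val letters = sc.parallelize(Seq("a", "b", "c", "d"), 4)
```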
Loading an external data set
file://, hdfs://, s3n://
sc.textFile()
sc.hadoopFile(), sc.newAPIHadoopFile()
sqlContext.read()
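A sketch of both loading styles (paths are placeholders; sc and sqlContext are the shell-provided contexts):

```scala
val lines = sc.textFile("hdfs:///data/logs/*.txt")            // RDD[String], one element per line
val events = sqlContext.read.json("file:///tmp/events.json")  // DataFrame (Spark 1.x SQLContext)
```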
Word Count :)
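The classic example, sketched in Scala (input and output paths are placeholders):

```scala
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split("\\s+"))  // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word on the workers

counts.saveAsTextFile("hdfs:///data/wordcounts")  // action: triggers the whole job
```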
Driver & Workers
Main program is executed on the Driver
Transformations are executed on the Workers
Actions transfer results from the Workers to the Driver
The Driver cannot get data from executors except through actions and accumulators
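For example, actions are the only way results reach the driver, so keep them small:

```scala
val lines = sc.textFile("hdfs:///data/input.txt")  // placeholder path
val firstTen = lines.take(10)  // action: ships only 10 elements to the driver
// lines.collect() would ship everything; safe only for small data sets
```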
RDD Dependencies
Minimize shuffle / wide dependencies
RDD Persistence / Caching
persist() or
cache()
Without caching, computation restarts from the first RDD in the lineage
LRU (Least Recently Used)
Default Storage Level: MEMORY_ONLY
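A sketch of reusing a cached RDD (the path is a placeholder; cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if memory runs short

words.count()  // first action computes the RDD and fills the cache
words.count()  // served from the cache; the input file is not re-read
```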
Storage Levels
Shared Variables
Accumulators and Broadcast
Variables
Accumulators
Used to implement counters or sums
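A sketch counting blank lines as a side effect of a job (Spark 1.x sc.accumulator API; Spark 2.x renames this to longAccumulator):

```scala
val blankLines = sc.accumulator(0, "blankLines")
val lines = sc.textFile("hdfs:///data/input.txt")  // placeholder path

lines.foreach { line =>
  if (line.trim.isEmpty) blankLines += 1  // workers may only add to it
}

println(blankLines.value)  // only the driver may read the value
```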
Broadcast Variables
Keep a read-only variable cached on each
machine
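A sketch shipping a small lookup table to every executor once (the table and codes are made-up sample data):

```scala
// Broadcast once instead of serializing the map into every task
val lookup = sc.broadcast(Map("TR" -> "Turkey", "DE" -> "Germany"))

val codes = sc.parallelize(Seq("TR", "DE", "TR"))
val names = codes.map(c => lookup.value.getOrElse(c, "unknown"))
names.collect()  // Array(Turkey, Germany, Turkey)
```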
Spark UI
Default port 4040
Deploying to a Cluster
Use spark-submit
Data Frames & Performance
Distributed collection of rows organized into named columns
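A short DataFrame sketch (Spark 1.x SQLContext; the JSON path and column names are placeholders):

```scala
val people = sqlContext.read.json("file:///tmp/people.json")
people.printSchema()  // schema is inferred from the JSON

people.filter(people("age") > 21)  // expressed on named columns,
  .groupBy("city")                 // so Catalyst can optimize the plan
  .count()
  .show()
```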
Tips
Avoid groupByKey and wide dependencies (see the sketch below)
Use a sufficient number of partitions
Use coalesce() to avoid creating too many small files
Be careful with serialization/deserialization costs
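A sketch contrasting the two aggregations (sample data is made up):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Prefer: combines values per key on each partition before the shuffle
val sums = pairs.reduceByKey(_ + _)

// Avoid for aggregations: every value crosses the network and is held per key
val grouped = pairs.groupByKey().mapValues(_.sum)

// coalesce() reduces partitions without a full shuffle, e.g. before writing,
// so the job does not produce many tiny output files
val compact = sums.coalesce(1)
```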
Major Features in 2.0
Thank you
#AnkaraSparkDay
