Apache Spark 101
June 2016
Abdullah Cetin CAVDAR
@accavdar
#AnkaraSparkDay
Apache Spark's Goal
Apache Spark
is a fast and general engine for
large-scale data processing
Most Active Project in Big Data
Spark Survey 2015
Top 10 Industries Using Spark
Many Types of Products
Spark Engine
unified engine across diverse workloads &
environments
Programming Languages
Open Source Spark Ecosystem
Most Important Aspects
Spark
Programming
Model
Challenge?
Fast data sharing across parallel
jobs
Data Sharing in MapReduce
Data Sharing in Apache Spark
Components
Cluster Managers
Initializing Apache Spark
SparkConf and SparkContext
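A minimal initialization sketch in Scala (Spark 1.x API; the app name and master URL are placeholder values):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure the application; "local[*]" runs Spark locally with all cores
val conf = new SparkConf()
  .setAppName("Spark101")   // name shown in the Spark UI
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```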
Apache Spark Shell
Python and Scala
RDD (Resilient Distributed Dataset)
An RDD is a read-only collection of objects
partitioned across a set of machines that
can be rebuilt if a partition is lost
RDD
Read-Only = Immutable
Parallelism
Caching
RDD
Partitioned = Distributed
More partitions = More parallelism
RDD
Rebuilt = Resilient
Recover lost data partitions
By replaying data lineage
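One way to see the lineage Spark would replay, sketched with toDebugString (the path is a placeholder; sc is the usual SparkContext):

```scala
// Each transformation records its parent, forming the lineage graph
val upper = sc.textFile("hdfs:///data/input.txt")
  .map(_.toUpperCase)
  .filter(_.nonEmpty)

// Prints the chain of parent RDDs used to rebuild lost partitions
println(upper.toDebugString)
```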
RDD Operations
Partitions
logical division of data / basic unit of
parallelism
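A quick sketch of inspecting and changing the partition count (the path and numbers are placeholders):

```scala
// Ask for at least 8 input partitions when reading
val data = sc.textFile("hdfs:///data/input.txt", minPartitions = 8)
println(data.partitions.length)  // current number of partitions

// repartition() performs a full shuffle to the new partition count
val wider = data.repartition(16)
```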
RDD Lineage
Lazy Evaluation
DAG (Directed Acyclic Graph)
Transformation & Action
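A sketch of the lazy model: transformations only build the DAG, an action triggers execution.

```scala
val nums = sc.parallelize(1 to 1000)      // create an RDD
val squares = nums.map(n => n * n)        // transformation: nothing runs yet
val evens = squares.filter(_ % 2 == 0)    // still lazy
val total = evens.count()                 // action: the DAG is executed now
```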
RDD Creation
Parallelizing a collection
held in driver application memory
suitable for prototyping and testing only
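For example (the collection and partition count are arbitrary):

```scala
// Distribute a local collection across 4 partitions
val letters = sc.parallelize(Seq("a", "b", "c", "d"), 4)
```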
Loading an external data set
file://, hdfs://, s3n://
sc.textFile()
sc.hadoopFile(), sc.newAPIHadoopFile()
sqlContext.read()
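A sketch of both loading styles (paths are placeholders; sc and sqlContext are the shell-provided contexts):

```scala
val lines = sc.textFile("hdfs:///data/logs/*.txt")            // RDD[String], one element per line
val events = sqlContext.read.json("file:///tmp/events.json")  // DataFrame (Spark 1.x SQLContext)
```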
Word Count :)
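The classic example, sketched in Scala (input and output paths are placeholders):

```scala
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split("\\s+"))  // split each line into words
  .map(word => (word, 1))               // pair each word with a count of 1
  .reduceByKey(_ + _)                   // sum the counts per word on the workers

counts.saveAsTextFile("hdfs:///data/wordcounts")  // action: triggers the whole job
```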
Driver & Workers
Main program is executed on the Driver
Transformations are executed on the Workers
Actions transfer results from the Workers to the Driver
The Driver cannot get data from executors except through actions and accumulators
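For example, actions are the only way results reach the driver, so keep them small:

```scala
val lines = sc.textFile("hdfs:///data/input.txt")  // placeholder path
val firstTen = lines.take(10)  // action: ships only 10 elements to the driver
// lines.collect() would ship everything; safe only for small data sets
```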
RDD Dependencies
Minimize shuffle / wide dependencies
RDD Persistence / Caching
persist() or
cache()
Without caching, computation restarts from the first RDD in the lineage
LRU (Least Recently Used)
Default Storage Level: MEMORY_ONLY
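A sketch of reusing a cached RDD (the path is a placeholder; cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if memory runs short

words.count()  // first action computes the RDD and fills the cache
words.count()  // served from the cache; the input file is not re-read
```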
Storage Levels
Shared Variables
Accumulators and Broadcast
Variables
Accumulators
Used to implement counters or sums
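A sketch counting blank lines as a side effect of a job (Spark 1.x sc.accumulator API; Spark 2.x renames this to longAccumulator):

```scala
val blankLines = sc.accumulator(0, "blankLines")
val lines = sc.textFile("hdfs:///data/input.txt")  // placeholder path

lines.foreach { line =>
  if (line.trim.isEmpty) blankLines += 1  // workers may only add to it
}

println(blankLines.value)  // only the driver may read the value
```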
Broadcast Variables
Keep a read-only variable cached on each
machine
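A sketch shipping a small lookup table to every executor once (the table and codes are made-up sample data):

```scala
// Broadcast once instead of serializing the map into every task
val lookup = sc.broadcast(Map("TR" -> "Turkey", "DE" -> "Germany"))

val codes = sc.parallelize(Seq("TR", "DE", "TR"))
val names = codes.map(c => lookup.value.getOrElse(c, "unknown"))
names.collect()  // Array(Turkey, Germany, Turkey)
```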
Spark UI
Default port 4040
Deploying to a Cluster
Use spark-submit
Data Frames & Performance
Distributed collection of rows organized into named columns
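A short DataFrame sketch (Spark 1.x SQLContext; the JSON path and column names are placeholders):

```scala
val people = sqlContext.read.json("file:///tmp/people.json")
people.printSchema()  // schema is inferred from the JSON

people.filter(people("age") > 21)  // expressed on named columns,
  .groupBy("city")                 // so Catalyst can optimize the plan
  .count()
  .show()
```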
Tips
Avoid groupByKey and wide dependencies (see the sketch below)
Use a sufficient number of partitions
Use coalesce() to avoid creating too many small files
Be careful with serialization/deserialization costs
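A sketch contrasting the two aggregations (sample data is made up):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Prefer: combines values per key on each partition before the shuffle
val sums = pairs.reduceByKey(_ + _)

// Avoid for aggregations: every value crosses the network and is held per key
val grouped = pairs.groupByKey().mapValues(_.sum)

// coalesce() reduces partitions without a full shuffle, e.g. before writing,
// so the job does not produce many tiny output files
val compact = sums.coalesce(1)
```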
Major Features in 2.0
Thank you
#AnkaraSparkDay
