This presentation was given at the Budapest Spark Meetup.
http://www.meetup.com/Budapest-Spark-Meetup/events/229462749/
The video of the presentation can be found here: http://ustre.am/1uMAT
Let's Spark together - Beginners Edition!
We will code our first Spark program together, and I will explain the process step by step. We will dig deep into what happens when we run our program, and see how to set up a Spark environment at home.
Bio:
Mate is Co-founder and CTO of enbrite.ly, a Budapest-based startup with the vision to create the next-generation decision support system in online advertising, one that covers the market needs of the future. Mate has many years of experience with Big Data architectures, data analytics pipelines, infrastructure operations, and growing organizations by focusing on culture. Besides enbrite.ly, he is Chief Architect at Dmlab, a leading data analytics company providing innovative data products and services. Mate teaches Big Data analytics at the Budapest University of Technology and Economics and runs courses for companies. He is a speaker at local and international conferences and meetups.
9. DRIVER PROGRAM
Your main function. This is what you write.
It launches parallel operations on the cluster.
The driver accesses Spark through a SparkContext, which represents
your connection to the computing cluster and lets you create RDDs.
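As a minimal PySpark sketch of a driver setting up its SparkContext (the app name and the local master URL are placeholder assumptions; point setMaster at your own cluster):

from pyspark import SparkConf, SparkContext

# Placeholder app name and master URL -- adjust for your environment.
conf = SparkConf().setAppName("first-spark-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Through the SparkContext the driver creates RDDs and launches work.
rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())  # the action runs in parallel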
21. LIFECYCLE OF A SPARK PROGRAM
1. READ DATA FROM EXTERNAL SOURCE
2. CREATE LAZILY EVALUATED TRANSFORMATIONS
3. CACHE ANY INTERMEDIATE RDD TO REUSE
4. KICK IT OFF BY CALLING SOME ACTION
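A minimal sketch of these four steps in PySpark (the input file name is a placeholder):

lines = sc.textFile("input.txt")                # 1. read from external source
words = lines.flatMap(lambda l: l.split(" "))   # 2. lazy transformations --
pairs = words.map(lambda w: (w, 1))             #    nothing has run yet
pairs.cache()                                   # 3. cache intermediate RDD for reuse
counts = pairs.reduceByKey(lambda a, b: a + b)  #    still lazy
print(counts.take(5))                           # 4. the action kicks off the job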
23. RDD INTERNALS
RDD INTERFACE
➔ set of PARTITIONS
➔ list of DEPENDENCIES on PARENT RDDs
➔ functions to COMPUTE a partition given parents
➔ preferred LOCATIONS (optional)
➔ PARTITIONER for K/V pairs (optional)
24. MULTIPLE RDDs
/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

/** Implemented by subclasses to return the set of partitions in this RDD. */
protected def getPartitions: Array[Partition]

/** Implemented by subclasses to return how this RDD depends on parent RDDs. */
protected def getDependencies: Seq[Dependency[_]] = deps

/** Optionally overridden by subclasses to specify placement preferences. */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
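Spark's concrete RDDs implement exactly this contract: the RDD returned by map(), for example, computes a partition by applying the function to its parent's iterator, while an RDD reading from HDFS reports the hosts holding each block as its preferred locations.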
37. STAGE
❏ Unit of execution
❏ Named after the last transformation
(the one runJob was called on)
❏ Transformations pipelined together into
stages
❏ Stage boundary usually means shuffling
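To make the pipelining visible, a small PySpark sketch: the narrow flatMap and map run pipelined in one stage, while reduceByKey forces a shuffle and hence a stage boundary; toDebugString prints the lineage with these boundaries:

pairs = (sc.parallelize(["a a b", "b c"])
           .flatMap(lambda s: s.split(" "))     # narrow: pipelined in one stage
           .map(lambda w: (w, 1)))              # narrow: same stage
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffle -> new stage
print(counts.toDebugString())                   # lineage, indented per stage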
41. Repartitioning
text = sc.textFile("twit1.txt")
nonempty = text.filter(lambda x: len(x) > 0)    # keep non-empty lines
words = nonempty.flatMap(lambda x: x.split(" "))
fwords = words.filter(lambda x: len(x) > 1)     # drop one-character tokens
ones = fwords.map(lambda x: (x, 1))
rp = ones.repartition(6)                        # reshuffle into 6 partitions
result = rp.reduceByKey(lambda l, r: l + r)
result.collect()
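Note that repartition(6) is itself a shuffle: it redistributes the pairs into 6 partitions, and reduceByKey then shuffles again to bring equal keys together, so this job runs as three stages separated by those two shuffle boundaries.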
42. TaskSet
THE PROCESS
RDD Objects ➔ DAG Scheduler ➔ Task Scheduler ➔ Executor
❏ RDD Objects: build the DAG of operators, e.g.
sc.textFile().map().groupBy().filter()
❏ DAG Scheduler: splits the DAG into stages of tasks and submits each
stage as a TaskSet; a stage is ready when all the tasks it depends on
have finished
❏ Task Scheduler: launches tasks, retries failed tasks
❏ Executor: task threads execute the tasks; the block manager stores
and serves blocks
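A PySpark sketch of the operator chain from the diagram (the input file name is a placeholder); calling an action such as collect() is what sends the DAG through this whole pipeline:

rdd = (sc.textFile("input.txt")            # RDD objects: build the DAG lazily
         .map(lambda line: line.lower())
         .groupBy(lambda line: line[:1])   # group lines by first character
         .filter(lambda kv: len(kv[1]) > 1))
# collect() submits a job: the DAG scheduler splits the DAG into stages,
# the task scheduler launches the tasks, and executors run them while the
# block manager stores and serves the blocks they need.
result = rdd.collect()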