2. Goals for the Spark and Spark Streaming Project
• Generality: generalise the framework for diverse workloads.
• Low latency: for small jobs, latency should be sub-second, rather than waiting
several seconds for the job to start.
• Fault tolerance: Spark should handle faults internally rather than
depending on users to treat them as a special case.
3. Need to Understand Internals of Spark
Understand the importance of internals from the perspective of performance.
Example:
Consider a single-core machine where we need to find the position of an integer in an array of
integers. The first intuition is to traverse the array sequentially rather than
iterating through it in random order.
This is obvious only because we know how the CPU cache works, and thus that sequential
access is faster than random access.
The analogous choices may not be obvious in Spark, because Spark's internals work a
little differently.
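The locality intuition above can be made concrete with a small sketch. The code below (hypothetical helper names, plain Scala, no Spark) finds a value by a cache-friendly linear scan, and by visiting the same indices in shuffled order; both return the same answer, but the shuffled walk defeats spatial locality.

```scala
import scala.util.Random

object ScanDemo {
  // Linear scan: visits memory in order, so cache lines and the
  // hardware prefetcher are used effectively.
  def sequentialFind(a: Array[Int], target: Int): Int = {
    var i = 0
    while (i < a.length) {
      if (a(i) == target) return i
      i += 1
    }
    -1
  }

  // Same work, but the indices are visited in a shuffled order,
  // so each access is likely to miss the cache.
  def randomOrderFind(a: Array[Int], target: Int): Int = {
    val order = Random.shuffle((0 until a.length).toList)
    for (i <- order) if (a(i) == target) return i
    -1
  }
}
```

Timing the two on a large array (e.g. a million elements) shows the sequential scan winning, purely because of access pattern, not algorithmic complexity.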
6. Example Job
val sc = new SparkContext(...)
val file = sc.textFile(…)       // RDD backed by a file
val errors = file.filter(…)     // transformation: produces a new RDD
errors.cache()                  // keep the filtered RDD in memory
errors.count()                  // action: triggers execution
7. Resilient Distributed Dataset
An RDD is a read-only, partitioned collection of records: an
'immutable, resilient, distributed collection of records' that can be stored in
volatile memory or in persistent storage (HDFS, HBase, etc.) and can
be converted into another RDD through transformations. An
action such as count can also be applied to an RDD.
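The transformation/action split can be sketched as follows; this assumes an existing SparkContext `sc`, and the input path is hypothetical. Transformations only record lineage and stay lazy; the action at the end runs the actual job.

```scala
// Assumes an existing SparkContext `sc`; the path is hypothetical.
val lines  = sc.textFile("hdfs://example/log.txt")  // base RDD; nothing is read yet
val errors = lines.filter(_.contains("ERROR"))      // transformation: lazy, records lineage
errors.cache()                                      // request in-memory storage for reuse
val n = errors.count()                              // action: runs the job, materializes `errors`
```

Because `errors` was cached, a second action on it (say `errors.first()`) reuses the in-memory partitions instead of re-reading the file.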
11. Overview
Run a streaming computation as a series of very small, deterministic batch jobs
(Spark Streaming layered on top of Spark):
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD
and processes it using RDD operations
- Finally, the processed results of the RDD
operations are returned in batches
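A minimal sketch of this micro-batch setup, assuming the Spark Streaming API with a 1-second batch interval and a hypothetical socket source on localhost:9999:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 1 second: the live stream is chopped into 1s batches,
// each of which becomes an RDD processed with ordinary RDD operations.
val conf   = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc    = new StreamingContext(conf, Seconds(1))
val lines  = ssc.socketTextStream("localhost", 9999)           // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                                 // results emitted batch by batch
ssc.start()
ssc.awaitTermination()
```

Note that `local[2]` is needed rather than `local`, since one thread is occupied by the receiver.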
12. Example: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
                                 // transformation: produces a new DStream
hashTags.saveAsHadoopFiles("hdfs://...")   // output, e.g. #Ebola, #India, #Mars ...