3. SnappyData
What is Apache Spark?
• A computational engine for distributed data processing
• A programming paradigm (Scala, Java, R, Python) to make distributed
processing easy and efficient to use
• Combine analytics from SQL, streaming, machine learning, graphs and
any other custom library
• Process data in HDFS, Hive, JDBC and any other data source
4. SnappyData
Interactive workloads?
• Faster than alternatives like MapReduce (claims of an order of magnitude
faster or more)
• Is it suited for interactive queries and real-time analytics?
• Basic paradigm is still batch processing. Micro-batches for streaming.
• What makes it faster?
5. SnappyData
Speed claims
• Execution model to optimize arbitrary operator graphs
• Forces developers to think in terms of operators/transformations that can
be optimized.
• Uses memory where possible. Off-heap memory since 1.4/1.5.
• Optimizes resource management
– “Executors” stick around for the entire application (unlike MapReduce, where
each job spawns a new set of JVMs)
– Makes referencing previous task results within the same application efficient
7. SnappyData
Job scheduling
• An application spawns its own driver and set of executors
• Jobs in an application use the same set of executors and share
resources
• Spark's FAIR and FIFO scheduling for jobs in an application
• Pools with different scheduling policies and weights
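A minimal sketch of assigning jobs to a pool (the allocation file path and pool name are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")                            // enable FAIR scheduling
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

sc.setLocalProperty("spark.scheduler.pool", "interactive")        // jobs from this thread use the "interactive" pool
sc.parallelize(1 to 1000000).count()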
8. SnappyData
Resilient Distributed Dataset (RDD)
• A distributed collection of objects divided into “partitions”
• Driver creates partitions. Each partition knows how to get to its data.
• Partition can be scheduled on any executor
• Data can be from HDFS, NFS, JDBC, S3 or any other source
• RDD caching in Spark memory and/or disk (RDD.persist)
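A small sketch of caching with RDD.persist (assumes an existing SparkContext sc; the storage level is just an example):
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://…")
lines.persist(StorageLevel.MEMORY_AND_DISK)   // keep partitions in memory, spill to disk if needed
lines.count()                                 // first action materializes and caches the partitions
lines.count()                                 // later actions reuse the cached partitions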
11. SnappyData
Parallel transformations
• All transformations on RDDs to yield new RDDs are parallel
• Partitions are mapped to result RDD partitions (one-to-one, many-to-one
or many-to-many)
• Execution is really bottom up. The final result drives execution.
• Transformations do not result in jobs by themselves.
• Actions result in job creation and submission.
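A quick illustration of this laziness (assumes an existing SparkContext sc):
val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0)           // transformation: no job submitted yet
val doubled = evens.map(_ * 2)                // still no job
val total = doubled.count()                   // action: builds the DAG and submits a job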
12. SnappyData
Transformations and actions
• Mimics Scala collections
• Transformations: map, mapPartitions, filter, groupBy
• PairRDDFunctions: reduceByKey, combineByKey, aggregateByKey
• Actions: collect, count, save
• Jobs create a DAG with as many stages as required (MapReduce can only
have map and reduce stages).
13. SnappyData
Word count
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
14. SnappyData
Word count (MapReduce)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
15. SnappyData
Word count (explanation)
val textFile = spark.textFile("hdfs://…") // Returns RDD[String]
val counts = textFile.flatMap(line => line.split(" "))
Split each line in the RDD into a collection of words. flatMap creates a single
collection of words (instead of a collection of collections, as map would).
.map(word => (word, 1))
Map each word to a (word, count) tuple, with the count starting at 1.
.reduceByKey(_ + _)
Finally, reduce by the key of the tuple from the previous step, which is the word.
The reduction operation is shorthand for: reduceByKey((a, b) => a + b)
16. SnappyData
What's resilient?
• RDDs keep track of their lineage
• Lineage used to efficiently recompute any lost partitions
• In actuality, it is the RDD itself that holds its parent information and knows
how to build a partition iterator from it
• Checkpoint RDDs to “break lineage” and avoid depending on the
availability of the base RDD
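A minimal sketch of breaking lineage with a checkpoint (the directories are illustrative; assumes an existing SparkContext sc):
sc.setCheckpointDir("hdfs://…")

val base = sc.textFile("hdfs://…")
val derived = base.map(_.toUpperCase).filter(_.nonEmpty)
derived.checkpoint()   // marked for checkpointing; written out by the next job
derived.count()        // recovery now reads the checkpoint instead of recomputing from base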
18. SnappyData
RDD execution
• RDD implementation provides partitions
• Can provide “preferred locations” for each partition
• Above are evaluated on the driver JVM
• Optional partitioner for per key partitioning (can result in shuffle)
• Compute method invoked for each partition on the executor where the
partition is scheduled
• Transformations become a chain of compute calling compute of parent
RDD partition
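A sketch of the shape of an RDD implementation, showing where partitions and compute fit (a toy range RDD, not one of Spark's real RDDs):
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition records just enough to produce its data
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, max: Int, numParts: Int)
    extends RDD[Int](sc, Nil) {               // Nil: no parent RDDs, hence no dependencies

  // Evaluated on the driver JVM
  override protected def getPartitions: Array[Partition] = {
    val step = max / numParts
    (0 until numParts).map { i =>
      new RangePartition(i, i * step, if (i == numParts - 1) max else (i + 1) * step)
    }.toArray
  }

  // Invoked on the executor where the partition is scheduled
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}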
19. SnappyData
Dependencies
• Partition-wise dependencies
• Narrow dependency from parent to child: Many-to-one, One-to-one
• Narrow dependencies will cause computations to chain efficiently
• Shuffle dependency: Many-to-many
• Shuffle always creates a new “stage” in a job
• A shuffle will cause data to be completely written to files before moving on to the
next stage (no fsync, so the OS buffer cache helps)
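One way to see the stage boundary is RDD.toDebugString on the word-count pipeline (assumes an existing SparkContext sc):
val counts = sc.textFile("hdfs://…")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// The indented block in the output is the ShuffledRDD, i.e. the shuffle
// dependency that starts a new stage
println(counts.toDebugString)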
20. SnappyData
Word count revisited
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.groupByKey()
.map { case (word, ones) => (word, ones.sum) }
counts.saveAsTextFile("hdfs://…")
21. SnappyData
reduceByKey and groupByKey
• Both cause shuffle
• groupByKey will result in a shuffle of the whole data first
• reduceByKey will shuffle after having “reduced” each partition
• Always use reduceByKey, combineByKey where possible
22. SnappyData
Spark SQL
• Familiar SQL and HiveQL interfaces.
• Catalyst engine with optimizer
• Queries like:
select avg(age), profession from people group by profession
• Catalyst engine will automatically choose reduceByKey path for
aggregates like AVG that support partial evaluation
• DataFrame API for query and operations on table data
• DataSources API to access structured data via tables
23. SnappyData
DataFrame
• Mostly syntactic sugar around RDD[Row] and schema
• Like RDDs, transformations return new DataFrames
• A LogicalPlan of DataFrame encapsulates the execution plan
• Delegates to a SparkPlan for actual Spark execution
• SparkPlan.doExecute() returns the underlying result RDD[InternalRow]
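A small sketch of inspecting those pieces on an existing DataFrame df (queryExecution is a developer-facing API in Spark 1.x):
df.explain(true)                                // prints the logical and physical plans
val logical = df.queryExecution.logical         // the LogicalPlan behind the DataFrame
val physical = df.queryExecution.sparkPlan      // the SparkPlan chosen for execution
val rows = df.rdd                               // drops down to an RDD[Row] for plain RDD operations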
24. SnappyData
Example
val df = context.sql("""create table person(
Name String NOT NULL,
Age Int NOT NULL, Profession String NOT NULL)
using jdbc options (URL 'jdbc:gemfirexd://host:port',
Driver 'com.pivotal.gemfirexd.jdbc.ClientDriver')
""")
val result = context.table("person").groupBy("profession").agg(avg("age"), col("profession"))
result.collect().foreach(println)
val result2 = context.sql(
"select avg(age), profession from person group by profession")
result2.collect().foreach(println)
25. SnappyData
Spark Streaming
• Micro-batch processing for streaming data
• DStream[T] encapsulates an infinite sequence of RDD[T]
• Operations like foreachRDD()
• Fault-tolerance utilizing RDD resilience and streaming source resilience
(e.g. Kafka)
• Combine easily with batch and interactive queries
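A minimal streaming word-count sketch along these lines (the socket source host/port are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-wordcount")
val ssc = new StreamingContext(conf, Seconds(1))        // 1-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)     // illustrative source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.foreachRDD { rdd =>                              // each micro-batch arrives as an ordinary RDD
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()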