3. SnappyData
What is Apache Spark?
• A computational engine for distributed data processing
• A programming paradigm (Scala, Java, R, Python) to make distributed
processing easy and efficient to use
• Combine analytics from SQL, streaming, machine learning, graphs and
any other custom library
• Process data in HDFS, Hive, JDBC and any other data source
4. SnappyData
Interactive workloads?
• Faster than alternatives like MapReduce (claims of an order of magnitude
faster or more)
• Is it suited for interactive queries and real-time analytics?
• Basic paradigm is still batch processing. Micro-batches for streaming.
• What makes it faster?
5. SnappyData
Speed claims
• Execution model to optimize arbitrary operator graphs
• Forces developers to think in terms of operators/transformations that can
be optimized.
• Uses memory where possible. Off-heap memory since 1.4/1.5.
• Optimizes resource management
– “Executors” stick around for the entire application (unlike MapReduce, where
each job spawns a new set of JVMs)
– Makes referencing previous task results within the same application efficient
7. SnappyData
Job scheduling
• An application spawns its own driver and set of executors
• Jobs in an application use the same set of executors and share
resources
• Spark's FAIR and FIFO scheduling for jobs in an application
• Pools with different scheduling policies and weights
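A minimal sketch of assigning jobs to a pool (the allocation file path and pool name are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")                            // enable FAIR scheduling
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

sc.setLocalProperty("spark.scheduler.pool", "interactive")        // jobs from this thread use the "interactive" pool
sc.parallelize(1 to 1000000).count()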
8. SnappyData
Resilient Distributed Dataset (RDD)
• A distributed collection of objects divided into “partitions”
• Driver creates partitions. Each partition knows how to get to its data.
• Partition can be scheduled on any executor
• Data can be from HDFS, NFS, JDBC, S3 or any other source
• RDD caching in Spark memory and/or disk (RDD.persist)
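A small sketch of caching with RDD.persist (assumes an existing SparkContext sc; the storage level is just an example):
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://…")
lines.persist(StorageLevel.MEMORY_AND_DISK)   // keep partitions in memory, spill to disk if needed
lines.count()                                 // first action materializes and caches the partitions
lines.count()                                 // later actions reuse the cached partitions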
11. SnappyData
Parallel transformations
• All transformations on RDDs to yield new RDDs are parallel
• Partitions are mapped to result RDD partitions (one-to-one, many-to-one
or many-to-many)
• Execution is really bottom up. The final result drives execution.
• Transformations do not result in jobs by themselves.
• Actions result in job creation and submission.
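A quick illustration of this laziness (assumes an existing SparkContext sc):
val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0)           // transformation: no job submitted yet
val doubled = evens.map(_ * 2)                // still no job
val total = doubled.count()                   // action: builds the DAG and submits a job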
12. SnappyData
Transformations and actions
• Mimics Scala collections
• Transformations: map, mapPartitions, filter, groupBy
• PairRDDFunctions: reduceByKey, combineByKey, aggregateByKey
• Actions: collect, count, save
• Jobs create a DAG with as many stages as required (MapReduce can only
have map and reduce stages).
13. SnappyData
Word count
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
14. SnappyData
Word count (MapReduce)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
15. SnappyData
Word count (explanation)
val textFile = spark.textFile("hdfs://…") // Returns RDD[String]
val counts = textFile.flatMap(line => line.split(" "))
Split each line in the RDD into a collection of words. flatMap creates a single
collection of words (instead of a collection of collections, as map would).
.map(word => (word, 1))
Map each word to a (word, count) tuple, with the count starting at 1.
.reduceByKey(_ + _)
Finally, reduce by the key of the tuple from the previous step, which is the word.
The reduction operation is shorthand for: reduceByKey((a, b) => a + b)
16. SnappyData
What's resilient?
• RDDs keep track of their lineage
• Lineage used to efficiently recompute any lost partitions
• In actuality, it is the RDD itself that holds its parent information and knows
how to build a partition iterator from it
• Checkpoint RDDs to “break lineage” and avoid depending on the
availability of the base RDD
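A minimal sketch of breaking lineage with a checkpoint (the directories are illustrative; assumes an existing SparkContext sc):
sc.setCheckpointDir("hdfs://…")

val base = sc.textFile("hdfs://…")
val derived = base.map(_.toUpperCase).filter(_.nonEmpty)
derived.checkpoint()   // marked for checkpointing; written out by the next job
derived.count()        // recovery now reads the checkpoint instead of recomputing from base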
18. SnappyData
RDD execution
• RDD implementation provides partitions
• Can provide “preferred locations” for each partition
• Above are evaluated on the driver JVM
• Optional partitioner for per key partitioning (can result in shuffle)
• Compute method invoked for each partition on the executor where the
partition is scheduled
• Transformations become a chain of compute calling compute of parent
RDD partition
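A sketch of the shape of an RDD implementation, showing where partitions and compute fit (a toy range RDD, not one of Spark's real RDDs):
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition records just enough to produce its data
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, max: Int, numParts: Int)
    extends RDD[Int](sc, Nil) {               // Nil: no parent RDDs, hence no dependencies

  // Evaluated on the driver JVM
  override protected def getPartitions: Array[Partition] = {
    val step = max / numParts
    (0 until numParts).map { i =>
      new RangePartition(i, i * step, if (i == numParts - 1) max else (i + 1) * step)
    }.toArray
  }

  // Invoked on the executor where the partition is scheduled
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}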
19. SnappyData
Dependencies
• Partition-wise dependencies
• Narrow dependency from parent to child: Many-to-one, One-to-one
• Narrow dependencies will cause computations to chain efficiently
• Shuffle dependency: Many-to-many
• Shuffle always creates a new “stage” in a job
• A shuffle will cause data to be completely written to files before moving on to the
next stage (no fsync, so the OS buffer cache helps)
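One way to see the stage boundary is RDD.toDebugString on the word-count pipeline (assumes an existing SparkContext sc):
val counts = sc.textFile("hdfs://…")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// The indented block in the output is the ShuffledRDD, i.e. the shuffle
// dependency that starts a new stage
println(counts.toDebugString)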
20. SnappyData
Word count revisited
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.groupByKey()
.map { case (word, ones) => (word, ones.sum) }
counts.saveAsTextFile("hdfs://…")
21. SnappyData
reduceByKey and groupByKey
• Both cause shuffle
• groupByKey will result in a shuffle of the whole data first
• reduceByKey will shuffle after having “reduced” each partition
• Always use reduceByKey, combineByKey where possible
22. SnappyData
Spark SQL
• Familiar SQL and HiveQL interfaces.
• Catalyst engine with optimizer
• Queries like:
select avg(age), profession from people group by profession
• Catalyst engine will automatically choose reduceByKey path for
aggregates like AVG that support partial evaluation
• DataFrame API for query and operations on table data
• DataSources API to access structured data via tables
23. SnappyData
DataFrame
• Mostly syntactic sugar around RDD[Row] and schema
• Like RDDs, transformations return new DataFrames
• A LogicalPlan of DataFrame encapsulates the execution plan
• Delegates to a SparkPlan for actual Spark execution
• SparkPlan.doExecute() returns the underlying result RDD[InternalRow]
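A small sketch of inspecting those pieces on an existing DataFrame df (queryExecution is a developer-facing API in Spark 1.x):
df.explain(true)                                // prints the logical and physical plans
val logical = df.queryExecution.logical         // the LogicalPlan behind the DataFrame
val physical = df.queryExecution.sparkPlan      // the SparkPlan chosen for execution
val rows = df.rdd                               // drops down to an RDD[Row] for plain RDD operations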
24. SnappyData
Example
val df = context.sql("""create table person(
Name String NOT NULL,
Age Int NOT NULL, Profession String NOT NULL)
using jdbc options (URL 'jdbc:gemfirexd://host:port',
Driver 'com.pivotal.gemfirexd.jdbc.ClientDriver')
""")
val result = context.table("person").groupBy("profession").agg(avg("age"), col("profession"))
result.collect().foreach(println)
val result2 = context.sql(
"select avg(age), profession from person group by profession")
result2.collect().foreach(println)
25. SnappyData
Spark Streaming
• Micro-batch processing for streaming data
• DStream[T] encapsulates an infinite sequence of RDD[T]
• Operations like foreachRDD()
• Fault-tolerance utilizing RDD resilience and streaming source resilience
(e.g. Kafka)
• Combine easily with batch and interactive queries
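A minimal streaming word-count sketch along these lines (the socket source host/port are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-wordcount")
val ssc = new StreamingContext(conf, Seconds(1))        // 1-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)     // illustrative source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.foreachRDD { rdd =>                              // each micro-batch arrives as an ordinary RDD
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()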