SnappyData
Why is Apache Spark interesting?
An introduction to Apache Spark
Sumedh Wale
Oct 2015
snappydata.io
Agenda
 Introduction
 Apache Spark architecture
 Usage model
 Examples and comparison
 Spark SQL and Streaming
What is Apache Spark?
• A computational engine for distributed data processing
• A programming paradigm (Scala, Java, R, Python) that makes distributed
processing easy to use and efficient
• Combine analytics from SQL, streaming, machine learning, graphs and
any other custom library
• Process data from HDFS, Hive, JDBC and any other data source
Interactive workloads?
• Faster than alternatives like MapReduce (claims of an order of magnitude
or more)
• Is it suited for interactive queries and real-time analytics?
• Basic paradigm is still batch processing. Micro-batches for streaming.
• What makes it faster?
Speed claims
• Execution model to optimize arbitrary operator graphs
• Forces developers to think in terms of operators/transformations
that can be optimized.
• Uses memory where possible. Off-heap memory since 1.4/1.5.
• Optimizes resource management
– “Executors” stick around for the entire application (unlike MapReduce,
where each job spawns a new set of JVMs)
– Makes references to previous task results in the same application efficient
[Diagram: Spark cluster architecture. The Spark Driver submits Jobs through the
Cluster Manager to Workers; each Worker runs an Executor that holds a Cache and
runs Tasks]
Job scheduling
• Each application spawns its own driver and set of executors
• Jobs in an application use the same set of executors and share
resources
• Spark offers FAIR and FIFO scheduling for jobs within an application
• Pools with different scheduling policies and weights (configuration sketch below)
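A minimal configuration sketch (the app name and the "production" pool are
illustrative; pools themselves are defined in a fairscheduler.xml allocation file):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")  // default is FIFO
val sc = new SparkContext(conf)
// Jobs submitted from this thread go to the assumed "production" pool
sc.setLocalProperty("spark.scheduler.pool", "production")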
Resilient Distributed Dataset (RDD)
• A distributed collection of objects divided into “partitions”
• Driver creates partitions. Each partition knows how to get to its data.
• Partition can be scheduled on any executor
• Data can be from HDFS, NFS, JDBC, S3 or any other source
• RDD caching in Spark memory and/or disk via RDD.persist (sketch below)
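A small caching sketch (sc is the SparkContext; the path is illustrative):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://…")
lines.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when memory is short
lines.count()  // the first action materializes the cache; later jobs reuse it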
[Diagram: two jobs, each divided into partitions Part1 to Part6]
FIFO: schedules partitions from the current stage of a job
FAIR: schedules partitions from all queued jobs
[Diagram: partitions Part1 to Part6 of both jobs scheduled together]
Parallel transformations
• All transformations on RDDs to yield new RDDs are parallel
• Partitions are mapped to result RDD partitions (one-to-one, many-to-one
or many-to-many)
• Execution is bottom-up: the final result drives it.
• Transformations do not trigger jobs by themselves.
• Actions result in job creation and submission (sketch below).
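A sketch of that laziness (sc is the SparkContext; the path is illustrative):

// Transformations only build up the RDD lineage; nothing executes yet
val words = sc.textFile("hdfs://…").flatMap(_.split(" "))
val pairs = words.map((_, 1))
// The action creates a job and submits it to the executors
val total = pairs.count()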
Transformations and actions
• The RDD API mimics Scala collections
• Transformations: map, mapPartitions, filter, groupBy
• PairRDDFunctions: reduceByKey, combineByKey, aggregateByKey
• Actions: collect, count, save
• Jobs create a DAG with as many stages as needed (MapReduce can have only
map and reduce stages).
Word count
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
Word count (MapReduce)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    ...
Word count (explanation)
val textFile = spark.textFile("hdfs://…") // Returns RDD[String]
val counts = textFile.flatMap(line => line.split(" "))
  Splits each line of the RDD into words. flatMap creates a single flat
  collection of words (instead of a collection of collections, as map would).
.map(word => (word, 1))
  Maps each word to a (word, 1) tuple: the count starts at 1.
.reduceByKey(_ + _)
  Finally, reduces by the tuple's key from the previous step, which is the
  word. The reduction operation is shorthand for: reduceByKey((a, b) => a + b)
What's resilient?
• RDDs keep track of their lineage
• Lineage is used to efficiently recompute any lost partitions
• In fact the RDD itself holds its parent information and knows how to
build a partition's iterator from it
• Checkpoint RDDs to “break lineage” and avoid depending on the
availability of the base RDD (sketch below)
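A checkpointing sketch (sc is the SparkContext; paths are illustrative):

sc.setCheckpointDir("hdfs://…/checkpoints")
val counts = sc.textFile("hdfs://…").map((_, 1)).reduceByKey(_ + _)
counts.checkpoint()  // marks the RDD; data is written when the first job runs
counts.count()       // afterwards the lineage is truncated at the checkpoint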
RDD demystified
class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    prev: RDD[T], f: (TaskContext, Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false) extends RDD[U](prev) {

  // Keep the parent's partitioner only if f preserves the key partitioning
  override val partitioner =
    if (preservesPartitioning) firstParent[T].partitioner else None

  // Partitions and preferred locations simply delegate to the parent RDD
  override def getPartitions: Array[Partition] = firstParent[T].partitions
  override def getPreferredLocations(split: Partition) =
    firstParent[T].preferredLocations(split)

  // Apply f to the parent partition's iterator: compute chains to compute
  override def compute(split: Partition, ctx: TaskContext): Iterator[U] =
    f(ctx, split.index, firstParent[T].iterator(split, ctx))
}
RDD execution
• RDD implementation provides partitions
• Can provide “preferred locations” for each partition
• The above are evaluated on the driver JVM
• Optional partitioner for per key partitioning (can result in shuffle)
• Compute method invoked for each partition on the executor where the
partition is scheduled
• Transformations become a chain of compute() calls, each invoking compute()
on the parent RDD's partition (minimal custom RDD sketch below)
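A minimal custom RDD sketch (a hypothetical class, not from Spark; assumes
slices evenly divides n):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition computes one contiguous sub-range of [0, n)
class RangeRDD(sc: SparkContext, n: Int, slices: Int)
    extends RDD[Int](sc, Nil) {
  // Evaluated on the driver: describe the partitions
  override def getPartitions: Array[Partition] =
    (0 until slices).map(i => new Partition { override def index: Int = i }).toArray
  // Evaluated on the executor where the partition is scheduled
  override def compute(split: Partition, ctx: TaskContext): Iterator[Int] = {
    val size = n / slices
    val start = split.index * size
    (start until start + size).iterator
  }
}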
Dependencies
• Partition-wise dependencies
• Narrow dependency from parent to child: Many-to-one, One-to-one
• Narrow dependencies will cause computations to chain efficiently
• Shuffle dependency: Many-to-many
• Shuffle always creates a new “stage” in a job
• A shuffle writes its output completely to files before the next stage
starts (no fsync, so the OS buffer cache helps); see the lineage dump below
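The stage split is visible in an RDD's lineage dump (sc is the SparkContext):

val counts = sc.textFile("hdfs://…")
  .flatMap(_.split(" ")).map((_, 1))  // narrow: chains within one stage
  .reduceByKey(_ + _)                 // shuffle dependency: new stage
println(counts.toDebugString)         // prints lineage with stage boundaries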
Word count revisited
val textFile = spark.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .mapValues(_.sum)  // sum the grouped counts per word
counts.saveAsTextFile("hdfs://…")
reduceByKey and groupByKey
• Both cause a shuffle
• groupByKey shuffles the whole data set first
• reduceByKey shuffles only after having “reduced” each partition locally
• Prefer reduceByKey or combineByKey where possible (sketch below)
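For illustration, word count via combineByKey (like reduceByKey, it combines
map-side before shuffling; sc is the SparkContext):

val pairs = sc.textFile("hdfs://…").flatMap(_.split(" ")).map((_, 1))
val counts = pairs.combineByKey(
  (v: Int) => v,                   // createCombiner: first value seen for a key
  (c: Int, v: Int) => c + v,       // mergeValue: fold values within a partition
  (c1: Int, c2: Int) => c1 + c2)   // mergeCombiners: merge across partitions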
Spark SQL
• Familiar SQL and HiveQL interfaces.
• Catalyst engine with optimizer
• Queries like:
select avg(age), profession from people group by profession
• The Catalyst engine automatically chooses a reduceByKey-style path for
aggregates like AVG that support partial evaluation
• DataFrame API for query and operations on table data
• DataSources API to access structured data via tables
DataFrame
• Mostly syntactic sugar around RDD[Row] and schema
• Like RDDs, transformations return new DataFrames
• A DataFrame's LogicalPlan encapsulates the execution plan
• Delegates to a SparkPlan for actual Spark execution
• SparkPlan.doExecute() returns the underlying result RDD[InternalRow] (see below)
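These plans can be printed with explain; a sketch assuming the df and imports
shown on the next slide:

// Prints the logical plans and the physical SparkPlan chosen for execution
df.groupBy("profession").agg(avg("age")).explain(true)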
Example
import org.apache.spark.sql.functions.{avg, col}

context.sql("""create table person(
    Name String NOT NULL,
    Age Int NOT NULL, Profession String NOT NULL)
  using jdbc options (URL 'jdbc:gemfirexd://host:port',
    Driver 'com.pivotal.gemfirexd.jdbc.ClientDriver')""")
// Look up the new table as a DataFrame (the DDL itself returns an empty one)
val df = context.table("person")
val result = df.groupBy("profession").agg(avg("age"), col("profession"))
result.collect().foreach(println)
val result2 = context.sql(
  "select avg(age), profession from person group by profession")
result2.collect().foreach(println)
Spark Streaming
• Micro-batch processing for streaming data
• DStream[T] encapsulates an infinite sequence of RDD[T]
• Operations like foreachRDD()
• Fault-tolerance utilizing RDD resilience and streaming source resilience
(e.g. Kafka)
• Combine easily with batch and interactive queries (minimal sketch below)
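A minimal sketch (host, port and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-demo")
val ssc = new StreamingContext(conf, Seconds(1))    // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.foreachRDD(rdd => rdd.take(10).foreach(println))  // act on each micro-batch RDD
ssc.start()
ssc.awaitTermination()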