Spark’s distributed programming model
Martin Zapletal Cake Solutions
Apache Spark
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
Table of Contents
● Distributed programming introduction
● Programming models
● Dataflow systems and DAGs
● RDD
● Transformations, Actions, Persistence, Shared variables
Distributed programming
● reminder
○ unreliable network
○ ubiquitous failures
○ everything asynchronous
○ consistency, ordering and synchronisation expensive
○ local time
○ correctness properties: safety and liveness
○ ...
Two armies (generals)
● two armies, A (Red) and B (Blue)
● the separated parts A1 and A2 of army A must synchronize their attack to win
● consensus with unreliable communication channel
● no node failures, no byzantine failures, …
● designated leader
Parallel programming models
● Parallel computing models
○ Different parallel computing problems
■ Easily parallelizable or communication needed
○ Shared memory
■ On one machine
● Multiple CPUs/GPUs share memory
■ On multiple machines
● Shared memory accessed via network
● Still much slower than local memory
■ OpenMP, Global Arrays, …
○ Shared nothing
■ Processes communicate by sending messages
■ Send(), Receive()
■ MPI
○ usually no fault tolerance
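The shared-nothing style can be sketched in plain Scala. This is a minimal illustration, not MPI: the `Channel` class, `sumWorker`, and `run` are hypothetical names, and a blocking queue stands in for the network. The worker keeps its own state and only learns about data through `receive()`.

```scala
import java.util.concurrent.LinkedBlockingQueue

object MessagePassingSketch {
  // Hypothetical channel: a blocking queue stands in for a network link.
  final class Channel[A] {
    private val queue = new LinkedBlockingQueue[A]()
    def send(msg: A): Unit = queue.put(msg)
    def receive(): A = queue.take() // blocks until a message arrives
  }

  // The worker owns its running total; nothing is shared with the sender.
  // Some(n) carries a value, None signals the end of the input.
  def sumWorker(in: Channel[Option[Int]], out: Channel[Int]): Thread =
    new Thread(new Runnable {
      def run(): Unit = {
        var total = 0
        var running = true
        while (running) {
          in.receive() match {
            case Some(n) => total += n
            case None    => running = false
          }
        }
        out.send(total)
      }
    })

  // Sends all values to the worker, then waits for the single reply.
  def run(values: Seq[Int]): Int = {
    val toWorker = new Channel[Option[Int]]
    val fromWorker = new Channel[Int]
    val worker = sumWorker(toWorker, fromWorker)
    worker.start()
    values.foreach(v => toWorker.send(Some(v)))
    toWorker.send(None)
    val result = fromWorker.receive()
    worker.join()
    result
  }
}
```

Note that nothing here recovers from a lost message or a crashed worker, which is the "no fault tolerance" point of the slide.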
Dataflow system
● term used to describe general parallel programming approach
● in traditional von Neumann architecture instructions are executed sequentially by a worker (CPU) and data do not move
● in Dataflow workers have different tasks assigned to them and form an assembly line
● program represented by connections and black box operations - directed graph
● data moves between tasks
● task executed by worker as soon as inputs available
● inherently parallel
● no shared state
● closer to functional programming
● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
MapReduce
● shows that Dataflow can be expressed in terms of map and reduce
operations
● simple to parallelize
● but each map-reduce is separate from the rest
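As a rough illustration of the map and reduce phases, here is a word count written with ordinary Scala collections. It is a single-machine sketch; the function names (`mapPhase`, `shuffle`, `reducePhase`) are chosen for this example and are not Hadoop or Spark APIs.

```scala
object MapReduceSketch {
  // Map phase: each input line independently emits (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // Shuffle: bring all values that share a key together.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Reduce phase: each key is reduced independently of the others.
  def reducePhase(key: String, values: Seq[Int]): (String, Int) =
    (key, values.sum)

  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (k, vs) => reducePhase(k, vs) }
}
```

Both phases work on one record or one key at a time, which is what makes them simple to parallelize.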
Directed acyclic graph
● Spark is a Dataflow execution engine that supports cyclic (iterative) data flows by unrolling them into a DAG
● whole DAG is formed lazily
● allows global optimizations
● has the expressiveness of MPI
● lineage tracking
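Lazy DAG construction and lineage tracking can be modelled in a few lines. The sketch below is a toy, not Spark's RDD implementation: `Dataset`, `Source`, `Mapped`, and `Filtered` are hypothetical types. Calling `map` or `filter` only records a node, and work happens when `collect()` walks the graph.

```scala
object LazyDagSketch {
  sealed trait Dataset[A] {
    // Transformations only add a node to the graph; no data is touched.
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
    // The action: evaluates the graph from the source down to this node.
    def collect(): Seq[A]
    // The chain of operations needed to rebuild this dataset.
    def lineage: String
  }

  final case class Source[A](data: Seq[A]) extends Dataset[A] {
    def collect(): Seq[A] = data
    def lineage: String = "source"
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def collect(): Seq[B] = parent.collect().map(f)
    def lineage: String = parent.lineage + " -> map"
  }

  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def collect(): Seq[A] = parent.collect().filter(p)
    def lineage: String = parent.lineage + " -> filter"
  }
}
```

Because the whole graph exists before anything runs, an engine has the chance to inspect and rearrange it, and the recorded lineage is enough to recompute a lost result.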
Optimizations
● similar to optimizations of RDBMS (operation reordering, bushy
join-order enumeration, aggregation push-down)
● however DAGs less restrictive than database queries and it is
difficult to optimize UDFs (higher order functions used in Spark,
Flink)
● potentially major performance improvement
● partial support for incremental algorithm optimization (local change) with sparse computational dependencies (GraphX)
Optimizations
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))

// map first, then filter on age
sc
  .parallelize(people)
  .map(p => Person(p.age, p.height * 2.54))
  .filter(_.age < 35)

// filter on age first, then map
sc
  .parallelize(people)
  .filter(_.age < 35)
  .map(p => Person(p.age, p.height * 2.54))
Optimizations
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))

// map first, then filter on the mapped height
sc
  .parallelize(people)
  .map(p => Person(p.age, p.height * 2.54))
  .filter(_.height < 170)

// filter on the original height first, then map
sc
  .parallelize(people)
  .filter(_.height < 170)
  .map(p => Person(p.age, p.height * 2.54))

???
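Whether the filter may move ahead of the map can be checked on the same data without a cluster. The sketch below uses plain Scala collections in place of `sc.parallelize`; the helper `toCm` and the value names are introduced here for illustration.

```scala
object FilterPushdownCheck {
  final case class Person(age: Int, height: Double)

  val people: Seq[Person] = (0 to 100).map(x => Person(x, x.toDouble))

  // The map step from the slides: rescales height, leaves age untouched.
  def toCm(p: Person): Person = Person(p.age, p.height * 2.54)

  // The filter reads age, which the map does not change: order is irrelevant.
  val byAgeAfterMap: Seq[Person]  = people.map(toCm).filter(_.age < 35)
  val byAgeBeforeMap: Seq[Person] = people.filter(_.age < 35).map(toCm)

  // The filter reads height, which the map rewrites: order changes the result.
  val byHeightAfterMap: Seq[Person]  = people.map(toCm).filter(_.height < 170)
  val byHeightBeforeMap: Seq[Person] = people.filter(_.height < 170).map(toCm)
}
```

The age-based pipelines return the same 35 people. The height-based ones do not: filtering after the map keeps 67 people, filtering before it keeps all 101. Reordering is only safe when the filter reads no field that the map writes.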
Optimizations
1. logical rewriting applying rules to trees of operators (e.g. filter push down)
○ static code analysis (bytecode of each UDF) to check reordering rules
○ emits all valid reordered data flow alternatives
2. logical representation translated to physical representation
○ chooses physical execution strategies for each alternative (partitioning,
broadcasting, external sorts, merge and hash joins, …)
○ uses a cost-based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
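Step 1 can be pictured as a rewrite rule over a small operator tree. The following is a toy model and not the code of any real optimizer: the `Op` types are hypothetical, and the read and write sets that a real system would derive from bytecode analysis are simply given as arguments.

```scala
object RewriteRuleSketch {
  sealed trait Op
  case object Scan extends Op
  // `writes`: fields the map UDF modifies (assumed to be known already).
  final case class MapOp(child: Op, writes: Set[String]) extends Op
  // `reads`: fields the filter predicate inspects (assumed to be known already).
  final case class FilterOp(child: Op, reads: Set[String]) extends Op

  // Filter push-down: move a filter below a map only when the filter
  // reads none of the fields that the map writes.
  def pushDown(op: Op): Op = op match {
    case FilterOp(MapOp(child, writes), reads) if (reads intersect writes).isEmpty =>
      MapOp(pushDown(FilterOp(child, reads)), writes)
    case FilterOp(child, reads) => FilterOp(pushDown(child), reads)
    case MapOp(child, writes)   => MapOp(pushDown(child), writes)
    case Scan                   => Scan
  }
}
```

A filter on `age` above a map that writes `height` is pushed down, while a filter on `height` stays where it is. A fuller optimizer would emit every valid alternative and leave the choice to the cost model of step 2.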
Stream optimizations
● similar, because in Spark streams are just mini batches
● plus a few extra window and state operations
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
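The pseudocode above can be modelled with ordinary Scala collections by treating the stream as a sequence of mini batches. This is a sketch under that assumption; `countBatch`, `merge`, and `runningCounts` are names made up for the example and are not Spark Streaming calls.

```scala
object RunningReduceSketch {
  type Counts = Map[String, Int]

  // Per-batch step: map every URL to (url, 1) and sum the ones per URL.
  def countBatch(urls: Seq[String]): Counts =
    urls.map(u => (u, 1)).groupBy(_._1).map { case (u, ones) => u -> ones.map(_._2).sum }

  // State update: fold the new batch's counts into the running totals.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (u, n)) => acc.updated(u, acc.getOrElse(u, 0) + n) }

  // One running total per mini batch, in arrival order.
  def runningCounts(batches: Seq[Seq[String]]): Seq[Counts] =
    batches.map(countBatch).scanLeft(Map.empty[String, Int])(merge).tail
}
```

Each batch is processed like a small batch job, and only the merged counts are carried from one batch to the next.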
Performance
                     Hadoop                  Spark     Spark
Data size            102.5 TB                100 TB    1000 TB
Time [min]           72                      23        234
Nodes                2100                    206       190
Cores                50400                   6592      6080
Rate/node [GB/min]   0.67                    20.7      22.5
Environment          dedicated data center   EC2       EC2
● fastest open source solution to sort 100TB data in Daytona Gray Sort Benchmark (http://sortbenchmark.org/)
● required some improvements in shuffle approach
● heavily optimized sorting algorithm (cache locality, unsafe off-heap memory structures, GC, …)
● Databricks blog + presentation
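The Rate/node row can be reproduced approximately from the other rows as data size divided by time and by node count. The helper below is only a worked calculation and assumes 1 TB = 1000 GB.

```scala
object SortRate {
  // GB sorted per minute per node, assuming 1 TB = 1000 GB.
  def ratePerNode(sizeTb: Double, minutes: Double, nodes: Int): Double =
    sizeTb * 1000 / minutes / nodes

  val hadoop: Double    = ratePerNode(102.5, 72, 2100)  // about 0.68
  val spark100: Double  = ratePerNode(100, 23, 206)     // about 21.1
  val spark1000: Double = ratePerNode(1000, 234, 190)   // about 22.5
}
```

The middle value comes out slightly above the 20.7 in the table, which suggests the table's rate was computed from a less rounded run time than the 23 minutes shown.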
Spark programming model
● RDD
● parallelizing collections
● loading external datasets
● operations
○ transformations
○ actions
● persistence
● shared variables
RDD
● transformations
○ lazy, form the DAG
○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union,
intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup,
repartition, cartesian, glom, ...
● actions
○ execute DAG
○ retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● different categories of transformations with different complexity, performance and
semantics
● e.g. mapping, filtering, grouping, set operations, sorting, reducing, partitioning
● full list https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
Transformations with narrow deps
● map
● union
● join with copartitioned inputs
Transformations with wide deps
● groupBy
● join without copartitioned inputs
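The difference between narrow and wide dependencies can be shown on an explicit list of partitions. This is a single-machine model with hypothetical helper names, not Spark code: a dataset is a `Vector` of partitions and the "shuffle" is just a flatten.

```scala
object DependencySketch {
  type Partitions[A] = Vector[Vector[A]]

  // Narrow: output partition i is computed from input partition i alone.
  def mapPartitions[A, B](ps: Partitions[A])(f: A => B): Partitions[B] =
    ps.map(_.map(f))

  // Wide: an output partition may need records from every input partition,
  // so all records are first redistributed by key hash (the shuffle).
  def groupByKey[K, V](ps: Partitions[(K, V)], numOut: Int): Partitions[(K, Vector[V])] = {
    val all = ps.flatten
    Vector.tabulate(numOut) { i =>
      all
        .filter { case (k, _) => Math.floorMod(k.hashCode, numOut) == i }
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2) }
        .toVector
    }
  }
}
```

In the narrow case a lost partition can be rebuilt from a single parent partition. In the wide case rebuilding one output partition may require revisiting all of the parents.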
Actions - collect
● retrieves result to driver program
● no longer distributed
Actions - reduction
● associative, commutative operation
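Why the operation needs these properties shows up as soon as the data is split. In the sketch below (plain Scala, with `distributedReduce` as a made-up helper) each partition is reduced on its own and the partial results are then combined, which mirrors how a distributed reduction has to proceed.

```scala
object ReduceSketch {
  // Reduce every non-empty partition locally, then reduce the partial results.
  // Requires at least one non-empty partition.
  def distributedReduce[A](partitions: Seq[Seq[A]])(op: (A, A) => A): A =
    partitions.filter(_.nonEmpty).map(_.reduce(op)).reduce(op)

  // Addition: the result does not depend on how the data is partitioned.
  val sumA: Int = distributedReduce(Seq(Seq(1, 2), Seq(3, 4)))(_ + _)
  val sumB: Int = distributedReduce(Seq(Seq(1), Seq(2, 3, 4)))(_ + _)

  // Subtraction is neither associative nor commutative: the result
  // changes with the partitioning.
  val diffA: Int = distributedReduce(Seq(Seq(1, 2), Seq(3, 4)))(_ - _)
  val diffB: Int = distributedReduce(Seq(Seq(1), Seq(2, 3, 4)))(_ - _)
}
```

Both sums are 10, while the two differences are 0 and 6, and neither matches the sequential result of -8.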
Cache
● caches a dataset's partitions for reuse in later actions on it or on datasets derived
from it
● snapshot used instead of lineage recomputation
● fault tolerant
● cache(), persist()
● levels
○ memory
○ disk
○ both
○ serialized
○ replicated
○ off-heap
● automatic cache after shuffle
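The effect of `cache()` can be imitated with a small wrapper class. This is a toy model, not Spark's storage layer: the `Dataset` class here is hypothetical and keeps at most one in-memory snapshot.

```scala
object CacheSketch {
  // `compute` stands for replaying the full lineage of the dataset.
  final class Dataset[A](compute: () => Seq[A]) {
    private var shouldCache = false
    private var snapshot: Option[Seq[A]] = None

    // Only marks the dataset; nothing is computed or stored yet.
    def cache(): Dataset[A] = { shouldCache = true; this }

    // The action: reuse the snapshot if there is one, otherwise recompute
    // and keep the result when caching was requested.
    def collect(): Seq[A] = snapshot.getOrElse {
      val result = compute()
      if (shouldCache) snapshot = Some(result)
      result
    }
  }
}
```

Without `cache()` every action replays the lineage. With it, the first action fills the snapshot and later actions read from it.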
Shared variables - broadcast
● normally each variable used in a UDF is shipped as a separate copy to each node
● shared r/w variables would be very inefficient
● broadcast
○ read only variables
○ efficient broadcast algorithm, can deliver data cheaply to all nodes
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Shared variables - accumulators
● accumulators
○ add only
○ use an associative operation, so efficient in parallel
○ only driver program can read the value
○ exactly-once semantics guaranteed only for updates made inside actions (in case of
failure and recalculation)
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Shared variables - accumulators
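import org.apache.spark.AccumulatorParam

// Vector is assumed to be a user-defined mathematical vector class
// that provides Vector.zeros(n) and a += that returns the updated vector.
object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  def zero(initialValue: Vector): Vector = {
    Vector.zeros(initialValue.size)
  }
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    v1 += v2
  }
}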
Conclusion
● expressive and abstract programming model
● user defined functions
● based on research
● optimizations
● constraining in certain cases (spanning partition boundaries, functions of
multiple variables, ...)
Questions

More Related Content

What's hot

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Datio Big Data
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
datamantra
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
Knoldus Inc.
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
Datio Big Data
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
Cheng Lian
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
Tudor Lapusan
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
GauravBiswas9
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Spark Summit
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
Tudor Lapusan
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 

What's hot (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Viewers also liked

Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
Martin Zapletal
 
Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)
AgustinaBarreto11
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
Martin Zapletal
 
Gadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically ChallengedGadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically Challenged
Mujab Muneeb
 
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
D.CAMP
 
Deep learning review
Deep learning reviewDeep learning review
Deep learning review
Manas Gaur
 
20151223application of deep learning in basic bio
20151223application of deep learning in basic bio 20151223application of deep learning in basic bio
20151223application of deep learning in basic bio
Charlene Hsuan-Lin Her
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Impetus Technologies
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
Martin Zapletal
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed Computing
Sayed Chhattan Shah
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
Daniel Leon
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Data Con LA
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly
 
Apache Spark
Apache SparkApache Spark
Apache Spark
Mahdi Esmailoghli
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
Edge AI and Vision Alliance
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Justin Sutton
 
Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_
Jeffrey Shomaker
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
Yukti Kaura
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
Jonathan Mugan
 

Viewers also liked (20)

Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
 
Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)Reloj inteligente para ciegos (smartwatch)
Reloj inteligente para ciegos (smartwatch)
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
Gadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically ChallengedGadgets (I/O) for Disabled/Physically Challenged
Gadgets (I/O) for Disabled/Physically Challenged
 
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
[임팩트투자 디파티] 닷(dot) 최아름 팀장_D.CAMP_201607
 
Deep learning review
Deep learning reviewDeep learning review
Deep learning review
 
20151223application of deep learning in basic bio
20151223application of deep learning in basic bio 20151223application of deep learning in basic bio
20151223application of deep learning in basic bio
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed Computing
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
Magnetic Levitation - Interstate Traveler Co LLC HyRail rail-gcen-23-feb-2012
 
Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_Deep Learning Jeff-Shomaker_1-20-17_Final_
Deep Learning Jeff-Shomaker_1-20-17_Final_
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 

Similar to Apache spark - Spark's distributed programming model

Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
Lucian Neghina
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
wang xing
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
Anton Kirillov
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark
SparkSpark
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Spark Deep Dive
Spark Deep DiveSpark Deep Dive
Spark Deep Dive
Corey Nolet
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
huguk
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
Josef Niedermeier
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
Aishg4
 

Similar to Apache spark - Spark's distributed programming model (20)

Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Spark
SparkSpark
Spark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Spark Deep Dive
Spark Deep DiveSpark Deep Dive
Spark Deep Dive
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 

More from Martin Zapletal

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
Martin Zapletal
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspective
Martin Zapletal
 
Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System Optimizations
Martin Zapletal
 
Intelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsIntelligent Distributed Systems Optimizations
Intelligent Distributed Systems Optimizations
Martin Zapletal
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Martin Zapletal
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal
 

More from Martin Zapletal (6)

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspective
 
Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System Optimizations
 
Intelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsIntelligent Distributed Systems Optimizations
Intelligent Distributed Systems Optimizations
 
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...Cassandra as an event sourced journal for big data analytics Cassandra Summit...
Cassandra as an event sourced journal for big data analytics Cassandra Summit...
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 

Apache Spark - Spark's distributed programming model

  • 1. Spark’s distributed programming model Martin Zapletal Cake Solutions Apache Spark
  • 2. Apache Spark and Big Data 1) History and market overview 2) Installation 3) MLlib and machine learning on Spark 4) Porting R code to Scala and Spark 5) Concepts - Core, SQL, GraphX, Streaming 6) Spark’s distributed programming model
  • 3. Table of Contents ● Distributed programming introduction ● Programming models ● Dataflow systems and DAGs ● RDD ● Transformations, Actions, Persistence, Shared variables
  • 4. Distributed programming ● reminder ○ unreliable network ○ ubiquitous failures ○ everything asynchronous ○ consistency, ordering and synchronisation expensive ○ local time ○ correctness properties safety and liveness ○ ...
  • 5. Two armies (generals) ● two armies, A (Red) and B (Blue) ● separated parts A1 and A2 of army A must synchronize their attack to win ● consensus with unreliable communication channel ● no node failures, no Byzantine failures, … ● designated leader
  • 6. Parallel programming models ● Parallel computing models ○ Different parallel computing problems ■ Easily parallelizable or communication needed ○ Shared memory ■ On one machine ● Multiple CPUs/GPUs share memory ■ On multiple machines ● Shared memory accessed via network ● Still much slower compared to local memory ■ OpenMP, Global Arrays, … ○ Shared nothing ■ Processes communicate by sending messages ■ Send(), Receive() ■ MPI ○ usually no fault tolerance
  • 7. Dataflow system ● term used to describe general parallel programming approach ● in traditional von Neumann architecture instructions executed sequentially by a worker (cpu) and data do not move ● in Dataflow workers have different tasks assigned to them and form an assembly line ● program represented by connections and black box operations - directed graph ● data moves between tasks ● task executed by worker as soon as inputs available ● inherently parallel ● no shared state ● closer to functional programming ● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
  • 8. MapReduce ● shows that Dataflow can be expressed in terms of map and reduce operations ● simple to parallelize ● but each map-reduce is separate from the rest
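
To make the map and reduce decomposition concrete, here is a minimal word-count sketch on plain Scala collections. It only mimics the three phases on one machine; the object and method names (`MapReduceSketch`, `mapPhase`, `shuffle`, `reducePhase`) are made up for illustration and are not part of any MapReduce or Spark API.

```scala
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in one input line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: group the emitted pairs by key so that each key can be reduced independently.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: combine all values of one key with an associative operation.
  def reducePhase(values: Seq[Int]): Int = values.reduce(_ + _)

  // The whole job: map every line, shuffle by word, reduce per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (word, ones) => (word, reducePhase(ones)) }
}
```

Because `mapPhase` looks at one line at a time and `reducePhase` at one key at a time, both phases could run on different workers, which is what makes the model simple to parallelize.
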
  • 9. Directed acyclic graph ● Spark is a Dataflow execution engine that supports acyclic data flows ● whole DAG is formed lazily ● allows global optimizations ● has expressiveness of MPI ● lineage tracking
  • 10. Optimizations ● similar to optimizations of RDBMS (operation reordering, bushy join-order enumeration, aggregation push-down) ● however DAGs are less restrictive than database queries and it is difficult to optimize UDFs (higher order functions used in Spark, Flink) ● potentially major performance improvement ● partial support for incremental algorithm optimization (local change) with sparse computational dependencies (GraphX)
  • 11. Optimizations
      case class Person(age: Int, height: Double)
      val people = (0 to 100).map(x => Person(x, x))

      sc.parallelize(people)
        .map(p => Person(p.age, p.height * 2.54))
        .filter(_.age < 35)

      sc.parallelize(people)
        .filter(_.age < 35)
        .map(p => Person(p.age, p.height * 2.54))
  • 12. Optimizations
      case class Person(age: Int, height: Double)
      val people = (0 to 100).map(x => Person(x, x))

      sc.parallelize(people)
        .map(p => Person(p.age, p.height * 2.54))
        .filter(_.height < 170)

      sc.parallelize(people)
        .filter(_.height < 170)
        .map(p => Person(p.age, p.height * 2.54))

      ???
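
The question raised by the two Optimizations slides can be checked on plain Scala collections, without a cluster. The sketch below reuses `Person` and `people` from the slides; the helper `toCm` and the object name are made up here, and reading the factor 2.54 as an inch-to-centimetre conversion is an assumption.

```scala
object FilterReordering {
  case class Person(age: Int, height: Double)

  val people: Seq[Person] = (0 to 100).map(x => Person(x, x.toDouble))

  // The UDF from the slides: scales height by 2.54 (presumably inches to centimetres)
  // and leaves age untouched.
  def toCm(p: Person): Person = Person(p.age, p.height * 2.54)

  // Filter on age: the map does not modify age, so both orders give the same result.
  val ageMapThenFilter: Seq[Person] = people.map(toCm).filter(_.age < 35)
  val ageFilterThenMap: Seq[Person] = people.filter(_.age < 35).map(toCm)

  // Filter on height: the map rewrites height, so the filter sees different values
  // depending on where it runs, and the two orders disagree.
  val heightMapThenFilter: Seq[Person] = people.map(toCm).filter(_.height < 170)
  val heightFilterThenMap: Seq[Person] = people.filter(_.height < 170).map(toCm)
}
```

Pushing the filter below the map is only valid when the filter reads fields the map leaves unchanged, which is why an optimizer has to analyse what each UDF reads and writes before reordering.
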
  • 13. Optimizations 1. logical rewriting applying rules to trees of operators (e.g. filter push down) ○ static code analysis (bytecode of each UDF) to check reordering rules ○ emits all valid reordered data flow alternatives 2. logical representation translated to physical representation ○ chooses physical execution strategies for each alternative (partitioning, broadcasting, external sorts, merge and hash joins, …) ○ uses a cost based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
  • 14. Stream optimizations ● similar, because in Spark streams are just mini batches ● a few extra window, state operations
      pageViews = readStream("http://...", "1s")
      ones = pageViews.map(event => (event.url, 1))
      counts = ones.runningReduce((a, b) => a + b)
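
The running reduce over mini batches can be imitated on plain Scala collections. This is only a sketch of the idea, not the Spark Streaming API: a stream is modelled as a sequence of batches, and all names below (`MiniBatchRunningCount`, `countBatch`, `merge`, `runningCounts`) are hypothetical.

```scala
object MiniBatchRunningCount {
  // One mini batch: the page-view URLs that arrived during one interval.
  type Batch = Seq[String]

  // Per-batch step: map each event to (url, 1) and sum the ones per URL.
  def countBatch(batch: Batch): Map[String, Int] =
    batch.map(url => (url, 1)).groupBy(_._1).map { case (url, ones) => (url, ones.map(_._2).sum) }

  // Merge the state carried over from earlier batches with the counts of the new batch.
  def merge(state: Map[String, Int], delta: Map[String, Int]): Map[String, Int] =
    delta.foldLeft(state) { case (acc, (url, n)) => acc.updated(url, acc.getOrElse(url, 0) + n) }

  // Running reduce: the accumulated counts after each batch, one entry per interval.
  def runningCounts(batches: Seq[Batch]): Seq[Map[String, Int]] =
    batches.map(countBatch).scanLeft(Map.empty[String, Int])(merge).tail
}
```

Each interval is processed as an ordinary batch job, and only the small state map is carried forward, so the batch optimizations discussed earlier apply to every step.
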
  • 15. Performance
                           Hadoop                  Spark     Spark
      Data size            102.5 TB                100 TB    1000 TB
      Time [min]           72                      23        234
      Nodes                2100                    206       190
      Cores                50400                   6592      6080
      Rate/node [GB/min]   0.67                    20.7      22.5
      Environment          dedicated data center   EC2       EC2
    ● fastest open source solution to sort 100 TB of data in the Daytona Gray Sort Benchmark (http://sortbenchmark.org/)
    ● required some improvements in shuffle approach
    ● very optimized sorting algorithm (cache locality, unsafe off-heap memory structures, gc, …)
    ● Databricks blog + presentation
  • 16. Spark programming model ● RDD ● parallelizing collections ● loading external datasets ● operations ○ transformations ○ actions ● persistence ● shared variables
  • 17. RDD ● transformations ○ lazy, form the DAG ○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, repartition, cartesian, glom, ... ● actions ○ execute DAG ○ retrieve result ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ... ● different categories of transformations with different complexity, performance and semantics ● e.g. mapping, filtering, grouping, set operations, sorting, reducing, partitioning ● full list https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
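
The split between lazy transformations and eager actions can be seen in a few lines. This is a minimal sketch, assuming Spark is on the classpath and running in local mode; the object name, the app name and the `local[2]` master are choices made for this example only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyPipelineSketch {
  def run(): Int = {
    val conf = new SparkConf().setAppName("lazy-pipeline-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Transformations: nothing is computed yet, Spark only records the lineage.
      val numbers = sc.parallelize(1 to 10)
      val evens = numbers.filter(_ % 2 == 0)
      val squares = evens.map(n => n * n)

      // Prints the lineage recorded so far for `squares`.
      println(squares.toDebugString)

      // Action: triggers execution of the DAG and returns the result to the driver.
      squares.reduce(_ + _)
    } finally {
      sc.stop()
    }
  }
}
```

Calling `LazyPipelineSketch.run()` returns 220, the sum of the squares of 2, 4, 6, 8 and 10. No work is scheduled until `reduce` is called.
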
  • 18. Transformations with narrow deps ● map ● union ● join with copartitioned inputs
  • 19. Transformations with wide deps ● groupBy ● join without copartitioned inputs
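
The difference between narrow and wide dependencies can be inspected on the RDDs themselves through `RDD.dependencies`. The sketch below assumes Spark on the classpath in local mode; the object name and the sample data are made up for illustration.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyKinds {
  // Returns (map has a narrow dependency, groupByKey has a shuffle dependency).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("dependency-kinds").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

      // map: each output partition depends on exactly one parent partition.
      val mapped = pairs.map { case (k, v) => (k, v * 10) }

      // groupByKey on an unpartitioned RDD: every output partition may need data
      // from every parent partition, so a shuffle is required.
      val grouped = mapped.groupByKey()

      val mapIsNarrow = mapped.dependencies.head.isInstanceOf[NarrowDependency[_]]
      val groupIsWide = grouped.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
      (mapIsNarrow, groupIsWide)
    } finally {
      sc.stop()
    }
  }
}
```

Narrow dependencies can be pipelined within one stage on the same node, while a shuffle dependency marks a stage boundary where data moves across the network.
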
  • 20. Actions collect ● retrieves result to driver program ● no longer distributed
  • 21. Actions reduction ● associative, commutative operation
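
Why a reduction needs an associative and commutative operation can be shown without a cluster. The sketch below imitates a distributed reduce on plain Scala collections by reducing each chunk locally and then combining the partial results; `distributedReduce` is a hypothetical helper, not a Spark function, and it expects non-empty input.

```scala
object ReductionOrder {
  // Split the data into roughly `partitions` chunks, reduce each chunk locally,
  // then reduce the partial results, similar to combining work from several nodes.
  def distributedReduce(data: Seq[Int], partitions: Int)(op: (Int, Int) => Int): Int = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / partitions).toInt)
    data.grouped(chunkSize).map(_.reduce(op)).reduce(op)
  }

  val data: Seq[Int] = 1 to 8

  // Addition is associative and commutative: the partitioning does not matter.
  val sumOnePartition: Int = distributedReduce(data, 1)(_ + _)
  val sumFourPartitions: Int = distributedReduce(data, 4)(_ + _)

  // Subtraction is neither: the result depends on how the data was split.
  val diffOnePartition: Int = distributedReduce(data, 1)(_ - _)
  val diffFourPartitions: Int = distributedReduce(data, 4)(_ - _)
}
```

Both sums are 36, while the subtraction gives -34 with one partition and 2 with four, so a non-associative operation makes the result depend on the physical layout of the data.
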
  • 22. Cache ● cache partitions to be reused in next actions on it or on datasets derived from it ● snapshot used instead of lineage recomputation ● fault tolerant ● cache(), persist() ● levels ○ memory ○ disk ○ both ○ serialized ○ replicated ○ off-heap ● automatic cache after shuffle
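
The persistence levels listed on the Cache slide correspond to constants on `org.apache.spark.storage.StorageLevel`. A minimal sketch of persisting an RDD that is reused by two actions, assuming Spark on the classpath in local mode (object and app names are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)

      // Keep partitions in memory and spill to disk if they do not fit.
      // Other levels include MEMORY_ONLY (what cache() uses), DISK_ONLY,
      // MEMORY_ONLY_SER (serialized), MEMORY_ONLY_2 (replicated) and OFF_HEAP.
      squares.persist(StorageLevel.MEMORY_AND_DISK)

      // The first action computes and stores the partitions,
      // the second one can read them back instead of recomputing the map.
      val count = squares.count()
      val sum = squares.reduce(_ + _)

      squares.unpersist()
      (count, sum)
    } finally {
      sc.stop()
    }
  }
}
```

If a cached partition is lost, Spark falls back to the lineage and recomputes it, so persistence is an optimization rather than a correctness requirement.
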
  • 23. Shared variables - broadcast ● usually all variables used in UDF are copies on each node ● shared r/w variables would be very inefficient ● broadcast ○ read only variables ○ efficient broadcast algorithm, can deliver data cheaply to all nodes
      val broadcastVar = sc.broadcast(Array(1, 2, 3))
      broadcastVar.value
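
A broadcast variable is typically read inside a UDF rather than on the driver. The following sketch, assuming Spark on the classpath in local mode, ships a small lookup table to the executors once and uses it in a `map`; the table contents and the object name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  def run(): Seq[String] = {
    val conf = new SparkConf().setAppName("broadcast-lookup").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Read-only lookup table, delivered to each executor once
      // instead of being serialized with every task.
      val names = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

      sc.parallelize(Seq(1, 2, 3, 2))
        .map(n => names.value.getOrElse(n, "?")) // tasks only read the broadcast value
        .collect()
        .toSeq
    } finally {
      sc.stop()
    }
  }
}
```

`BroadcastLookup.run()` returns `Seq("one", "two", "three", "two")`. The benefit grows with the size of the table and the number of tasks that use it.
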
  • 24. Shared variables - accumulators ● accumulators ○ add only ○ use associative operation so efficient in parallel ○ only driver program can read the value ○ exactly once semantics only guaranteed for actions (in case of failure and recalculation)
      val accum = sc.accumulator(0, "My Accumulator")
      sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
      accum.value
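
The slide uses the Spark 1.x call `sc.accumulator`, which later releases deprecated in favour of `sc.longAccumulator` and related methods. The sketch below shows the same sum with the newer API; it assumes Spark 2.0 or later on the classpath in local mode, and the object and accumulator names are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  def run(): Long = {
    val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Add-only counter: tasks can add to it, only the driver reads the result.
      val total = sc.longAccumulator("total")

      // foreach is an action, so each element is counted exactly once
      // even if a task is retried.
      sc.parallelize(Seq(1L, 2L, 3L, 4L)).foreach(x => total.add(x))

      total.sum
    } finally {
      sc.stop()
    }
  }
}
```

`AccumulatorSketch.run()` returns 10. Updating the accumulator inside a transformation such as `map` instead would risk double counting when a partition is recomputed.
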
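  • 25. Shared variables - accumulators
      // Spark 1.x API. Assumes a user-defined numeric Vector class
      // with zeros, size and an in-place += that returns the vector.
      import org.apache.spark.AccumulatorParam

      object VectorAccumulatorParam extends AccumulatorParam[Vector] {
        // Neutral element: a zero vector of the same size as the initial value.
        def zero(initialValue: Vector): Vector = {
          Vector.zeros(initialValue.size)
        }
        // Merge two partial results by adding v2 into v1.
        def addInPlace(v1: Vector, v2: Vector): Vector = {
          v1 += v2
        }
      }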
  • 26. Conclusion ● expressive and abstract programming model ● user defined functions ● based on research ● optimizations ● constraining in certain cases (spanning partition boundaries, functions of multiple variables, ...)

Editor's Notes

  1. anything can fail (network, nodes, lost or damaged packets, …). Liveness properties: assert that something ‘good’ will eventually happen during execution. Safety properties: assert that nothing ‘bad’ will ever happen during an execution (that is, that the program will never enter a ‘bad’ state).
  2. HPC; shared memory may or may not be a good fit, depending on communication patterns; locks may be needed
  3. describe each - e.g. serialized, off-heap, replicated