1
Apache Spark 101
Lance Co Ting Keh
Machine Learning @ Box
2
Outline
I. Distributed Computing at a High Level
II. Disk versus Memory based Systems
III. Spark Core
I. Brief background
II. Benchmarks and Comparisons
III. What is an RDD
IV. RDD Actions and Transformations
V. Caching and Serialization
VI. Anatomy of a Program
VII. The Spark Family
3
Why Distributed Computing?
Divide and Conquer
Problem: A single machine cannot complete the
computation at hand
Solution: Parallelize the job and distribute work
among a network of machines
4
Issues Arise in Distributed Computing
View the world from the eyes of a single worker
• How do I distribute an algorithm?
• How do I partition my dataset?
• How do I maintain a single consistent view of a
shared state?
• How do I recover from machine failures?
• How do I allocate cluster resources?
• …..
5
Finding majority element in a single machine
List(20, 18, 20, 18, 20)
Think distributed
6
Finding majority element in a distributed dataset
Think distributed
List(1, 18, 1, 18, 1)
List(2, 18, 2, 18, 2)
List(3, 18, 3, 18, 3)
List(4, 18, 4, 18, 4)
List(5, 18, 5, 18, 5)
7
Finding majority element in a distributed dataset
Think distributed
List(1, 18, 1, 18, 1)
List(2, 18, 2, 18, 2)
List(3, 18, 3, 18, 3)
List(4, 18, 4, 18, 4)
List(5, 18, 5, 18, 5)
18
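The distributed counting above can be sketched in plain Python (no Spark required; the partition contents are taken from the slide): each "worker" builds a local count of its own partition, and the driver merges the counts to find the majority element.

```python
from collections import Counter

# Simulate five partitions of a distributed dataset (from the slide).
partitions = [
    [1, 18, 1, 18, 1],
    [2, 18, 2, 18, 2],
    [3, 18, 3, 18, 3],
    [4, 18, 4, 18, 4],
    [5, 18, 5, 18, 5],
]

# "Map" phase: each worker counts elements in its own partition.
local_counts = [Counter(p) for p in partitions]

# "Reduce" phase: merge the per-partition counts on the driver.
total = Counter()
for c in local_counts:
    total += c

majority, count = total.most_common(1)[0]
# majority is 18: it appears twice in every partition, 10 times overall.
```

The key point is that only small per-partition counts cross the network, never the raw data.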
8
Disk Based vs Memory Based Frameworks
• Disk Based Frameworks
‒Persists intermediate results to disk
‒Data is reloaded from disk with every query
‒Easy failure recovery
‒Best for ETL-like workloads
‒Examples: Hadoop, Dryad
Acyclic data flow: Input → Map → Reduce → Output
Image courtesy of Matei Zaharia, Introduction to Spark
9
Disk Based vs Memory Based Frameworks
• Memory Based Frameworks
‒Circumvents heavy cost of I/O by keeping intermediate
results in memory
‒Sensitive to availability of memory
‒Remembers operations applied to the dataset
‒Best for iterative workloads
‒Examples: Spark, Flink
Reuse working data set in memory: Input → iter. 1 → iter. 2 → …
Image courtesy of Matei Zaharia, Introduction to Spark
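The payoff for iterative workloads can be sketched in plain Python (an illustration only, no Spark involved): a disk-based engine would reload the data on every pass, while a memory-based one loads it once and reuses the cached working set. `load_data` is a hypothetical stand-in for an expensive disk read.

```python
def load_data():
    # Stand-in for an expensive disk read; count how often it happens.
    load_data.calls += 1
    return list(range(5))

load_data.calls = 0

# Memory-based style: load once, then iterate over the cached dataset.
data = load_data()
total = 0
for _ in range(3):        # three iterations of some algorithm
    total += sum(data)

# Despite three passes, the data was loaded from "disk" exactly once.
```

A disk-based engine would call `load_data` once per iteration instead, which is the I/O cost Spark avoids for iterative jobs.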
10
The rest of the talk
I. Spark Core
I. Brief background
II. Benchmarks and Comparisons
III. What is an RDD
IV. RDD Actions and Transformations
V. Spark Cluster
VI. Anatomy of a Program
VII. The Spark Family
11
Spark Background
• AMPLab, UC Berkeley
• Project Lead: Dr. Matei Zaharia
• First paper on RDDs was published in 2012
• Open sourced from day one, growing number of
contributors
• Released version 1.0 in May 2014; currently at 1.2.1
• Databricks was founded to support Spark and all
its related technologies. Matei currently serves as its CTO
• Amazon, Alibaba, Baidu, eBay, Groupon, Ooyala,
OpenTable, Box, Shopify, TechBase, Yahoo!
Arose from an academic setting
12
Spark versus Scalding (Hadoop)
Clear win for iterative applications
13
Ad-hoc batch queries
14
Resilient Distributed Datasets (RDDs)
• Main object in Spark’s universe
• Think of it as representing the data at that stage in
the operation
• Allows for coarse-grained transformations (e.g. map,
group-by, join)
• Allows for efficient fault recovery using lineage
‒Log one operation to apply to many elements
‒Recompute lost partitions of dataset on failure
‒No cost if nothing fails
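Lineage-based recovery can be illustrated with a minimal plain-Python sketch (not Spark's actual API): a partition records its parent data and the function that derived it, so a lost result is recomputed from that lineage rather than restored from a replica.

```python
class LineagePartition:
    """Toy partition that remembers its lineage (parent data + function)."""

    def __init__(self, parent_data, fn):
        self.parent_data = parent_data   # lineage: where the data came from
        self.fn = fn                     # lineage: how it was derived
        self.data = None                 # materialized result (may be lost)

    def compute(self):
        # Apply the recorded function to the parent data.
        self.data = [self.fn(x) for x in self.parent_data]
        return self.data

    def get(self):
        # Recompute from lineage only if the cached result was lost.
        if self.data is None:
            self.compute()
        return self.data

part = LineagePartition([1, 2, 3], lambda x: x * 10)
part.compute()
part.data = None           # simulate losing the partition on a failure
recovered = part.get()     # recomputed from lineage
```

This mirrors the bullets above: one logged operation covers many elements, only lost partitions are recomputed, and there is no cost if nothing fails.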
15
RDD Actions and Transformations
• Transformations
‒ Lazy operations applied on an RDD
‒ Creates a new RDD from an existing RDD
‒ Allows Spark to perform optimizations
‒ e.g. map, filter, flatMap, union, intersection,
distinct, reduceByKey, groupByKey
• Actions
‒ Returns a value to the driver program after
computation
‒ e.g. reduce, collect, count, first, take, saveAsTextFile
Transformations are realized when an action is called
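Python generators give a close analogue of this lazy model (an illustration, not Spark code): chaining `gen_map` and `gen_filter` only builds a pipeline, and no element is processed until an "action" such as `sum` consumes it.

```python
def gen_map(fn, data):
    # Lazy "map" transformation: yields results on demand.
    for x in data:
        yield fn(x)

def gen_filter(pred, data):
    # Lazy "filter" transformation: yields matching elements on demand.
    for x in data:
        if pred(x):
            yield x

nums = range(10)
pipeline = gen_map(lambda x: x * x,
                   gen_filter(lambda x: x % 2 == 0, nums))

# Nothing has run yet; the "action" below drives the whole computation.
result = sum(pipeline)   # squares of 0, 2, 4, 6, 8 → 0 + 4 + 16 + 36 + 64
```

Because the pipeline is only a description until an action runs, an engine like Spark can inspect and optimize the whole chain before executing it.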
16
RDD Representation
• Simple common interface:
–Set of partitions
–Preferred locations for each partition
–List of parent RDDs
–Function to compute a partition given parents
–Optional partitioning info
• Allows capturing wide range of
transformations
Slide courtesy of Matei Zaharia, Introduction to Spark
17
Spark Cluster
Driver
• Entry point of Spark application
• The main Spark application runs here
• Results of “reduce” operations are aggregated here
18
Spark Cluster
Master
• Distributed coordination of Spark workers, including:
‒ Health checking of workers
‒ Reassignment of failed tasks
‒ Entry point for job and cluster metrics
19
Spark Cluster
Worker
• Spawns executors to perform tasks on partitions of data
20
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
messages.persist()
Each Worker holds a block of the input file (Block 1, Block 2, Block 3); the Driver coordinates them.
messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
. . .
The Driver ships tasks to the workers and collects results; each worker caches its partition of messages (Msgs. 1–3). lines is the base RDD, errors and messages are transformed RDDs, and count is an action.
Result: full-text search of Wikipedia in <1 sec (vs
20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
Slide courtesy of Matei Zaharia, Introduction to Spark
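The same pipeline can be simulated in plain Python over an in-memory list (illustrative log lines of my own; no Spark or HDFS involved):

```python
# Tab-separated log lines: level, node, message (illustrative data).
lines = [
    "INFO\tnode1\tstartup complete",
    "ERROR\tnode2\tfoo failed",
    "ERROR\tnode3\tbar timeout",
    "ERROR\tnode1\tfoo retry",
]

# Keep only error lines, then extract the third (message) field --
# the same filter/map chain as the Scala snippet above.
errors = [l for l in lines if l.startswith("ERROR")]
messages = [l.split("\t")[2] for l in errors]

# Two interactive "queries" over the cached messages.
foo_count = sum("foo" in m for m in messages)
bar_count = sum("bar" in m for m in messages)
```

In Spark, `messages.persist()` keeps that intermediate list in cluster memory so repeated queries like these avoid re-reading the log from disk.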
21
The Spark Family
Cheaper by the dozen
• Aside from its performance and API, the
diverse tool set available in Spark is the
reason for its wide adoption
1. Spark Streaming
2. Spark SQL
3. MLlib
4. GraphX
22
Lambda Architecture
23
Lambda Architecture
Unified Framework
Spark Core
Spark Streaming
Spark SQL
Editor's Notes
  1. How do I distribute an algorithm – example finding max
  2. <slide> Let’s say your dataset is huge and can’t fit in one machine
  3. Suggestions how to find max? <slide> Enter map reduce paradigm
  4. Suggestions how to find max? <slide> Enter map reduce paradigm
  5. Perspective of where Spark lies in the ecosystem of compute frameworks Distributed systems have many tradeoffs: Consistency versus Latency – “eventual consistency” Exactly once processing versus Latency <slide>
  6. Perspective of where Spark lies in the ecosystem of compute frameworks Distributed systems have many tradeoffs: Consistency versus Latency – “eventual consistency” Exactly once processing versus Latency <slide>
  7. DEMO
  8. Optimization: Spark sees that a map is followed by a reduce, so it does not store the intermediate value in memory but performs the reduction directly. ** Depending on time, pull up the actions/transformations page. DEMO
  9. Key idea: add “variables” to the “functions” in functional programming DEMO