SlideShare a Scribd company logo
1 of 66
Download to read offline
Spark and Resilient Distributed Datasets
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 1 / 49
Motivation
MapReduce greatly simplified big data analysis on large, unreliable
clusters.
But as soon as it got popular, users wanted more:
• Iterative jobs, e.g., machine learning algorithms
• Interactive analytics
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 2 / 49
Motivation
Both iterative and interactive queries need one thing that MapRe-
duce lacks:
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
Motivation
Both iterative and interactive queries need one thing that MapRe-
duce lacks:
Efficient primitives for data sharing.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
Motivation
Both iterative and interactive queries need one thing that MapRe-
duce lacks:
Efficient primitives for data sharing.
In MapReduce, the only way to share data across jobs is stable
storage, which is slow.
Replication also makes the system slow, but it is necessary for fault
tolerance.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
Proposed Solution
In-Memory Data Processing and Sharing.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 4 / 49
Example
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 5 / 49
Example
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 5 / 49
Example
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 6 / 49
Example
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 6 / 49
Challenge
How to design a distributed memory abstraction
that is both fault tolerant and efficient?
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 7 / 49
Challenge
How to design a distributed memory abstraction
that is both fault tolerant and efficient?
Solution
Resilient Distributed Datasets (RDD)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 7 / 49
Resilient Distributed Datasets (RDD) (1/2)
A distributed memory abstraction.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 8 / 49
Resilient Distributed Datasets (RDD) (1/2)
A distributed memory abstraction.
Immutable collections of objects spread across a cluster.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 8 / 49
Resilient Distributed Datasets (RDD) (2/2)
An RDD is divided into a number of partitions, which are atomic
pieces of information.
Partitions of an RDD can be stored on different nodes of a cluster.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 9 / 49
Programming Model
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 10 / 49
Spark Programming Model (1/2)
Spark programming model is based on parallelizable operators.
Parallelizable operators are higher-order functions that execute user-
defined functions in parallel.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 11 / 49
Spark Programming Model (2/2)
A data flow is composed of any number of data sources, operators,
and data sinks by connecting their inputs and outputs.
Job description based on directed acyclic graphs (DAG).
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 12 / 49
Higher-Order Functions (1/3)
Higher-order functions: RDDs operators.
There are two types of RDD operators: transformations and actions.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 13 / 49
Higher-Order Functions (2/3)
Transformations: lazy operators that create new RDDs.
Actions: lunch a computation and return a value to the program or
write data to the external storage.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 14 / 49
Higher-Order Functions (3/3)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 15 / 49
RDD Transformations - Map
All pairs are independently processed.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 16 / 49
RDD Transformations - Map
All pairs are independently processed.
// passing each element through a function.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x) // {1, 4, 9}
// selecting those elements that func returns true.
val even = squares.filter(x => x % 2 == 0) // {4}
// mapping each element to zero or more others.
nums.flatMap(x => Range(0, x, 1)) // {0, 0, 1, 0, 1, 2}
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 16 / 49
RDD Transformations - Reduce
Pairs with identical key are grouped.
Groups are independently processed.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 17 / 49
RDD Transformations - Reduce
Pairs with identical key are grouped.
Groups are independently processed.
val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.reduceByKey((x, y) => x + y)
// {(cat, 3), (dog, 1)}
pets.groupByKey()
// {(cat, (1, 2)), (dog, (1))}
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 17 / 49
RDD Transformations - Join
Performs an equi-join on the key.
Join candidates are independently pro-
cessed.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 18 / 49
RDD Transformations - Join
Performs an equi-join on the key.
Join candidates are independently pro-
cessed.
val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
("about.html", "3.4.5.6"),
("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
("about.html", "About")))
visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 18 / 49
RDD Transformations - CoGroup
Groups each input on key.
Groups with identical keys are processed
together.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 19 / 49
RDD Transformations - CoGroup
Groups each input on key.
Groups with identical keys are processed
together.
val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
("about.html", "3.4.5.6"),
("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
("about.html", "About")))
visits.cogroup(pageNames)
// ("index.html", (("1.2.3.4", "1.3.3.1"), ("Home")))
// ("about.html", (("3.4.5.6"), ("About")))
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 19 / 49
RDD Transformations - Union and Sample
Union: merges two RDDs and returns a single RDD using bag se-
mantics, i.e., duplicates are not removed.
Sample: similar to mapping, except that the RDD stores a random
number generator seed for each partition to deterministically sample
parent records.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 20 / 49
Basic RDD Actions (1/2)
Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
Basic RDD Actions (1/2)
Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
Return an array with the first n elements of the RDD.
nums.take(2) // Array(1, 2)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
Basic RDD Actions (1/2)
Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
Return an array with the first n elements of the RDD.
nums.take(2) // Array(1, 2)
Return the number of elements in the RDD.
nums.count() // 3
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
Basic RDD Actions (2/2)
Aggregate the elements of the RDD using the given function.
nums.reduce((x, y) => x + y)
or
nums.reduce(_ + _) // 6
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 22 / 49
Basic RDD Actions (2/2)
Aggregate the elements of the RDD using the given function.
nums.reduce((x, y) => x + y)
or
nums.reduce(_ + _) // 6
Write the elements of the RDD as a text file.
nums.saveAsTextFile("hdfs://file.txt")
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 22 / 49
SparkContext
Main entry point to Spark functionality.
Available in shell as variable sc.
In standalone programs, you should make your own.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext(master, appName, [sparkHome], [jars])
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 23 / 49
SparkContext
Main entry point to Spark functionality.
Available in shell as variable sc.
In standalone programs, you should make your own.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext(master, appName, [sparkHome], [jars])
local
local[k]
spark://host:port
mesos://host:port
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 23 / 49
Creating RDDs
Turn a collection into an RDD.
val a = sc.parallelize(Array(1, 2, 3))
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 24 / 49
Creating RDDs
Turn a collection into an RDD.
val a = sc.parallelize(Array(1, 2, 3))
Load text file from local FS, HDFS, or S3.
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 24 / 49
Example (1/2)
Count the lines containing SICS.
val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 25 / 49
Example (1/2)
Count the lines containing SICS.
val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
Transformation
Transformation
Action
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 25 / 49
Example (2/2)
Count the lines containing SICS.
val file = sc.textFile("hdfs://...")
val count = file.filter(_.contains("SICS")).count()
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 26 / 49
Example (2/2)
Count the lines containing SICS.
val file = sc.textFile("hdfs://...")
val count = file.filter(_.contains("SICS")).count()
Transformation
Action
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 26 / 49
Example - Standalone Application (1/2)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]) {
val sc = new SparkContext("local", "SICS", "127.0.0.1",
List("target/scala-2.10/sics-count_2.10-1.0.jar"))
val file = sc.textFile("...").cache()
val count = file.filter(_.contains("SICS")).count()
}
}
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 27 / 49
Example - Standalone Application (2/2)
sics.sbt:
name := "SICS Count"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 28 / 49
Shared Variables (1/2)
When Spark runs a function in parallel as a set of tasks on different
nodes, it ships a copy of each variable used in the function to each
task.
Sometimes, a variable needs to be shared across tasks, or between
tasks and the driver program.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 29 / 49
Shared Variables (2/2)
No updates to the variables are propagated back to the driver pro-
gram.
General read-write shared variables across tasks is inefficient.
• For example, to give every node a copy of a large input dataset.
Two types of shared variables: broadcast variables and accumula-
tors.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 30 / 49
Shared Variables: Broadcast Variables
A read-only variable cached on each machine rather than shipping
a copy of it with tasks.
The broadcast values are not shipped to the nodes more than once.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-...)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 31 / 49
Shared Variables: Accumulators
They are only added.
They can be used to implement counters or sums.
Tasks running on the cluster can then add to it using the += oper-
ator.
scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
scala> accum.value
res2: Int = 10
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 32 / 49
Execution Engine
(SPARK)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 33 / 49
Spark
Spark provides a programming interface in Scala.
Each RDD is represented as an object in Spark.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 34 / 49
Spark Programming Interface
A Spark application consists of a driver program that runs the user’s
main function and executes various parallel operations on a cluster.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 35 / 49
Lineage
Lineage: transformations used to build
an RDD.
RDDs are stored as a chain of objects
capturing the lineage of each RDD.
val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 36 / 49
RDD Dependencies (1/3)
Two types of dependencies between RDDs: Narrow and Wide.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 37 / 49
RDD Dependencies: Narrow (2/3)
Narrow: each partition of a parent RDD is used by at most one
partition of the child RDD.
Narrow dependencies allow pipelined execution on one cluster node:
a map followed by a filter.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 38 / 49
RDD Dependencies: Wide (3/3)
Wide: each partition of a parent RDD is used by multiple partitions
of the child RDDs.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 39 / 49
Job Scheduling (1/2)
When a user runs an action on an RDD:
the scheduler builds a DAG of stages
from the RDD lineage graph.
A stage contains as many pipelined
transformations with narrow dependen-
cies.
The boundary of a stage:
• Shuffles for wide dependencies.
• Already computed partitions.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 40 / 49
Job Scheduling (2/2)
The scheduler launches tasks to compute
missing partitions from each stage until
it computes the target RDD.
Tasks are assigned to machines based on
data locality.
• If a task needs a partition, which is
available in the memory of a node, the
task is sent to that node.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 41 / 49
RDD Fault Tolerance (1/3)
RDDs maintain lineage information that can be used to reconstruct
lost partitions.
Logging lineage rather than the actual data.
No replication.
Recompute only the lost partitions of an RDD.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 42 / 49
RDD Fault Tolerance (2/3)
The intermediate records of wide dependencies are materialized on
the nodes holding the parent partitions: to simplify fault recovery.
If a task fails, it will be re-ran on another node, as long as its stages
parents are available.
If some stages become unavailable, the tasks are submitted to com-
pute the missing partitions in parallel.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 43 / 49
RDD Fault Tolerance (3/3)
Recovery may be time-consuming for RDDs with long lineage chains
and wide dependencies.
It can be helpful to checkpoint some RDDs to stable storage.
Decision about which data to checkpoint is left to users.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 44 / 49
Memory Management (1/2)
If there is not enough space in memory for a new computed RDD
partition: a partition from the least recently used RDD is evicted.
Spark provides three options for storage of persistent RDDs:
1 In memory storage as deserialized Java objects.
2 In memory storage as serialized Java objects.
3 On disk storage.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 45 / 49
Memory Management (2/2)
When an RDD is persisted, each node stores any partitions of the
RDD that it computes in memory.
This allows future actions to be much faster.
Persisting an RDD using persist() or cache() methods.
Different storage levels:
MEMORY ONLY
MEMORY AND DISK
MEMORY ONLY SER
MEMORY AND DISK SER
MEMORY ONLY 2, MEMORY AND DISK 2, etc.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 46 / 49
RDD Applications
Applications suitable for RDDs
• Batch applications that apply the same operation to all elements of
a dataset.
Applications not suitable for RDDs
• Applications that make asynchronous fine-grained updates to shared
state, e.g., storage system for a web application.
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 47 / 49
Summary
RDD: a distributed memory abstraction that is both fault tolerant
and efficient
Two types of operations: Transformations and Actions.
RDD fault tolerance: Lineage
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 48 / 49
Questions?
Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 49 / 49

More Related Content

What's hot

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkLi Ming Tsai
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 

What's hot (20)

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 

Viewers also liked

P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution NetworkAmir Payberah
 
Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEPAmir Payberah
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module ProgrammingAmir Payberah
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and SpannerAmir Payberah
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2Amir Payberah
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Amir Payberah
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2Amir Payberah
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformAmir Payberah
 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing FrameworksAmir Payberah
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformAmir Payberah
 

Viewers also liked (20)

Scala
ScalaScala
Scala
 
P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution Network
 
Hive and Shark
Hive and SharkHive and Shark
Hive and Shark
 
Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEP
 
MapReduce
MapReduceMapReduce
MapReduce
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
 
Security
SecuritySecurity
Security
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
 
Protection
ProtectionProtection
Protection
 
IO Systems
IO SystemsIO Systems
IO Systems
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
 
Storage
StorageStorage
Storage
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing Frameworks
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
 
Deadlocks
DeadlocksDeadlocks
Deadlocks
 

Similar to Spark

Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkEren Avşaroğulları
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 

Similar to Spark (20)

Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
 
Spark core
Spark coreSpark core
Spark core
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 

More from Amir Payberah

File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2Amir Payberah
 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1Amir Payberah
 
File System Interface
File System InterfaceFile System Interface
File System InterfaceAmir Payberah
 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2Amir Payberah
 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1Amir Payberah
 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1Amir Payberah
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2Amir Payberah
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1Amir Payberah
 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3Amir Payberah
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1Amir Payberah
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Amir Payberah
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Amir Payberah
 

More from Amir Payberah (15)

File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1
 
File System Interface
File System InterfaceFile System Interface
File System Interface
 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2
 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1
 
Main Memory - Part1
Main Memory - Part1Main Memory - Part1
Main Memory - Part1
 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
 
Threads
ThreadsThreads
Threads
 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1
 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARN
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Spark

  • 1. Spark and Resilient Distributed Datasets Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 1 / 49
  • 2. Motivation MapReduce greatly simplified big data analysis on large, unreliable clusters. But as soon as it got popular, users wanted more: • Iterative jobs, e.g., machine learning algorithms • Interactive analytics Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 2 / 49
  • 3. Motivation Both iterative and interactive queries need one thing that MapRe- duce lacks: Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
  • 4. Motivation Both iterative and interactive queries need one thing that MapRe- duce lacks: Efficient primitives for data sharing. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
  • 5. Motivation Both iterative and interactive queries need one thing that MapRe- duce lacks: Efficient primitives for data sharing. In MapReduce, the only way to share data across jobs is stable storage, which is slow. Replication also makes the system slow, but it is necessary for fault tolerance. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 3 / 49
  • 6. Proposed Solution In-Memory Data Processing and Sharing. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 4 / 49
  • 7. Example Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 5 / 49
  • 8. Example Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 5 / 49
  • 9. Example Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 6 / 49
  • 10. Example Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 6 / 49
  • 11. Challenge How to design a distributed memory abstraction that is both fault tolerant and efficient? Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 7 / 49
  • 12. Challenge How to design a distributed memory abstraction that is both fault tolerant and efficient? Solution Resilient Distributed Datasets (RDD) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 7 / 49
  • 13. Resilient Distributed Datasets (RDD) (1/2) A distributed memory abstraction. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 8 / 49
  • 14. Resilient Distributed Datasets (RDD) (1/2) A distributed memory abstraction. Immutable collections of objects spread across a cluster. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 8 / 49
  • 15. Resilient Distributed Datasets (RDD) (2/2) An RDD is divided into a number of partitions, which are atomic pieces of information. Partitions of an RDD can be stored on different nodes of a cluster. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 9 / 49
  • 16. Programming Model Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 10 / 49
  • 17. Spark Programming Model (1/2) Spark programming model is based on parallelizable operators. Parallelizable operators are higher-order functions that execute user- defined functions in parallel. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 11 / 49
  • 18. Spark Programming Model (2/2) A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs. Job description based on directed acyclic graphs (DAG). Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 12 / 49
  • 19. Higher-Order Functions (1/3) Higher-order functions: RDDs operators. There are two types of RDD operators: transformations and actions. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 13 / 49
  • 20. Higher-Order Functions (2/3) Transformations: lazy operators that create new RDDs. Actions: lunch a computation and return a value to the program or write data to the external storage. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 14 / 49
  • 21. Higher-Order Functions (3/3) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 15 / 49
  • 22. RDD Transformations - Map All pairs are independently processed. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 16 / 49
  • 23. RDD Transformations - Map All pairs are independently processed. // passing each element through a function. val nums = sc.parallelize(Array(1, 2, 3)) val squares = nums.map(x => x * x) // {1, 4, 9} // selecting those elements that func returns true. val even = squares.filter(x => x % 2 == 0) // {4} // mapping each element to zero or more others. nums.flatMap(x => Range(0, x, 1)) // {0, 0, 1, 0, 1, 2} Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 16 / 49
  • 24. RDD Transformations - Reduce Pairs with identical key are grouped. Groups are independently processed. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 17 / 49
  • 25. RDD Transformations - Reduce Pairs with identical key are grouped. Groups are independently processed. val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2))) pets.reduceByKey((x, y) => x + y) // {(cat, 3), (dog, 1)} pets.groupByKey() // {(cat, (1, 2)), (dog, (1))} Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 17 / 49
  • 26. RDD Transformations - Join Performs an equi-join on the key. Join candidates are independently pro- cessed. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 18 / 49
  • 27. RDD Transformations - Join Performs an equi-join on the key. Join candidates are independently pro- cessed. val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1"))) val pageNames = sc.parallelize(Seq(("index.html", "Home"), ("about.html", "About"))) visits.join(pageNames) // ("index.html", ("1.2.3.4", "Home")) // ("index.html", ("1.3.3.1", "Home")) // ("about.html", ("3.4.5.6", "About")) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 18 / 49
  • 28. RDD Transformations - CoGroup Groups each input on key. Groups with identical keys are processed together. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 19 / 49
  • 29. RDD Transformations - CoGroup Groups each input on key. Groups with identical keys are processed together. val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1"))) val pageNames = sc.parallelize(Seq(("index.html", "Home"), ("about.html", "About"))) visits.cogroup(pageNames) // ("index.html", (("1.2.3.4", "1.3.3.1"), ("Home"))) // ("about.html", (("3.4.5.6"), ("About"))) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 19 / 49
  • 30. RDD Transformations - Union and Sample Union: merges two RDDs and returns a single RDD using bag se- mantics, i.e., duplicates are not removed. Sample: similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 20 / 49
  • 31. Basic RDD Actions (1/2) Return all the elements of the RDD as an array. val nums = sc.parallelize(Array(1, 2, 3)) nums.collect() // Array(1, 2, 3) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
  • 32. Basic RDD Actions (1/2) Return all the elements of the RDD as an array. val nums = sc.parallelize(Array(1, 2, 3)) nums.collect() // Array(1, 2, 3) Return an array with the first n elements of the RDD. nums.take(2) // Array(1, 2) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
  • 33. Basic RDD Actions (1/2) Return all the elements of the RDD as an array. val nums = sc.parallelize(Array(1, 2, 3)) nums.collect() // Array(1, 2, 3) Return an array with the first n elements of the RDD. nums.take(2) // Array(1, 2) Return the number of elements in the RDD. nums.count() // 3 Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 21 / 49
  • 34. Basic RDD Actions (2/2) Aggregate the elements of the RDD using the given function. nums.reduce((x, y) => x + y) or nums.reduce(_ + _) // 6 Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 22 / 49
  • 35. Basic RDD Actions (2/2) Aggregate the elements of the RDD using the given function. nums.reduce((x, y) => x + y) or nums.reduce(_ + _) // 6 Write the elements of the RDD as a text file. nums.saveAsTextFile("hdfs://file.txt") Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 22 / 49
  • 36. SparkContext Main entry point to Spark functionality. Available in shell as variable sc. In standalone programs, you should make your own. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ val sc = new SparkContext(master, appName, [sparkHome], [jars]) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 23 / 49
  • 37. SparkContext Main entry point to Spark functionality. Available in shell as variable sc. In standalone programs, you should make your own. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ val sc = new SparkContext(master, appName, [sparkHome], [jars]) local local[k] spark://host:port mesos://host:port Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 23 / 49
  • 38. Creating RDDs Turn a collection into an RDD. val a = sc.parallelize(Array(1, 2, 3)) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 24 / 49
  • 39. Creating RDDs Turn a collection into an RDD. val a = sc.parallelize(Array(1, 2, 3)) Load text file from local FS, HDFS, or S3. val a = sc.textFile("file.txt") val b = sc.textFile("directory/*.txt") val c = sc.textFile("hdfs://namenode:9000/path/file") Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 24 / 49
  • 40. Example (1/2) Count the lines containing SICS. val file = sc.textFile("hdfs://...") val sics = file.filter(_.contains("SICS")) val cachedSics = sics.cache() val ones = cachedSics.map(_ => 1) val count = ones.reduce(_+_) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 25 / 49
  • 41. Example (1/2) Count the lines containing SICS. val file = sc.textFile("hdfs://...") val sics = file.filter(_.contains("SICS")) val cachedSics = sics.cache() val ones = cachedSics.map(_ => 1) val count = ones.reduce(_+_) Transformation Transformation Action Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 25 / 49
  • 42. Example (2/2) Count the lines containing SICS. val file = sc.textFile("hdfs://...") val count = file.filter(_.contains("SICS")).count() Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 26 / 49
  • 43. Example (2/2) Count the lines containing SICS. val file = sc.textFile("hdfs://...") val count = file.filter(_.contains("SICS")).count() Transformation Action Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 26 / 49
  • 44. Example - Standalone Application (1/2) import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object WordCount { def main(args: Array[String]) { val sc = new SparkContext("local", "SICS", "127.0.0.1", List("target/scala-2.10/sics-count_2.10-1.0.jar")) val file = sc.textFile("...").cache() val count = file.filter(_.contains("SICS")).count() } } Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 27 / 49
  • 45. Example - Standalone Application (2/2) sics.sbt: name := "SICS Count" version := "1.0" scalaVersion := "2.10.3" libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" resolvers += "Akka Repository" at "http://repo.akka.io/releases/" Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 28 / 49
  • 46. Shared Variables (1/2) When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 29 / 49
  • 47. Shared Variables (2/2) No updates to the variables are propagated back to the driver pro- gram. General read-write shared variables across tasks is inefficient. • For example, to give every node a copy of a large input dataset. Two types of shared variables: broadcast variables and accumula- tors. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 30 / 49
  • 48. Shared Variables: Broadcast Variables A read-only variable cached on each machine rather than shipping a copy of it with tasks. The broadcast values are not shipped to the nodes more than once. scala> val broadcastVar = sc.broadcast(Array(1, 2, 3)) broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-...) scala> broadcastVar.value res0: Array[Int] = Array(1, 2, 3) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 31 / 49
  • 49. Shared Variables: Accumulators They are only added. They can be used to implement counters or sums. Tasks running on the cluster can then add to it using the += oper- ator. scala> val accum = sc.accumulator(0) accum: spark.Accumulator[Int] = 0 scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) ... scala> accum.value res2: Int = 10 Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 32 / 49
  • 50. Execution Engine (SPARK) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 33 / 49
  • 51. Spark Spark provides a programming interface in Scala. Each RDD is represented as an object in Spark. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 34 / 49
  • 52. Spark Programming Interface A Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 35 / 49
  • 53. Lineage Lineage: transformations used to build an RDD. RDDs are stored as a chain of objects capturing the lineage of each RDD. val file = sc.textFile("hdfs://...") val sics = file.filter(_.contains("SICS")) val cachedSics = sics.cache() val ones = cachedSics.map(_ => 1) val count = ones.reduce(_+_) Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 36 / 49
  • 54. RDD Dependencies (1/3) Two types of dependencies between RDDs: Narrow and Wide. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 37 / 49
  • 55. RDD Dependencies: Narrow (2/3) Narrow: each partition of a parent RDD is used by at most one partition of the child RDD. Narrow dependencies allow pipelined execution on one cluster node: a map followed by a filter. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 38 / 49
  • 56. RDD Dependencies: Wide (3/3) Wide: each partition of a parent RDD is used by multiple partitions of the child RDDs. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 39 / 49
  • 57. Job Scheduling (1/2) When a user runs an action on an RDD: the scheduler builds a DAG of stages from the RDD lineage graph. A stage contains as many pipelined transformations with narrow dependen- cies. The boundary of a stage: • Shuffles for wide dependencies. • Already computed partitions. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 40 / 49
  • 58. Job Scheduling (2/2) The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD. Tasks are assigned to machines based on data locality. • If a task needs a partition, which is available in the memory of a node, the task is sent to that node. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 41 / 49
  • 59. RDD Fault Tolerance (1/3) RDDs maintain lineage information that can be used to reconstruct lost partitions. Logging lineage rather than the actual data. No replication. Recompute only the lost partitions of an RDD. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 42 / 49
  • 60. RDD Fault Tolerance (2/3) The intermediate records of wide dependencies are materialized on the nodes holding the parent partitions: to simplify fault recovery. If a task fails, it will be re-ran on another node, as long as its stages parents are available. If some stages become unavailable, the tasks are submitted to com- pute the missing partitions in parallel. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 43 / 49
  • 61. RDD Fault Tolerance (3/3) Recovery may be time-consuming for RDDs with long lineage chains and wide dependencies. It can be helpful to checkpoint some RDDs to stable storage. Decision about which data to checkpoint is left to users. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 44 / 49
  • 62. Memory Management (1/2) If there is not enough space in memory for a new computed RDD partition: a partition from the least recently used RDD is evicted. Spark provides three options for storage of persistent RDDs: 1 In memory storage as deserialized Java objects. 2 In memory storage as serialized Java objects. 3 On disk storage. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 45 / 49
  • 63. Memory Management (2/2) When an RDD is persisted, each node stores any partitions of the RDD that it computes in memory. This allows future actions to be much faster. Persisting an RDD using persist() or cache() methods. Different storage levels: MEMORY ONLY MEMORY AND DISK MEMORY ONLY SER MEMORY AND DISK SER MEMORY ONLY 2, MEMORY AND DISK 2, etc. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 46 / 49
  • 64. RDD Applications Applications suitable for RDDs • Batch applications that apply the same operation to all elements of a dataset. Applications not suitable for RDDs • Applications that make asynchronous fine-grained updates to shared state, e.g., storage system for a web application. Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 47 / 49
  • 65. Summary RDD: a distributed memory abstraction that is both fault tolerant and efficient Two types of operations: Transformations and Actions. RDD fault tolerance: Lineage Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 48 / 49
  • 66. Questions? Amir H. Payberah (Tehran Polytechnic) Spark 1393/8/17 49 / 49