SlideShare a Scribd company logo
1 of 57
Download to read offline
Distributed computing with
Javier Santos
April,13 - 2015
About me
Javier Santos @jpaniego
«Hay dos formas de programar sin
errores; solo la tercera funciona»
Alan J Perlis
What’s Spark?
val mySeq: Seq[Int] = 1 to 5
val action: Seq[Int] => Int =
(seq: Seq[Int]) => seq.sum
val result: Int = 15
What’s Spark?
val mySeq: Seq[Int] = 1 to Int.MaxValue
val action: Seq[Int] => Int =
(seq: Seq[Int]) => seq.sum
val result: Int = 1073741824
What’s Spark?
Computing problems:
● Time: Waiting for action results on tons of records
● Space: How do we allocate TB of data?
downloadmoreram.com
What’s Spark?
val mySeq: Seq[Int] = 1 to 99
val mySeq_1 =
1 to 33
val mySeq_3 =
67 to 99
val mySeq_2 =
34 to 66
What’s Spark?
val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum
val mySeq_1 =
1 to 33
val mySeq_3 =
67 to 99
val mySeq_2 =
34 to 66
action action
action
What’s Spark?
val result: Int = 4950
val mySeq_1 =
1 to 33
val mySeq_3 =
67 to 99
val mySeq_2 =
34 to 66
561
1650
2739
What’s Spark?
● General engine for large-scale data processing.
● “In-memory” data allocation.
● Target: data and computing parallelism.
● Mostly developed in Scala.
It also provides support for Python and Java
● First developed by AMPLab in UC Berkeley
● Support by Databricks
● Latest stable version 1.3.0
Modules
Modules
MapReduce model
● Programming model used for supporting parallel-
computing and big data sets distribution among
computing groups
● Based on two main functional programming
methods: Map and reduce.
● Hadoop
○ One of the first open source implementation
○ First implementation lead by Yahoo
MapReduce model
● Phases
Map Combiner Partitioner Reduce
MapReduce model
● Example (Map)
“123;SamsungGalaxyIII;14:05;300;34612345678”
“124;LGNexus4;16:05;121;+34613455678”
“126;LGNexus4;12:05;23;+3463624678”
“131;NokiaLumia;14:05;300;+34613246778”
“125;MotorolaG2;14:05;300;+34612345678”
“127;NokiaLumia;14:05;300;+34612345678”
“130;NokiaLumia;14:05;300;+34612345678”
(SamsungGalaxyIII, 1)
(LGNexus4, 1)
(LGNexus4, 1)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 1)
(NokiaLumia, 1)
Partitions
MapReduce model
● Example (Combiner)
(SamsungGalaxyIII, 1)
(LGNexus4, 1)
(LGNexus4, 1)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 1)
(NokiaLumia, 1)
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 2)
MapReduce model
● Example (Partitioner)
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 2)
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 3)
(MotorolaG2, 1)
MapReduce model
● Example (Reduce)
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 3)
(MotorolaG2, 1) (NokiaLumia, 3)
ReduceFunction
f(t1:(String,Int),t2:(String,Int)) =
if (t1._2 > t2._2) t1 else t2
So...what about Spark?
Deployment types
● Local
○ master=”local[N]” (N=Amount of cores)
○ Spark master is launched in the same process.
It’s not accessible from web.
○ Workers are launched in the same process.
● Standalone
○ master=”spark://master-url”
○ Spark master is launched in a cluster machine.
○ For deploying workers, it’s used ‘start-all.sh’ script.
○ Submit JAR to worker nodes.
Deployment terms
Application Master
Worker
Worker
Worker
Spark
Cluster
Deployment: Cluster managers
● YARN
○ It uses Hadoop Yarn Resource Manager
○ Spark app is launched inside a Yarn container
○ Several isolated Spark Masters running in cluster possibility.
● Mesos
○ Another cluster manager
○ Not so much used like Yarn with Spark.
○ Several isolated Spark Masters running in cluster possibility.
Execution terms
Driver
Executor
Executor
Executor
Executor
Executor
Spark
Context
Execution terms :
SparkContext & Executor
● SparkContext
○ Spark cluster connection
○ Necessary for SQLContext, StreamingContext, …
○ Isolated: Two Spark Contexts cannot exchange data (without
external help).
● Executor
○ Individual execution unit
○ 1 core ~ 2 executor
○ Each application has its owns.
Job terms
Job
Stage Stage Stage
Task Task Task Task Task
Job terms
● Job
○ Individual executed action
○ It’s composed by a set of tasks arranged in stages
● Stage
○ Job split based on last cached action or shuffle event.
● Task
○ Minimum execution unit sent to executors.
○ Task ~ Partition correlation
Modules
● Resilient Distributed Dataset. Basic abstraction in
Spark
● Think of it like a huge collection partitioned and
distributed in different machines.
● Lazy evaluated
● Basically composed of
○ A list of partitions
○ A function for computing each split
○ A list of dependencies on other RDDs
RDD
RDD
● Basic example:
val myRdd: RDD[String] = sc.parallelize(
“this is a sample string”.split(“ “).toList)
● For creating a new RDD is necessary to implement:
○ compute (How to get RDD population)
○ getDependencies (RDD lineage)
○ getPartitions (How to split the data)
RDD
● An RDD may be created from
○ A file / set of files:
sc.textFile(“myFile”)
sc.textFile(“file1,file2”)
○ A bunch of memory-storaged data:
sc.parallelize(List(1,2,3))
Another RDD:
myRdd.map(_.toString)
RDD - Partitions
● Partition : RDD chunk
● A worker node may contain 1 or more partitions of
some RDD.
● It’s important to choose a best-performance
partitioning.
● Repartitioning
○ It implies shuffling all RDD data among worker nodes
○ There will surely be tons of network traffic among all worker
nodes!
RDD - Lineage
● An RDD may be built from another one.
● Base RDD : Has no parent
● Ex:
val base: RDD[Int] = sc.parallelize(List(1,2,3,4))
val even: RDD[Int] = myRDD.filter(_%2==0)
● DAG (Directed Acyclic Graph) : Represents the RDD
lineage (inheritance, mutations, …).
● DAGScheduler: High level layer used that schedules
stages among all related RDDs
RDD - Lineage
RDD - Transformations
● It applies a change to some RDD, returning a new one.
● It’s not immediately evaluated.
● Think of it like a “Call-by-name” function.
● Adds a new node in Lineage DAG.
RDD - Transformations
● Most frequently used:
○ map
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4))
val myStrinRDD: RDD[String] = myRDD.map(_.toString)
○ flatMap
val myRDD: RDD[Int] =
sc.parallelize(List(1,null,3,null))
val myNotNullInts: RDD[Int] =
myRdd.flatMap(n => Option(n))
RDD - Transformations
● Most frequently used:
○ filter
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4))
val myStrinRDD: RDD[Int] = myRDD.filter(_%2==0)
○ union
val odd: RDD[Int] = sc.parallelize(List(1,3,5))
val even: RDD[Int] = sc.parallelize(List(2,4,6))
val all: RDD[Int] = odd.union(even)
RDD - Actions
● It launches the RDD evaluation,
○ returning some data to the driver
○ or persisting to some external storage system
● It’s evaluated partially on each worker node, and
results are merged in the Driver
● There are mechanisms to make cheaper computing
several times the same RDD (like persist() or
cache() )
RDD - Actions
● Examples:
○ count
val myRdd: Rdd[Int] = sc.parallelize(List(1,2,3))
val size: Int = myRdd.count
Counting implies processing whole RDD
○ take
val myRdd: Rdd[Int] = sc.parallelize(List(1,2,3))
val List(1,2) = myRdd.take(2).toList
○ collect
val myRdd: Rdd[Int] = sc.parallelize(List(1,2,3))
val data: Array[Int] = myRdd.collect
Beware! Executing ‘collect’ on big collections might end into a memory leak
Demo1 : Most retweeted
● Most Retweeted example
● Bunch of tweets (1000 json records)
● Find out which is the most retweeted tweet.
Key-Value RDDs
● Particular case of RDD[T] where T = (U,V)
● It allows grouping,combining,aggregating values by
some key.
● In Scala it’s only needed to
import org.apache.spark.SparkContext._
● In Java, it’s mandatory to use PairRDD class.
Key-Value RDDs
● keyBy: Generates a PairRDD
val myRDD: RDD[Int] =
sc.parallelize(List(1,2,3,4,5))
val kvRDD: RDD[(String,Int)] =
myRDD.keyBy(
n => if (n%2==0) “even” else “odd”)
“odd” -> 1, “even” -> 2, “odd” -> 3, “even” -> 4, “odd” -> 5
Key-Value RDDs
● keys: Gets keys from PairRDD
val myRDD: RDD[(String,Int)] =
sc.parallelize(“odd” -> 1,”even” -> 2,”odd” -> 3)
val keysRDD: RDD[String] = myRDD.keys
“odd”,”even”,”odd”
● values: Gets values from PairRDD
Key-Value RDDs
● mapValues: map PairRDD values, omitting keys.
val myRDD: RDD[(String,Int)] =
sc.parallelize(“odd” -> 1,”even” -> 2,”odd” -> 3)
val mapRDD: RDD[(String,String)] =
myRDD.mapValues(_.toString)
“odd” -> “1”, “even” -> “2”, “odd” -> “3”
● flatMapValues: flatMap PairRDD values
Key-Value RDDs
● join: Return a new RDD with both RDD joined by key.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","
bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,
(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)),
(3,(rat,gnu)), (3,(rat,bee)))
Key-Value RDDs - combineByKey
● combineByKey:
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C,V) => C,
mergeCombiners: (C,C) => C): RDD[(K,C)]
● Think of it as something-like-but-not a foldLeft over
each partition
Key-Value RDDs - combineByKey
● It’s composed by:
○ createCombiner( V => C): Sets the way to mutate initial RDD
[V] data into new data type used for aggregating values (C). This
will be called Combinator.
○ mergeValue((C,V) => C): Defines how to aggregate initial V
values to our Combiner type C, returning a new combiner type C.
○ mergeCombiners((C,C) => C): Defines how to merge two
combiners into a new one.
Key-Value RDDs - combineByKey
● Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","
turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x,
(x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu,
rabbit, salmon, bee, bear, wolf)))
Key-Value RDDs - aggregateByKey
● aggregateByKey : Aggregate the values of each key,
using given combine functions and a neutral “zero
value”.
● Beware! Zero value is evaluated in each partition.
def aggregateByKey[U:ClassTag](zeroValue: U)(
seqOp: (U,V) => U, combOp: (U,U) => U): RDD[(K,U)] =
combineByKey(
(v: V) => seqOp(zeroValue,v),
seqOp,
comboOp)
Key-Value RDDs - groupByKey
● groupByKey: Group all values with same key into a
Iterable of values
def groupByKey(): RDD[(Key,Iterable[Value])] =
combineByKey[List[Value]](
(v: V) => List(v),
(list: List[V],v: V) => list+= v,
(l1: List[V],l2: List[V]) => l1 ++ l2)
Note: GroupByKey actually uses CompactBuffer instead of list.
Modules
SparkSQL
● Shark successor (Berkeley)
○ Alternative to Hive + Hadoop
● Query language
○ HiveQL
○ Future: Support full SQL92
● SQLContext
val sqlContext = new SQLContext(sparkContext)
val hiveContext = new HiveContext(sparkContext)
SparkSQL
● Different datasources
○ JSON
○ Parquet
○ CSV
● New implementations
○ Cassandra
○ ElasticSearch
○ MongoDB
● Unified API in Spark 1.3.0
val students: DataFrame = sqlContext.load(
“students”,“org.apache.sql.parquet”,Map(...))
SparkSQL - DataFrame
● DataFrame = RDD[org.apache.spark.sql.Row] +
Schema
● A Row holds both column values and their types.
● This allows unifying multiple datasources with the
same API.
i.e, we could join two tables, one declared on Mongo and
another on ElasticSearch.
Demo2 : Most retweeted(SparkSQL)
● Most Retweeted Example (SparkSQL)
● Same bunch of tweets (1000 json records)
● Find out which is the most retweeted tweet using
SparkSQL
Who uses it?
Distributed computing with
Javier Santos
April 2015

More Related Content

What's hot

RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compactionMIJIN AN
 
Multi Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesMulti Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesOhyama Masanori
 
Understanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksUnderstanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksJignesh Shah
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMariaDB plc
 
Logs/Metrics Gathering With OpenShift EFK Stack
Logs/Metrics Gathering With OpenShift EFK StackLogs/Metrics Gathering With OpenShift EFK Stack
Logs/Metrics Gathering With OpenShift EFK StackJosef Karásek
 
Operating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesOperating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesJonathan Katz
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Ceph Day Beijing - SPDK for Ceph
Ceph Day Beijing - SPDK for CephCeph Day Beijing - SPDK for Ceph
Ceph Day Beijing - SPDK for CephDanielle Womboldt
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm
Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm
Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm NATS
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 

What's hot (20)

RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Multi Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesMulti Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on Kubernetes
 
Understanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksUnderstanding PostgreSQL LW Locks
Understanding PostgreSQL LW Locks
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
 
Logs/Metrics Gathering With OpenShift EFK Stack
Logs/Metrics Gathering With OpenShift EFK StackLogs/Metrics Gathering With OpenShift EFK Stack
Logs/Metrics Gathering With OpenShift EFK Stack
 
Operating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesOperating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with Kubernetes
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Ceph Day Beijing - SPDK for Ceph
Ceph Day Beijing - SPDK for CephCeph Day Beijing - SPDK for Ceph
Ceph Day Beijing - SPDK for Ceph
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Postgre sql best_practices
Postgre sql best_practicesPostgre sql best_practices
Postgre sql best_practices
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
CockroachDB
CockroachDBCockroachDB
CockroachDB
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm
Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm
Simple and Scalable Microservices: Using NATS with Docker Compose and Swarm
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 

Similar to Distributed computing with spark

Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Robert Metzger
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探台灣資料科學年會
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Austin Benson
 
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for RubyRubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for RubyPrasun Anand
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...CloudxLab
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to sparkJavier Arrieta
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
R getting spatial
R getting spatialR getting spatial
R getting spatialFAO
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Mark Smith
 
CS101- Introduction to Computing- Lecture 35
CS101- Introduction to Computing- Lecture 35CS101- Introduction to Computing- Lecture 35
CS101- Introduction to Computing- Lecture 35Bilal Ahmed
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R台灣資料科學年會
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 

Similar to Distributed computing with spark (20)

Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
 
Rubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for RubyRubyconfindia2018 - GPU accelerated libraries for Ruby
Rubyconfindia2018 - GPU accelerated libraries for Ruby
 
Neo4j: Graph-like power
Neo4j: Graph-like powerNeo4j: Graph-like power
Neo4j: Graph-like power
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
10. R getting spatial
10.  R getting spatial10.  R getting spatial
10. R getting spatial
 
R getting spatial
R getting spatialR getting spatial
R getting spatial
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
 
CS101- Introduction to Computing- Lecture 35
CS101- Introduction to Computing- Lecture 35CS101- Introduction to Computing- Lecture 35
CS101- Introduction to Computing- Lecture 35
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 

More from Javier Santos Paniego

More from Javier Santos Paniego (6)

Codemotion akka persistence, cqrs%2 fes y otras siglas del montón
Codemotion   akka persistence, cqrs%2 fes y otras siglas del montónCodemotion   akka persistence, cqrs%2 fes y otras siglas del montón
Codemotion akka persistence, cqrs%2 fes y otras siglas del montón
 
Codemotion akka voló sobre el nido del future
Codemotion   akka voló sobre el nido del futureCodemotion   akka voló sobre el nido del future
Codemotion akka voló sobre el nido del future
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Scala for dummies
Scala for dummiesScala for dummies
Scala for dummies
 
Codemotion 2014 Scala @real life
Codemotion 2014 Scala @real lifeCodemotion 2014 Scala @real life
Codemotion 2014 Scala @real life
 
Scala @ Real life
Scala @ Real lifeScala @ Real life
Scala @ Real life
 

Recently uploaded

4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 

Recently uploaded (20)

Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 

Distributed computing with spark

  • 1. Distributed computing with Javier Santos April,13 - 2015
  • 2. About me Javier Santos @jpaniego «Hay dos formas de programar sin errores; solo la tercera funciona» Alan J Perlis
  • 3. What’s Spark? val mySeq: Seq[Int] = 1 to 5 val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum val result: Int = 15
  • 4. What’s Spark? val mySeq: Seq[Int] = 1 to Int.MaxValue val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum val result: Int = 1073741824
  • 5. What’s Spark? Computing problems: ● Time: Waiting for action results on tons of records ● Space: How do we allocate TB of data?
  • 7. What’s Spark? val mySeq: Seq[Int] = 1 to 99 val mySeq_1 = 1 to 33 val mySeq_3 = 67 to 99 val mySeq_2 = 34 to 66
  • 8. What’s Spark? val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum val mySeq_1 = 1 to 33 val mySeq_3 = 67 to 99 val mySeq_2 = 34 to 66 action action action
  • 9. What’s Spark? val result: Int = 4950 val mySeq_1 = 1 to 33 val mySeq_3 = 67 to 99 val mySeq_2 = 34 to 66 561 1650 2739
  • 10. What’s Spark? ● General engine for large-scale data processing. ● “In-memory” data allocation. ● Target: data and computing parallelism. ● Mostly developed in Scala. It also provides support for Python and Java ● First developed by AMPLab in UC Berkeley ● Support by Databricks ● Latest stable version 1.3.0
  • 13. MapReduce model ● Programming model used for supporting parallel- computing and big data sets distribution among computing groups ● Based on two main functional programming methods: Map and reduce. ● Hadoop ○ One of the first open source implementation ○ First implementation lead by Yahoo
  • 14. MapReduce model ● Phases Map Combiner Partitioner Reduce
  • 15. MapReduce model ● Example (Map) “123;SamsungGalaxyIII;14:05;300;34612345678” “124;LGNexus4;16:05;121;+34613455678” “126;LGNexus4;12:05;23;+3463624678” “131;NokiaLumia;14:05;300;+34613246778” “125;MotorolaG2;14:05;300;+34612345678” “127;NokiaLumia;14:05;300;+34612345678” “130;NokiaLumia;14:05;300;+34612345678” (SamsungGalaxyIII, 1) (LGNexus4, 1) (LGNexus4, 1) (NokiaLumia, 1) (MotorolaG2, 1) (NokiaLumia, 1) (NokiaLumia, 1) Partitions
  • 16. MapReduce model ● Example (Combiner) (SamsungGalaxyIII, 1) (LGNexus4, 1) (LGNexus4, 1) (NokiaLumia, 1) (MotorolaG2, 1) (NokiaLumia, 1) (NokiaLumia, 1) (SamsungGalaxyIII, 1) (LGNexus4, 2) (NokiaLumia, 1) (MotorolaG2, 1) (NokiaLumia, 2)
  • 17. MapReduce model ● Example (Partitioner) (SamsungGalaxyIII, 1) (LGNexus4, 2) (NokiaLumia, 1) (MotorolaG2, 1) (NokiaLumia, 2) (SamsungGalaxyIII, 1) (LGNexus4, 2) (NokiaLumia, 3) (MotorolaG2, 1)
  • 18. MapReduce model ● Example (Reduce) (SamsungGalaxyIII, 1) (LGNexus4, 2) (NokiaLumia, 3) (MotorolaG2, 1) (NokiaLumia, 3) ReduceFunction f(t1:(String,Int),t2:(String,Int)) = if (t1._2 > t2._2) t1 else t2
  • 20. Deployment types ● Local ○ master=”local[N]” (N=Amount of cores) ○ Spark master is launched in the same process. It’s not accessible from web. ○ Workers are launched in the same process. ● Standalone ○ master=”spark://master-url” ○ Spark master is launched in a cluster machine. ○ For deploying workers, it’s used ‘start-all.sh’ script. ○ Submit JAR to worker nodes.
  • 22. Deployment: Cluster managers ● YARN ○ It uses Hadoop Yarn Resource Manager ○ Spark app is launched inside a Yarn container ○ Several isolated Spark Masters running in cluster possibility. ● Mesos ○ Another cluster manager ○ Not so much used like Yarn with Spark. ○ Several isolated Spark Masters running in cluster possibility.
  • 24. Execution terms : SparkContext & Executor ● SparkContext ○ Spark cluster connection ○ Necessary for SQLContext, StreamingContext, … ○ Isolated: Two Spark Contexts cannot exchange data (without external help). ● Executor ○ Individual execution unit ○ 1 core ~ 2 executor ○ Each application has its owns.
  • 25. Job terms Job Stage Stage Stage Task Task Task Task Task
  • 26. Job terms ● Job ○ Individual executed action ○ It’s composed by a set of tasks arranged in stages ● Stage ○ Job split based on last cached action or shuffle event. ● Task ○ Minimum execution unit sent to executors. ○ Task ~ Partition correlation
  • 28. ● Resilient Distributed Dataset. Basic abstraction in Spark ● Think of it like a huge collection partitioned and distributed in different machines. ● Lazy evaluated ● Basically composed of ○ A list of partitions ○ A function for computing each split ○ A list of dependencies on other RDDs RDD
  • 29. RDD ● Basic example: val myRdd: RDD[String] = sc.parallelize( “this is a sample string”.split(“ “).toList) ● For creating a new RDD is necessary to implement: ○ compute (How to get RDD population) ○ getDependencies (RDD lineage) ○ getPartitions (How to split the data)
  • 30. RDD ● An RDD may be created from ○ A file / set of files: sc.textFile(“myFile”) sc.textFile(“file1,file2”) ○ A bunch of memory-storaged data: sc.parallelize(List(1,2,3)) Another RDD: myRdd.map(_.toString)
  • 31. RDD - Partitions ● Partition : RDD chunk ● A worker node may contain 1 or more partitions of some RDD. ● It’s important to choose a best-performance partitioning. ● Repartitioning ○ It implies shuffling all RDD data among worker nodes ○ There will surely be tons of network traffic among all worker nodes!
  • 32. RDD - Lineage ● An RDD may be built from another one. ● Base RDD : Has no parent ● Ex: val base: RDD[Int] = sc.parallelize(List(1,2,3,4)) val even: RDD[Int] = myRDD.filter(_%2==0) ● DAG (Directed Acyclic Graph) : Represents the RDD lineage (inheritance, mutations, …). ● DAGScheduler: High level layer used that schedules stages among all related RDDs
  • 34. RDD - Transformations ● It applies a change to some RDD, returning a new one. ● It’s not immediately evaluated. ● Think of it like a “Call-by-name” function. ● Adds a new node in Lineage DAG.
  • 35. RDD - Transformations ● Most frequently used: ○ map val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4)) val myStrinRDD: RDD[String] = myRDD.map(_.toString) ○ flatMap val myRDD: RDD[Int] = sc.parallelize(List(1,null,3,null)) val myNotNullInts: RDD[Int] = myRdd.flatMap(n => Option(n))
  • 36. RDD - Transformations ● Most frequently used: ○ filter val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4)) val myStrinRDD: RDD[Int] = myRDD.filter(_%2==0) ○ union val odd: RDD[Int] = sc.parallelize(List(1,3,5)) val even: RDD[Int] = sc.parallelize(List(2,4,6)) val all: RDD[Int] = odd.union(even)
  • 37. RDD - Actions ● It launches the RDD evaluation, ○ returning some data to the driver ○ or persisting to some external storage system ● It’s evaluated partially on each worker node, and results are merged in the Driver ● There are mechanisms to make cheaper computing several times the same RDD (like persist() or cache() )
  • 38. RDD - Actions ● Examples: ○ count val myRdd: Rdd[Int] = sc.parallelize(List(1,2,3)) val size: Int = myRdd.count Counting implies processing whole RDD ○ take val myRdd: Rdd[Int] = sc.parallelize(List(1,2,3)) val List(1,2) = myRdd.take(2).toList ○ collect val myRdd: Rdd[Int] = sc.parallelize(List(1,2,3)) val data: Array[Int] = myRdd.collect Beware! Executing ‘collect’ on big collections might end into a memory leak
  • 39. Demo1 : Most retweeted ● Most Retweeted example ● Bunch of tweets (1000 json records) ● Find out which is the most retweeted tweet.
  • 40. Key-Value RDDs ● Particular case of RDD[T] where T = (U,V) ● It allows grouping,combining,aggregating values by some key. ● In Scala it’s only needed to import org.apache.spark.SparkContext._ ● In Java, it’s mandatory to use PairRDD class.
  • 41. Key-Value RDDs ● keyBy: Generates a PairRDD val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4,5)) val kvRDD: RDD[(String,Int)] = myRDD.keyBy( n => if (n%2==0) “even” else “odd”) “odd” -> 1, “even” -> 2, “odd” -> 3, “even” -> 4, “odd” -> 5
  • 42. Key-Value RDDs ● keys: Gets keys from PairRDD val myRDD: RDD[(String,Int)] = sc.parallelize(“odd” -> 1,”even” -> 2,”odd” -> 3) val keysRDD: RDD[String] = myRDD.keys “odd”,”even”,”odd” ● values: Gets values from PairRDD
  • 43. Key-Value RDDs ● mapValues: map PairRDD values, omitting keys. val myRDD: RDD[(String,Int)] = sc.parallelize(“odd” -> 1,”even” -> 2,”odd” -> 3) val mapRDD: RDD[(String,String)] = myRDD.mapValues(_.toString) “odd” -> “1”, “even” -> “2”, “odd” -> “3” ● flatMapValues: flatMap PairRDD values
  • 44. Key-Value RDDs ● join: Return a new RDD with both RDD joined by key. val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) val b = a.keyBy(_.length) val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf"," bear","bee"), 3) val d = c.keyBy(_.length) b.join(d).collect res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3, (dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
  • 45. Key-Value RDDs - combineByKey ● combineByKey: def combineByKey[C]( createCombiner: V => C, mergeValue: (C,V) => C, mergeCombiners: (C,C) => C): RDD[(K,C)] ● Think of it as something-like-but-not a foldLeft over each partition
  • 46. Key-Value RDDs - combineByKey ● It’s composed by: ○ createCombiner( V => C): Sets the way to mutate initial RDD [V] data into new data type used for aggregating values (C). This will be called Combinator. ○ mergeValue((C,V) => C): Defines how to aggregate initial V values to our Combiner type C, returning a new combiner type C. ○ mergeCombiners((C,C) => C): Defines how to merge two combiners into a new one.
  • 47. Key-Value RDDs - combineByKey ● Example: val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit"," turkey","wolf","bear","bee"), 3) val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3) val c = b.zip(a) val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y) d.collect res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
  • 48. Key-Value RDDs - aggregateByKey ● aggregateByKey : Aggregate the values of each key, using given combine functions and a neutral “zero value”. ● Beware! Zero value is evaluated in each partition. def aggregateByKey[U:ClassTag](zeroValue: U)( seqOp: (U,V) => U, combOp: (U,U) => U): RDD[(K,U)] = combineByKey( (v: V) => seqOp(zeroValue,v), seqOp, comboOp)
  • 49. Key-Value RDDs - groupByKey ● groupByKey: Group all values with same key into a Iterable of values def groupByKey(): RDD[(Key,Iterable[Value])] = combineByKey[List[Value]]( (v: V) => List(v), (list: List[V],v: V) => list+= v, (l1: List[V],l2: List[V]) => l1 ++ l2) Note: GroupByKey actually uses CompactBuffer instead of list.
  • 51.
  • 52. SparkSQL ● Shark successor (Berkeley) ○ Alternative to Hive + Hadoop ● Query language ○ HiveQL ○ Future: Support full SQL92 ● SQLContext val sqlContext = new SQLContext(sparkContext) val hiveContext = new HiveContext(sparkContext)
  • 53. SparkSQL ● Different datasources ○ JSON ○ Parquet ○ CSV ● New implementations ○ Cassandra ○ ElasticSearch ○ MongoDB ● Unified API in Spark 1.3.0 val students: DataFrame = sqlContext.load( “students”,“org.apache.sql.parquet”,Map(...))
  • 54. SparkSQL - DataFrame ● DataFrame = RDD[org.apache.spark.sql.Row] + Schema ● A Row holds both column values and their types. ● This allows unifying multiple datasources with the same API. i.e, we could join two tables, one declared on Mongo and another on ElasticSearch.
  • 55. Demo2 : Most retweeted(SparkSQL) ● Most Retweeted Example (SparkSQL) ● Same bunch of tweets (1000 json records) ● Find out which is the most retweeted tweet using SparkSQL