Introduction to Spark
Javier Santos
Scala Programming @ Madrid · May 11, 2015
About me
Javier Santos @jpaniego
«There are two ways to write error-free programs; only the third one works»
Alan J. Perlis
Disclaimer I
● I’m not a Spark evangelist
Disclaimer II
● I’m no expert
Disclaimer III
● I’m not buying the beers
What’s Spark?
val mySeq: Seq[Int] = 1 to 5
val action: Seq[Int] => Int =
(seq: Seq[Int]) => seq.sum
val result: Int = 15
What’s Spark?
val mySeq: Seq[Int] = 1 to Int.MaxValue
val action: Seq[Int] => Int =
(seq: Seq[Int]) => seq.sum
val result: Int = -1073741824 // silent Int overflow (after a long wait)
What’s Spark?
Computing problems:
● Time: waiting for action results over tons of records
● Space: how do we allocate terabytes of data?
downloadmoreram.com
What’s Spark?
val mySeq: Seq[Int] = 1 to 99

val mySeq_1 = 1 to 33
val mySeq_2 = 34 to 66
val mySeq_3 = 67 to 99
What’s Spark?
val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum

val mySeq_1 = 1 to 33  // action
val mySeq_2 = 34 to 66 // action
val mySeq_3 = 67 to 99 // action
What’s Spark?
val mySeq_1 = 1 to 33  // 561
val mySeq_2 = 34 to 66 // 1650
val mySeq_3 = 67 to 99 // 2739

val result: Int = 4950 // 561 + 1650 + 2739
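The same computation expressed with Spark’s RDD API — a minimal sketch, assuming an available SparkContext named sc:

// Split the data into 3 partitions; each partial sum runs in parallel
val rdd = sc.parallelize(1 to 99, numSlices = 3)
val result: Int = rdd.reduce(_ + _) // 4950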
What’s Spark?
● General engine for large-scale data processing.
● “In-memory” data allocation.
● Target: data and computing parallelism.
● Scala, Python and Java APIs.
● First developed by AMPLab at UC Berkeley.
● Supported by Databricks.
● Latest stable version: 1.3.1 (May 2015).
Modules
MapReduce model
● Programming model for parallel computation over big data sets distributed among computing clusters.
● Published by Google (2004); used for PageRank and other large matrix computations.
● Based on two classic functional programming operations: map and reduce.
● Hadoop
○ One of the first open source implementations
○ Many contributions made by Yahoo
MapReduce model
● Phases: Map → Combiner → Partitioner → Reduce
MapReduce model
● Example (Map)
Input records (partitioned):
"123;SamsungGalaxyIII;14:05;300;34612345678"
"124;LGNexus4;16:05;121;+34613455678"
"126;LGNexus4;12:05;23;+3463624678"
"131;NokiaLumia;14:05;300;+34613246778"
"125;MotorolaG2;14:05;300;+34612345678"
"127;NokiaLumia;14:05;300;+34612345678"
"130;NokiaLumia;14:05;300;+34612345678"

Map output:
(SamsungGalaxyIII, 1)
(LGNexus4, 1)
(LGNexus4, 1)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 1)
(NokiaLumia, 1)
MapReduce model
● Example (Combiner - local aggregator)
Map output (per partition):
(SamsungGalaxyIII, 1), (LGNexus4, 1), (LGNexus4, 1), (NokiaLumia, 1)
(MotorolaG2, 1), (NokiaLumia, 1), (NokiaLumia, 1)

Combiner output (locally aggregated):
(SamsungGalaxyIII, 1), (LGNexus4, 2), (NokiaLumia, 1)
(MotorolaG2, 1), (NokiaLumia, 2)
MapReduce model
● Example (Partitioner)
Combiner output:
(SamsungGalaxyIII, 1), (LGNexus4, 2), (NokiaLumia, 1)
(MotorolaG2, 1), (NokiaLumia, 2)

Partitioner output (same key routed to the same reducer):
(SamsungGalaxyIII, 1), (LGNexus4, 2), (NokiaLumia, 3), (MotorolaG2, 1)
MapReduce model
● Example (Reduce)
Input:
(SamsungGalaxyIII, 1), (LGNexus4, 2), (NokiaLumia, 3), (MotorolaG2, 1)

ReduceFunction:
f(t1: (String,Int), t2: (String,Int)) =
  if (t1._2 > t2._2) t1 else t2

Output: (NokiaLumia, 3)
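As an illustration only (plain Scala collections, not Hadoop code), the whole pipeline can be sketched like this:

// Simulate map / combine / reduce locally
val records = List(
  "123;SamsungGalaxyIII;14:05;300;34612345678",
  "124;LGNexus4;16:05;121;+34613455678",
  "131;NokiaLumia;14:05;300;+34613246778")

// Map: emit (model, 1) for each record
val mapped = records.map(r => (r.split(";")(1), 1))

// Combine: sum the counts per key
val counts = mapped.groupBy(_._1).mapValues(_.map(_._2).sum).toMap

// Reduce: keep the model with the highest count
val winner = counts.maxBy(_._2) // (model, count)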
So...what about Spark?
Deployment terms
Driver → Master → Workers (the Spark cluster)
Deployment types
● Local
○ master = "local[N]" (N = number of cores)
○ Spark master is launched in the same process.
○ Workers are launched in the same process.
● Standalone
○ master = "spark://master-url"
○ Spark master is launched on a cluster machine.
○ The application JAR is submitted to the worker nodes.
○ Pluggable resource managers: YARN, Mesos
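A minimal sketch of wiring the master URL when creating the context (the URLs are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: 4 worker threads inside this JVM
val conf = new SparkConf().setAppName("demo").setMaster("local[4]")
// Standalone mode instead: .setMaster("spark://master-host:7077")

val sc = new SparkContext(conf)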
Execution terms
Driver (holds the SparkContext) → Executors
Execution terms : SparkContext & Executor
● SparkContext
○ Spark cluster connection
○ Necessary for SQLContext, StreamingContext, …
○ Isolated: Two Spark Contexts cannot exchange data (without external help).
● Executor
○ Individual execution unit
○ 1 core ~ 2 executors
○ Each application has its own executors.
Job terms
Job → Stages → Tasks
(a job splits into stages, and each stage into tasks)
Modules
RDD
● Resilient Distributed Dataset: the basic abstraction in Spark.
● Think of it as a huge collection, partitioned and distributed across different machines.
● Lazily evaluated.
● Basically composed of:
○ A list of partitions
○ A function for computing each split
○ A list of dependencies on other RDDs
RDD
● Basic example:
val myRdd: RDD[String] = sc.parallelize(
  "this is a sample string".split(" ").toList)

val anotherRdd: RDD[(Int,String)] = sc.parallelize(
  List(
    0 -> "A value for key '0'",
    1 -> "Another value"))
RDD
● An RDD may be created from:
○ A file / set of files:
sc.textFile("myFile")
sc.textFile("file1,file2")
○ A bunch of in-memory data:
sc.parallelize(List(1,2,3))
○ Another RDD:
myRdd.map(_.toString)
RDD - Partitions
● Partition : RDD chunk
● A worker node may contain 1 or more partitions of some RDD.
● It’s important to choose a partitioning scheme that performs well.
● Repartitioning
○ It implies shuffling all RDD data among worker nodes
○ There will surely be tons of network traffic among all worker nodes!
RDD - Lineage
● An RDD may be built from another one.
● Base RDD : Has no parent
● Ex:
val base: RDD[Int] = sc.parallelize(List(1,2,3,4))
val even: RDD[Int] = base.filter(_%2==0)
● DAG (Directed Acyclic Graph) : Generalization of MapReduce
model.
Manages RDD lineages (source, mutations, …).
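The lineage can be inspected with toDebugString (a real RDD method; the exact output format varies by version):

println(even.toDebugString)
// e.g.:
// (4) FilteredRDD[1] at filter ...
//  |  ParallelCollectionRDD[0] at parallelize ...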
RDD - Lineage
RDD - Transformations
● Applies a change to some RDD, returning a new one.
● Not immediately evaluated.
● Think of it like a “call-by-name” function.
● Adds a new node in the lineage DAG.
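A small sketch of that laziness, assuming a SparkContext sc — nothing runs until the action:

val numbers = sc.parallelize(1 to 10)

// Transformation: only recorded in the lineage DAG
val squares = numbers.map(n => n * n)

// Action: triggers evaluation of the whole lineage
val total = squares.reduce(_ + _) // 385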
RDD - Transformations
● Most frequently used:
○ map
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4))
val myStringRDD: RDD[String] = myRDD.map(_.toString)
○ flatMap
val myRDD: RDD[String] =
  sc.parallelize(List("a", null, "c", null))
val myNotNulls: RDD[String] =
  myRDD.flatMap(s => Option(s)) // nulls become None and are dropped
RDD - Transformations
● Most frequently used:
○ filter
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4))
val myEvenRDD: RDD[Int] = myRDD.filter(_ % 2 == 0)
○ union
val odd: RDD[Int] = sc.parallelize(List(1,3,5))
val even: RDD[Int] = sc.parallelize(List(2,4,6))
val all: RDD[Int] = odd.union(even)
RDD - Actions
● Launches the RDD evaluation,
○ returning some data to the driver,
○ or persisting it to some external storage system.
● Evaluated partially on each worker node; partial results are merged in the driver.
● There are mechanisms (like persist() or cache()) that make computing the same RDD several times cheaper.
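A sketch of reusing a cached RDD across two actions (persist and cache are real RDD methods; the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("events.log") // placeholder path
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

val howMany = errors.count() // 1st action: computes and caches the partitions
val sample = errors.take(10) // 2nd action: served from the cache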
RDD - Actions
● Examples:
○ count
val myRdd: RDD[Int] = sc.parallelize(List(1,2,3))
val size: Long = myRdd.count
Counting implies processing the whole RDD.
○ take
val myRdd: RDD[Int] = sc.parallelize(List(1,2,3))
val List(first, second) = myRdd.take(2).toList
○ collect
val myRdd: RDD[Int] = sc.parallelize(List(1,2,3))
val data: Array[Int] = myRdd.collect
Beware! Executing ‘collect’ on big collections may run the driver out of memory.
Demo1 : Most retweeted
● Most Retweeted example
● Bunch of tweets (1,000 JSON records)
● Find out which is the most retweeted tweet (a hedged sketch follows).
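The demo code is not in the deck; as a sketch, assume one JSON tweet per line with a numeric retweet_count field, extracted naively with a regex (no JSON library):

// Hypothetical input path, one JSON object per line
val tweets = sc.textFile("tweets.json")

// Naive extraction of "retweet_count": <n> (illustration only)
val retweets = """"retweet_count"\s*:\s*(\d+)""".r

val mostRetweeted = tweets
  .flatMap(line => retweets.findFirstMatchIn(line).map(m => (m.group(1).toLong, line)))
  .reduce((a, b) => if (a._1 >= b._1) a else b)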
Key-Value RDDs
● Particular case of RDD[T] where T = (K,V)
● Allows grouping, combining and aggregating values by some key.
● In Scala you only need to
import org.apache.spark.SparkContext._
● In Java, it’s mandatory to use the JavaPairRDD class.
Key-Value RDDs
● keyBy: Generates a PairRDD
val myRDD: RDD[Int] =
  sc.parallelize(List(1,2,3,4,5))
val kvRDD: RDD[(String,Int)] =
  myRDD.keyBy(n => if (n % 2 == 0) "even" else "odd")

// "odd" -> 1, "even" -> 2, "odd" -> 3, "even" -> 4, "odd" -> 5
Key-Value RDDs
● keys: Gets keys from PairRDD
val myRDD: RDD[(String,Int)] =
  sc.parallelize(List("odd" -> 1, "even" -> 2, "odd" -> 3))
val keysRDD: RDD[String] = myRDD.keys
// "odd", "even", "odd"
● values: Gets values from a PairRDD (sketch below)
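By symmetry, a minimal sketch of values:

val myRDD: RDD[(String,Int)] =
  sc.parallelize(List("odd" -> 1, "even" -> 2, "odd" -> 3))
val valuesRDD: RDD[Int] = myRDD.values
// 1, 2, 3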
Key-Value RDDs
● mapValues: map PairRDD values, omitting keys.
val myRDD: RDD[(String,Int)] =
  sc.parallelize(List("odd" -> 1, "even" -> 2, "odd" -> 3))
val mapRDD: RDD[(String,String)] =
  myRDD.mapValues(_.toString)
// "odd" -> "1", "even" -> "2", "odd" -> "3"
● flatMapValues: flatMap PairRDD values (sketch below)
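A minimal sketch of flatMapValues — each value expands to zero or more values, keeping its key:

val ranges: RDD[(String,Int)] =
  sc.parallelize(List("a" -> 2, "b" -> 3))
val expanded: RDD[(String,Int)] = ranges.flatMapValues(n => 1 to n)
// "a" -> 1, "a" -> 2, "b" -> 1, "b" -> 2, "b" -> 3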
Key-Value RDDs
● join: Return a new RDD with both RDD joined by key.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(
  List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect

res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)),
(6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)),
(3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
Key-Value RDDs - combineByKey
● combineByKey:
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C,V) => C,
mergeCombiners: (C,C) => C): RDD[(K,C)]
● Think of it as something like (but not exactly) a foldLeft over each partition.
Key-Value RDDs - combineByKey
● It takes three functions:
○ createCombiner (V => C): defines how to turn an initial RDD[V] value into the type used for aggregating values (C). This C is called the combiner.
○ mergeValue ((C,V) => C): defines how to aggregate an initial V value into a combiner C, returning a new combiner C.
○ mergeCombiners ((C,C) => C): defines how to merge two combiners into a new one.
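A classic illustration (not from the deck): per-key averages, carrying (sum, count) as the combiner:

val scores: RDD[(String,Double)] =
  sc.parallelize(List("a" -> 1.0, "a" -> 3.0, "b" -> 2.0))

val sumCounts = scores.combineByKey(
  (v: Double) => (v, 1),                                       // createCombiner
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1), // mergeValue
  (x: (Double, Int), y: (Double, Int)) => (x._1 + y._1, x._2 + y._2)) // mergeCombiners

val averages: RDD[(String,Double)] =
  sumCounts.mapValues { case (sum, count) => sum / count }
// "a" -> 2.0, "b" -> 2.0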
Key-Value RDDs - combineByKey
● Example:
val a = sc.parallelize(
  List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(
  List(_),
  (x: List[String], y: String) => y :: x,
  (x: List[String], y: List[String]) => x ::: y)
d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)),
(2,List(gnu, rabbit, salmon, bee, bear, wolf)))
Key-Value RDDs - aggregateByKey
● aggregateByKey: Aggregates the values of each key, using the given
combine functions and a neutral “zero value”.
● Beware! Zero value is evaluated in each partition.
def aggregateByKey[U: ClassTag](zeroValue: U)(
    seqOp: (U,V) => U, combOp: (U,U) => U): RDD[(K,U)] =
  combineByKey(
    (v: V) => seqOp(zeroValue, v),
    seqOp,
    combOp)
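A small usage sketch — per-key sums with zero value 0:

val pairs: RDD[(String,Int)] =
  sc.parallelize(List("a" -> 1, "a" -> 2, "b" -> 3))

// seqOp folds values inside each partition; combOp merges partition results
val sums: RDD[(String,Int)] = pairs.aggregateByKey(0)(_ + _, _ + _)
// "a" -> 3, "b" -> 3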
Key-Value RDDs - groupByKey
● groupByKey: Groups all values with the same key into an Iterable of values.
def groupByKey(): RDD[(K, Iterable[V])] =
  combineByKey[List[V]](
    (v: V) => List(v),
    (list: List[V], v: V) => v :: list,
    (l1: List[V], l2: List[V]) => l1 ++ l2)
Note: groupByKey actually uses a CompactBuffer instead of a List.
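Usage sketch; note that when values are reduced afterwards, reduceByKey is usually cheaper because it combines before shuffling:

val pairs: RDD[(String,Int)] =
  sc.parallelize(List("a" -> 1, "a" -> 2, "b" -> 3))

val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()
// "a" -> Iterable(1, 2), "b" -> Iterable(3)

// Often what you actually want, with far less shuffling:
val summed: RDD[(String,Int)] = pairs.reduceByKey(_ + _)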
Modules
SparkSQL
● Successor of Shark (Berkeley)
○ Alternative to Hive + Hadoop
● Query language
○ HiveQL
○ Future: support for full SQL-92
● SQLContext
val sqlContext = new SQLContext(sparkContext)
val hiveContext = new HiveContext(sparkContext)
SparkSQL
● Different datasources
○ JSON
○ Parquet
○ CSV
● New implementations
○ Cassandra
○ ElasticSearch
○ MongoDB
● Unified API in Spark 1.3.0
val students: DataFrame = sqlContext.load(
  "students", "org.apache.spark.sql.parquet", Map(...))
SparkSQL - DataFrame
● DataFrame = RDD[org.apache.spark.sql.Row] + Schema
● A Row holds the column values; the schema describes their names and types.
● This allows unifying multiple datasources under the same API:
e.g., we could join two tables, one stored in Mongo and another
in ElasticSearch.
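A minimal sketch of the Spark 1.3 DataFrame API (jsonFile, printSchema, select and show exist in that version; the path is a placeholder):

import org.apache.spark.sql.{DataFrame, SQLContext}

val sqlContext = new SQLContext(sc)

val tweets: DataFrame = sqlContext.jsonFile("tweets.json") // schema is inferred
tweets.printSchema()

// Same DSL regardless of the underlying datasource
tweets.select("text").show()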
Demo2 : Most retweeted(SparkSQL)
● Most Retweeted example (SparkSQL)
● Same bunch of tweets (1,000 JSON records)
● Find out which is the most retweeted tweet using SparkSQL (a hedged sketch follows).
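The demo code is not in the deck; a sketch using a temp table, assuming the tweets DataFrame above and a retweet_count column:

tweets.registerTempTable("tweets")

val mostRetweeted = sqlContext.sql(
  """SELECT text, retweet_count
    |FROM tweets
    |ORDER BY retweet_count DESC
    |LIMIT 1""".stripMargin)

mostRetweeted.show()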
Who uses it?
Let’s talk about...
● Spark streaming
● Machine learning (MLlib)
● GraphX vs Neo4j
● Spark vs Hadoop
● ?
Editor's Notes

  1. Boxes = partitions
  2. Blue: RDD; gray: cached partition