Apache Spark
Architecture
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
RDD - Resilient Distributed Dataset
● A big collection of data with the following properties (see the sketch below)
○ Immutable
○ Lazy evaluated
○ Cacheable
○ Type inferred
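A minimal sketch of these properties using the Spark Scala API (illustrative, not from the original slides; assumes a local master):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("rdd-properties").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val rdd = sc.parallelize(Seq(1, 2, 4, 5)) // type inferred as RDD[Int]
  val plusOne = rdd.map(_ + 1)              // lazy: nothing is computed yet; rdd itself never changes
  plusOne.cache()                           // cacheable: keep the result around after the first computation
  plusOne.count()                           // an action finally triggers evaluation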
Immutability
● Immutability means once created, it never changes
● Big data is, by default, immutable in nature
● Immutability helps with
○ Parallelism
○ Caching
Immutability in action
● const int a = 0; // immutable
int b = 0; // mutable
● Updating
○ b++; // in place
○ int c = a + 1; // new value; a is untouched
● Immutability is about the value, not the reference
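The same idea in Scala, where val and var make the distinction explicit (an illustrative sketch, not from the deck):

  val a = 0      // immutable: reassigning a is a compile error
  var b = 0      // mutable
  b += 1         // updated in place
  val c = a + 1  // produces a new value; a stays 0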
Immutability in collections
● Mutable:
  var collection = Array(1, 2, 4, 5)
  for (i <- collection.indices) {
    collection(i) += 1
  }
○ Uses a loop for updating
○ collection is updated in place
● Immutable:
  val collection = Array(1, 2, 4, 5)
  val newCollection = collection.map(value => value + 1)
○ Uses a transformation for the change
○ Creates a new copy of the collection; leaves collection intact
Multiple transformations
● Mutable:
  var collection = Array(1, 2, 4, 5)
  for (i <- collection.indices) {
    collection(i) += 1
  }
  for (i <- collection.indices) {
    collection(i) += 2
  }
● Immutable:
  val collection = Array(1, 2, 4, 5)
  val newCollection = collection.map(value => value + 1)
  val nextNewCollection = newCollection.map(value => value + 2)
○ Creates 3 copies of the same collection and takes multiple passes over the data
Challenges of immutability
● Immutability is great for parallelism but not good for space
● Doing multiple transformations results in
○ Multiple copies of the data
○ Multiple passes over the data
● In big data, multiple copies and multiple passes lead to poor performance
Let’s get lazy
● Laziness means not computing a transformation until its result is needed
● Laziness defers evaluation
● Laziness allows separating execution from evaluation
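Plain Scala illustrates deferred evaluation with lazy val (a sketch, not from the slides):

  lazy val answer = { println("computing..."); 41 + 1 } // nothing runs here
  println(answer) // prints "computing..." then 42: computed on first use
  println(answer) // prints 42 only: the value was computed once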
Laziness in action
● val c1 = collection.map(value => value + 1) // does not compute anything
● val c2 = c1.map(value => value + 2) // still nothing computed
● print(c2)
Now these transform into...
Transformation
● val c2 = collection.map(value => {
    var result = value + 1
    result = result + 2
    result
  })
● Multiple transformations are combined into one
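One way to see this fusion in plain Scala is a lazy view: both functions run back to back in a single pass per element (an illustrative sketch, not from the slides):

  val collection = Array(1, 2, 4, 5)
  val c2 = collection.view
    .map(value => value + 1) // deferred
    .map(value => value + 2) // deferred, fused with the previous map
    .toArray                 // one pass materializes the result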
Laziness and Immutability
● You can be lazy only if the underlying data is immutable
● You cannot combine transformations if a transformation has side effects
● So combining laziness and immutability gives better performance and enables distributed processing
Challenges of Laziness
● Laziness poses challenges in terms of data types
● If laziness defers execution, determining the type of a variable becomes challenging
● If we can't determine the right type, semantic issues can creep in
● Running big data programs only to hit semantic errors is no fun
Type inference
● Type inference is the part of the compiler that determines the type of an expression from its value
● As all the transformations are side-effect free, we can determine the type from the operation
● Every transformation has a specific return type
● Type inference relieves you from thinking about the representation for many transformations
Type inference in action
● val collection = Array(1, 2, 4, 5) // explicit type: Array[Int]
● val c1 = collection.map(value => value + 1) // still Array[Int]
● val c2 = c1.map(value => value + 2) // still Array[Int]
● val c3 = c2.size // inferred as Int
● val c4 = c3.map(value => value + 3) // compile error: you cannot map over an Int
Caching
● Immutable data allows you to cache data for a long time
● Lazy transformations allow recreating data on failure
● Transformations can also be saved
● Caching data improves the execution engine's performance
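A minimal Spark caching sketch (illustrative; the SparkContext sc and the HDFS path are assumed):

  val lines = sc.textFile("hdfs:///data/input.txt") // lazy: nothing is read yet
  val lengths = lines.map(_.length)
  lengths.cache()       // mark for caching; materialized by the first action
  lengths.count()       // first action computes and caches the RDD
  lengths.reduce(_ + _) // served from the cache; on failure, lineage recomputes lost partitions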
Spark running architecture
[Diagram: the Spark driver program submits work to the Spark scheduler, which runs transformations on worker nodes through a cluster manager (Mesos / YARN); data lives in HDFS and NoSQL stores.]
Components
● Spark driver (context)
● Spark DAG scheduler
● Cluster management systems
○ YARN
○ Apache Mesos
● Data sources
○ In memory
○ HDFS
○ NoSQL
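A minimal driver program wiring these components together (a hypothetical sketch; the master URL and input path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  object MiniApp {
    def main(args: Array[String]): Unit = {
      // Spark driver: owns the SparkContext and builds the DAG of transformations
      val conf = new SparkConf()
        .setAppName("mini-app")
        .setMaster("yarn") // cluster manager: YARN (mesos://... or local[*] work too)
      val sc = new SparkContext(conf)

      // Data source: HDFS (in-memory collections and NoSQL stores work similarly)
      val data = sc.textFile("hdfs:///data/events.log")

      // The DAG scheduler splits this lineage into stages and ships tasks to the workers
      val errors = data.filter(_.contains("ERROR")).count()
      println(s"errors = $errors")

      sc.stop()
    }
  }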
