Intro to Apache Spark and Scala for Big Data

Intro to Apache Spark: Fast cluster computing engine for Hadoop
Intro to Scala: Object-oriented and functional language for the Java Virtual Machine
ACM SIGKDD, 7/9/2014
Roger Huang
Lead System Architect
rohuang@visa.com
rog4096@yahoo.com
@BigDataWrangler

About me: Roger Huang
• Visa
– Digital & Mobile Products Architecture, Strategic Projects &
infrastructure
– Search infrastructure
– Customer segmentation
– Logging Framework
– Splunk on Hadoop (Hunk)
– Real-time monitoring
– Data
• PayPal
– Java Infrastructure

Different perspectives on an elephant Scala

Outline
• Spark
– Hadoop ecosystem
• Scala
– Background
• Why Scala?
– For the computer scientist
– For the Java / OO programmer
– For the Spark developer
– For the Big Data developer
– For the Big Data scientist / mathematician
– For the system architect

Spark in the Hadoop ecosystem

Spark Ecosystem of Software Projects
• Spark [Ognen]
– APIs: Scala, Python [Robert], Java
• “SQL”
– Shark (Hive + Spark) [Roger]
– SparkSQL (alpha)
• Machine Learning Library (MLlib) [Omar]
– Clustering
– Classification
• binary classification
• Linear regression
– recommendations
• Spark Streaming [Chance]
• GraphX [Srini]
• …

Resilient Distributed Dataset
• Fault tolerant collection of elements partitioned across the
nodes of the cluster that can be operated on in parallel
• Data sources for RDDs
– Parallelized collections
• From Scala collections
– Hadoop datasets
• From HDFS or any Hadoop-supported storage system (HBase, Amazon S3, …)
• Text files, SequenceFile, any Hadoop InputFormat
• Two types of operations
– Transformation
• takes an existing dataset and creates a new one
– Action
• takes a dataset, runs a computation, and returns a value to the driver program
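The transformation/action split can be sketched without a cluster. The snippet below uses a lazy view over a plain Scala collection as a stand-in for an RDD. It is an analogy only (no Spark API is called), and the names `evaluated`, `doubled` and `total` are illustrative assumptions.

```scala
// Stand-in for an RDD: a lazy view over a local collection (analogy only, not Spark).
var evaluated = 0                      // counts how many elements were actually processed

val data = List(1, 2, 3, 4)

// "Transformation": describes a new dataset; nothing is computed yet.
val doubled = data.view.map { x => evaluated += 1; x * 2 }
val evaluatedBeforeAction = evaluated  // still 0, the map has not run

// "Action": forces the computation and returns a value to the caller.
val total = doubled.sum
val evaluatedAfterAction = evaluated   // 4, one call per element
```

The counter is a deliberate side effect, used here only to make the deferred evaluation visible.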

(Some) RDD Operations
• Transformations
– map(func)
– filter(func)
– flatMap(func)
– mapPartitions(func)
– mapPartitionsWithIndex(func)
– sample(withReplacement, fraction, seed)
– union(otherDataset)
– distinct()
– groupByKey()
– reduceByKey(func)
– sortByKey()
– join(otherDataset)
– cogroup(otherDataset)
– cartesian(otherDataset)
• Actions
– reduce(func)
– collect()
– count()
– first()
– take(n)
– takeSample(withReplacement, num, seed)
– saveAsTextFile(path)
– saveAsSequenceFile(path)
– countByKey()
– foreach(func)
– …
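Several of these operations share their names with methods on ordinary Scala collections. As a rough local sketch, assuming plain collections in place of an RDD (no Spark API is used, and the sample lines are made up), a word count built from flatMap, map and a hand-rolled equivalent of reduceByKey might look like this:

```scala
// Plain Scala collections standing in for an RDD of lines (analogy only, not Spark).
val lines = List("spark and scala", "scala on the jvm", "spark on hadoop")

// Transformations: flatMap splits lines into words, map builds (key, value) pairs.
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Local equivalent of reduceByKey(_ + _): group pairs by key, then sum the values per key.
val counts: Map[String, Int] =
  pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

// Action-like steps: count the distinct keys and take the most frequent words first
// (ties are broken alphabetically).
val distinctWords = counts.size
val top = counts.toList.sortBy { case (word, n) => (-n, word) }.take(3)
```

On a real RDD the grouping and summing would be a single reduceByKey call, distributed across partitions.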

Scala background
• Scalable, object-oriented, functional language
– Version 2.11 (4/2014)
• Runs on the Java Virtual Machine
• Martin Odersky
– javac
– Java generics
• http://scala-lang.org/, REPL
• http://www.scala-lang.org/api/current
• http://scala-ide.org/
• http://www.scala-sbt.org/, Simple build tool
• Who’s using Scala?
– Twitter, LinkedIn, …
• Powered by Scala
– Apache Spark, Apache Kafka, Akka,…

Outline
• Spark
• Scala
– Background
• Why Scala?
– For the Hadoop/Spark developer

Scala for the computer scientist:
functional programming (FP)

functional programming (FP)
• Math functions, e.g., f(x) = y
– A function has a single responsibility
– A function has no side effects
– A function is referentially transparent
• A function outputs the same value for the same inputs.
• Functional programming
– Expresses computation as the evaluation and composition of
mathematical functions
– Avoids side effects and mutable state
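To make referential transparency concrete, here is a small sketch contrasting a pure function with an impure one. The names `addTax`, `addToTotal` and `runningTotal` are hypothetical and chosen only for illustration.

```scala
// Pure: the result depends only on the arguments, so a call can be replaced by its value.
def addTax(amount: Int, ratePercent: Int): Int = amount + amount * ratePercent / 100

// Impure: reads and mutates state outside the function, so repeated calls differ.
var runningTotal = 0
def addToTotal(amount: Int): Int = { runningTotal += amount; runningTotal }

val pureFirst = addTax(200, 10)     // 220
val pureSecond = addTax(200, 10)    // 220 again: referentially transparent
val impureFirst = addToTotal(200)   // 200
val impureSecond = addToTotal(200)  // 400: same input, different output
```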

Why functional programming?
• Multi core processors
• Concurrency
– Computation as a series of independent data transformations
– Parallel data transformations without side effects
• Referential transparency

functional programming
• Functions
– Lambda, closure
• For-comprehensions
• Type inference
• Pattern matching
• Higher order functions
– map, flatMap, foldLeft
• And more …
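Pattern matching and type inference are listed above without an example. The following sketch shows both, using hypothetical `Shape`, `Circle` and `Rect` types that exist only for this illustration.

```scala
// Hypothetical case classes used only for this illustration.
sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rect(width: Double, height: Double) extends Shape

// Pattern matching destructures each case; because Shape is sealed,
// the compiler can warn if a case is missing.
def area(shape: Shape): Double = shape match {
  case Circle(r)  => math.Pi * r * r
  case Rect(w, h) => w * h
}

val shapes: List[Shape] = List(Circle(1.0), Rect(2.0, 3.0))

// Type inference: no annotation is needed, areas is inferred as List[Double].
val areas = shapes map area
```

In the REPL, enter the trait and case classes together with `:paste` so that the sealed hierarchy is defined in one unit.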

FP: functions
• Anonymous function
– Function without a name
– lambda function
• Example
– scala> List(100, 200, 300) map { _ * 10/100}
– res0: List[Int] = List(10, 20, 30)
• Closure (Wikipedia)
– Closure = A function, together with a referencing environment – a
table storing a reference to each of the non-local variables of that
function.
– A closure allows a function to access those non-local variables
even when invoked outside its immediate lexical scope.

FP: functions
• applyPercentage is an example of a closure
– scala> var percentage = 10
– percentage: Int = 10
– scala> val applyPercentage = (amount: Int) => amount * percentage / 100
– applyPercentage: Int => Int = <function1>
– scala> percentage = 20
– percentage: Int = 20
– scala> List(100, 200, 300) map applyPercentage
– res1: List[Int] = List(20, 40, 60)
– scala>

FP: functions
• Closure

FP: Higher order functions
scala> :load Person.scala
Loading Person.scala...
defined class Person
scala> val jd = new Person("John", "Doe", 17)
jd: Person = Person@372a6e85
scala> val rh = new Person("Roger", "Huang", 34)
rh: Person = Person@611c4041
scala> val people = Array(jd, rh)
people: Array[Person] = Array(Person@372a6e85, Person@611c4041)
scala> val (minors, adults) = people partition (_.age < 18)
minors: Array[Person] = Array(Person@372a6e85)
adults: Array[Person] = Array(Person@611c4041)
scala>

FP: Higher order functions
• HOF
– takes a function as an argument
– Returns a function
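Both halves of the definition can be shown in a few lines. This is a sketch with hypothetical names (`applyTwice`, `percentageOf`): the first function takes a function as an argument, and the second returns one.

```scala
// Takes a function as an argument: applies f twice to x.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// Returns a function: builds a percentage calculator for a fixed rate.
def percentageOf(rate: Int): Int => Int = (amount: Int) => amount * rate / 100

val tenPercent = percentageOf(10)
val fees = List(100, 200, 300) map tenPercent   // List(10, 20, 30)
val quadrupled = applyTwice(_ * 2, 5)           // 20
```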

FP: Higher order functions: map
• Creates a new collection from an existing collection by applying
a function
scala> List(1, 2, 3 ) map { (x: Int) => x + 1 }
res0: List[Int] = List(2, 3, 4)
• Function literal
scala> List(1, 2, 3) map { _ + 1 }
• Passing an existing function
scala> def addOne(num: Int) = num + 1
addOne: (num: Int)Int
scala> List(1, 2, 3) map addOne

FP: Higher order functions: map

FP: Higher order functions: flatMap
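A short sketch of how flatMap differs from map, using made-up sample strings: map produces one result per input element, so a function that returns a list gives a nested list. flatMap applies the same function and concatenates the results.

```scala
val sentences = List("fast cluster computing", "functional language")

// map keeps one output per input, so the result is nested:
// List(List(fast, cluster, computing), List(functional, language))
val nested = sentences map { s => s.split(" ").toList }

// flatMap applies the same function and concatenates the results:
// List(fast, cluster, computing, functional, language)
val words = sentences flatMap { s => s.split(" ").toList }

// flatMap is equivalent to map followed by flatten.
val sameWords = nested.flatten
```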

FP: for-comprehension
• Syntax
– for ( <generator> | <guard> ) <expression> [yield] <expression>
• Types
– Imperative form. Does not return a value.
scala> val aList = List(1, 2, 3)
aList: List[Int] = List(1, 2, 3)
scala> val bList = List(4, 5, 6)
bList: List[Int] = List(4, 5, 6)
scala> for { a <- aList; if (a < 2); b <- bList; if (b < 7) } println( a + b )
5
6
7

• Syntax
– for ( <generator> | <guard> ) <expression> [yield] <expression>
• Types
– Functional form (a.k.a. sequence comprehension). Returns/yields
a value.
scala> for { a <- aList; b <- bList} yield a + b
res0: List[Int] = List(5, 6, 7, 6, 7, 8, 7, 8, 9)
scala> res0.take(1)
res1: List[Int] = List(5)
scala> for { a <- aList; if (a < 2); b <- bList } yield a + b
res2: List[Int] = List(5, 6, 7)
scala>
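A for-comprehension is syntactic sugar over higher order functions. The sketch below shows roughly what the compiler generates for the guarded example. The translation is simplified, and the names `viaFor`, `viaCalls` and `printed` are illustrative.

```scala
val aList = List(1, 2, 3)
val bList = List(4, 5, 6)

// Functional form with a guard.
val viaFor = for { a <- aList; if a < 2; b <- bList } yield a + b

// Roughly the desugared form: withFilter for the guard,
// flatMap for the outer generator, map for the innermost one.
val viaCalls = aList.withFilter(a => a < 2).flatMap(a => bList.map(b => a + b))

// Imperative form (no yield) translates to foreach and returns Unit,
// so any result has to be collected through a side effect.
var printed = List.empty[Int]
for { a <- aList; if a < 2; b <- bList } { printed = printed :+ (a + b) }
```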

FP: foldLeft
• scala> val numbers = 1.to(10)
• numbers: scala.collection.immutable.Range.Inclusive =
Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• scala> def add( a:Int, b:Int ): Int = { a + b }
• add: (a: Int, b: Int)Int
• scala> numbers.foldLeft(0){ add }
• res0: Int = 55
• scala> numbers.foldLeft(0){ (acc, b) => acc + b }
• res1: Int = 55
• scala>

FP: foldLeft
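To see how foldLeft threads the accumulator through the collection, the same function can be passed to scanLeft, which keeps every intermediate value. This is a sketch reusing the range from the previous example; `total` and `steps` are illustrative names.

```scala
val numbers = 1 to 10

// foldLeft threads an accumulator through the collection from left to right:
// ((((0 + 1) + 2) + 3) + ...) + 10
val total = numbers.foldLeft(0) { (acc, b) => acc + b }

// scanLeft runs the same steps but keeps every intermediate accumulator,
// starting with the initial value: 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55
val steps = numbers.scanLeft(0) { (acc, b) => acc + b }
```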

FP: find the last item in an array
• scala> val ns = Array(20, 40, 60)
• ns: Array[Int] = Array(20, 40, 60)
• scala> ns.foldLeft(ns.head) {(acc, b) => b}
• res0: Int = 60
• scala>

FP: reverse an array w/ foldLeft
• scala> val ns = Array(20, 40, 60)
• ns: Array[Int] = Array(20, 40, 60)
• scala> ns.foldLeft( Array[Int]() ) { (acc, b) => b +: acc}
• res1: Array[Int] = Array(60, 40, 20)
• scala>

FP: reverse an array w/ foldLeft
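The same idea can be written against a List, which is a variation on the slide's Array version rather than a copy of it. Prepending with `::` is a constant-time operation on a List, whereas `+:` on an Array copies the array at every step. The `trace` value is an illustrative addition that records each intermediate accumulator.

```scala
val ns = List(20, 40, 60)

// Prepending each element to the accumulator reverses the order:
// Nil -> List(20) -> List(40, 20) -> List(60, 40, 20)
val reversed = ns.foldLeft(List.empty[Int]) { (acc, b) => b :: acc }

// scanLeft with the same function exposes the intermediate accumulators.
val trace = ns.scanLeft(List.empty[Int]) { (acc, b) => b :: acc }
```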