Scala and Spark are Ideal for Big Data
John Nestor
47 Degrees
Seattle Unstructured Data Science Pop-Up
October 7, 2015
www.47deg.com
Why Scala?
• Strong typing
• Concise, elegant syntax
• Runs on the JVM (Java Virtual Machine)
• Supports both object-oriented and functional programming
• Scales from small, simple programs to large parallel distributed systems
• Easy to cleanly extend with new libraries and DSLs
• Ideal for parallel and distributed systems
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
  • Compile-time checks
  • Better modularity via strongly typed interfaces
  • Easier maintenance: types make code easier to understand
• Concise syntax like Python.
  • Type inference. The compiler infers most types that had to be explicit in Java.
  • Powerful syntax that avoids much of the boilerplate of Java code (see next slide).
• Best of both worlds: the safety of strong typing (like Java) with the conciseness of Python.
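A minimal sketch of type inference combined with compile-time checking. The object and value names here are illustrative only and do not come from the talk.

```scala
object TypeInferenceSketch {
  // The compiler infers Int, String, and List[Int]; no annotations are needed.
  val count = 42
  val greeting = "hello"
  val ages = List(30, 41, 17)

  // Inferred as Map[String, Int].
  val ageByName = Map("Joe" -> 30, "Ann" -> 41)

  // Parameter types are explicit; the return type (Double) is inferred.
  def average(xs: List[Int]) = xs.sum.toDouble / xs.size

  // Strong typing still applies. Uncommenting the next line gives a
  // compile-time type mismatch, not a runtime failure:
  // val wrong: Int = greeting

  def main(args: Array[String]): Unit =
    println(average(ages))
}
```

The declarations read almost like a dynamically typed script, yet every value has a static type that the compiler checks.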
Scala Case Class
• Java version
class User {
    private String name;
    private int age;
    public User(String name, int age) {
        this.name = name; this.age = age;
    }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
User joe = new User("Joe", 30);
• Scala version
case class User(name: String, var age: Int)
val joe = User("Joe", 30)
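Beyond the constructor and accessors, a case class also gets generated equality, toString, copy, and pattern-matching support. The sketch below illustrates these using the User class from the slide; the describe helper and the surrounding object are hypothetical additions for the example.

```scala
object CaseClassSketch {
  case class User(name: String, var age: Int)

  // Pattern matching destructures the fields through the generated extractor.
  def describe(u: User): String = u match {
    case User(n, a) if a >= 18 => s"$n is an adult"
    case User(n, _)            => s"$n is a minor"
  }

  def main(args: Array[String]): Unit = {
    val joe = User("Joe", 30)
    joe.age = 31                        // setter exists because age is a var
    val jim = joe.copy(name = "Jim")    // copy with one field changed
    println(joe)                        // uses the generated toString
    println(joe == User("Joe", 31))     // structural, not reference, equality
    println(describe(jim))
  }
}
```

Writing the same behavior by hand in Java would require equals, hashCode, toString, and a copy method in addition to the code shown on the slide.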
Functional Scala
• Anonymous functions.
    (a: Int, b: Int) => a + b
• Functions that take and return other functions.
• Rarely need variables or loops
• Immutable collections: Seq[T], Map[K,V], …
  • Works well with concurrent or distributed systems
  • Natural for functional programming
• Functional collection operations (a small sample)
  • map, flatMap, reduce, …
  • filter, groupBy, sortBy, take, drop, …
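The points above can be sketched with the standard immutable collections. The word list and all value names are made up for illustration.

```scala
object FunctionalSketch {
  // An anonymous function stored in a value.
  val add: (Int, Int) => Int = (a: Int, b: Int) => a + b

  // A function that takes a function and returns a new function.
  def twice(f: Int => Int): Int => Int = x => f(f(x))

  // An immutable sequence; every operation below returns a new collection.
  val words = Seq("spark", "scala", "akka", "sbt", "slick")

  // map + reduce: total number of characters, with no variables or loops.
  val totalLength: Int = words.map(_.length).reduce(add)

  // groupBy + map: how many words start with each letter.
  val countsByInitial: Map[Char, Int] =
    words.groupBy(_.head).map { case (c, ws) => c -> ws.size }

  // sortBy + take: the two shortest words.
  val shortestTwo: Seq[String] = words.sortBy(_.length).take(2)

  // filter + drop: words starting with 's', skipping the first match.
  val laterS: Seq[String] = words.filter(_.startsWith("s")).drop(1)

  def main(args: Array[String]): Unit = {
    println(totalLength)
    println(countsByInitial)
    println(shortestTwo)
    println(twice(_ + 3)(10))
  }
}
```

Because nothing is mutated, each of these expressions can be evaluated independently, which is what makes the style a good fit for concurrent and distributed execution.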
Scala Availability and Support
• Open Source
• Typesafe provides support. Founded by Martin Odersky, who designed Scala.
• IDEs: IntelliJ IDEA and Eclipse
• Libraries: lots now and more every day
  • ScalaNLP - Epic (natural language processing)
• Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angie's List, Whitepages
• Major systems written in Scala: Spark, Kafka
Typesafe Scala Components
• Scala Compiler (includes REPL)
• Scala Standard Libraries
• SBT - Scala Build Tool
• Play - scalable web applications
• Scala.js - compiles Scala to JavaScript
• Akka - for parallel and distributed computation
• Spray - high-performance asynchronous TCP/HTTP library
• Spark - Typesafe also supports Spark
• Slick - for SQL database access
• ConductR - Scala deployment/devops tool
• Reactive Monitoring (Beta)
Why Spark?
• Supports not only batch but also (near) real-time processing
• Fast - keeps data in memory as much as possible
  • Often 10X to 100X the speed of Hadoop MapReduce
• A clean, easy-to-use API
• A richer set of functional operations than just map and reduce
• A foundation for a wide set of integrated data applications
• Can recover from failures - by recomputation or (optionally) replication
• Scales out to handle very large data sets and to reduce processing time
Spark RDDs
• RDD[T] - resilient distributed dataset
  • typed (elements must be serializable)
  • immutable
  • ordered
  • can be processed in parallel
  • lazy evaluation - permits more global optimizations
• Rich set of functional operations (a small sample)
  • map, flatMap, reduce, …
  • filter, groupBy, sortBy, take, …
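A minimal word-count sketch showing typed RDDs and lazy evaluation. It assumes the spark-core library is on the classpath and runs against a local master; the object name, the wordCounts helper, and the sample input are illustrative and not taken from the talk's demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSketch {
  // Transformations only: this builds a lineage of RDDs but runs nothing yet.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM, using all available cores.
    val conf = new SparkConf().setAppName("RddSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("Scala and Spark", "Spark and Akka"))
      val counts = wordCounts(lines)      // still lazy at this point
      counts.collect().foreach(println)   // collect() is an action and triggers the job
    } finally {
      sc.stop()
    }
  }
}
```

The operations mirror those on ordinary Scala collections, but each one describes a distributed computation that Spark schedules only when an action such as collect is called.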
Spark Components
• Spark Core
  • Scalable multi-node cluster
  • Failure detection and recovery
  • RDDs and functional operations
• MLlib - for machine learning
  • linear regression, SVMs, clustering, collaborative filtering, dimensionality reduction
  • more on the way!
• GraphX - for graph computation
• Streaming - for near real-time
• DataFrames - for SQL and JSON
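As one illustration of the DataFrames component, the sketch below runs a SQL query over a small in-memory data set. It assumes the spark-sql library is on the classpath and uses the SparkSession entry point from Spark 2.0 and later, which is newer than this talk; the User class, the adultNames helper, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  // Defined at object level so Spark can derive an encoder for it.
  case class User(name: String, age: Int)

  // Registers the rows as a temporary SQL view and queries it.
  def adultNames(spark: SparkSession, users: Seq[User]): Seq[String] = {
    import spark.implicits._
    users.toDF().createOrReplaceTempView("users")
    spark.sql("SELECT name FROM users WHERE age >= 18 ORDER BY name")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameSketch")
      .master("local[*]")
      .getOrCreate()
    try {
      val users = Seq(User("Joe", 30), User("Ann", 41), User("Bob", 17))
      adultNames(spark, users).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

The same session can also read JSON files into a DataFrame with spark.read.json, so SQL queries and JSON input share one API.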
Spark Availability and Support
• Open Source - top-level Apache project
• Over 750 contributors from over 200 organizations
• Can process multiple petabytes on clusters of over 8000 nodes
• Databricks: Matei Zaharia, who wrote the original Spark, is a founder and CTO
• Packages (more every day)
  • Zeppelin - Scala notebooks
  • Cassandra, Kafka connectors
Clusters and Scalability
• Scala Akka clusters (process distribution, microservices)
  • message passing
  • remote Actors
• Spark clusters (data distribution)
  • local
  • Standalone (optionally with ZooKeeper)
  • Apache Mesos
  • Hadoop YARN
• All of the above can run on Amazon and Google clouds
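In application code, the choice among the Spark cluster options above usually comes down to the master URL handed to Spark. The sketch below maps each option to the URL form described in the Spark documentation. The ClusterManager type and masterUrl helper are hypothetical, the host names in the example are placeholders, and the plain "yarn" value applies to Spark 2.0 and later (releases from around the time of this talk used "yarn-client" or "yarn-cluster" instead).

```scala
object MasterUrlSketch {
  // Hypothetical model of the deployment options listed on the slide.
  sealed trait ClusterManager
  case object Local extends ClusterManager
  final case class Standalone(host: String, port: Int = 7077) extends ClusterManager
  final case class Mesos(host: String, port: Int = 5050) extends ClusterManager
  case object Yarn extends ClusterManager

  // Builds the string that would be passed to SparkConf.setMaster(...)
  // or to spark-submit --master.
  def masterUrl(cm: ClusterManager): String = cm match {
    case Local               => "local[*]"             // all cores of one machine
    case Standalone(host, p) => s"spark://$host:$p"    // Spark's own cluster manager
    case Mesos(host, p)      => s"mesos://$host:$p"
    case Yarn                => "yarn"                 // cluster located via Hadoop config
  }

  def main(args: Array[String]): Unit = {
    // "master.example.com" is a placeholder host name.
    Seq(Local, Standalone("master.example.com"), Mesos("master.example.com"), Yarn)
      .map(masterUrl)
      .foreach(println)
  }
}
```

The application logic itself does not change between these modes; only the master URL and the cluster that backs it differ.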
Why Scala for Spark?
• Why not Python, R, or Java for Spark?
• Spark is written in Scala
• The Scala source code is itself important Spark documentation
• Spark is best extended in Scala
• The primary API for Spark is Scala
• The functional features of Scala and Spark are a natural fit and are easiest to use in Scala
• If you want to build scalable, high-performance production code based on Spark: R by itself is too specialized, Python is too slow, and Java is tedious to write and maintain
Demo
Seattle Resources
• Seattle Meetups
• Scala at the Sea Meetup
http://www.meetup.com/Seattle-Scala-User-Group/
• Seattle Spark Meetup
http://www.meetup.com/Seattle-Spark-Meetup/
• Seattle Training: Spark and Typesafe Scala Classes
http://www.47deg.com/events#training
• UW Scala Professional Certificate Program
http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html