Scala and Spark are Ideal for Big Data
1. Scala and Spark are Ideal for Big Data
John Nestor
47 Degrees
Seattle Unstructured Data Science Pop-Up
October 7, 2015
www.47deg.com
2. 47deg.com
Why Scala?
• Strong typing
• Concise elegant syntax
• Runs on JVM (Java Virtual Machine)
• Supports both object-oriented and functional
• Scales from small, simple programs to large parallel distributed
systems
• Easy to cleanly extend with new libraries and DSLs
• Ideal for parallel and distributed systems
3. 47deg.com
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
• Compile time checks
• Better modularity via strongly typed interfaces
• Easier maintenance: types make code easier to
understand
• Concise syntax like Python.
• Type inference. Compiler infers most types that had to be
explicit in Java.
• Powerful syntax that avoids much of the boilerplate of Java
code (see next slide).
• Best of both worlds: safety of strong typing with conciseness
(like Python).
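A minimal sketch of type inference combined with compile-time checking. The object and value names are illustrative only and are not taken from the talk.

```scala
object TypeInferenceDemo {
  // The compiler infers List[Int]; no annotation is needed.
  val ages = List(30, 41, 25)

  // Inferred as Int.
  val total = ages.sum

  // Inferred as Map[String, Int].
  val ageByName = Map("Joe" -> 30, "Ann" -> 41)

  // Parameter types are declared; the return type (Double) is inferred.
  def average(xs: List[Int]) = xs.sum.toDouble / xs.size

  // The next line would be rejected at compile time, before the program runs:
  // val bad: Int = "thirty"   // error: type mismatch
}
```

None of the values above carries a type annotation, yet each one is statically typed and checked by the compiler.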
4. 47deg.com
Scala Case Class
• Java version
class User {
    private String name;
    private int age;
    public User(String name, int age) {
        this.name = name; this.age = age;
    }
    public String getName() { return name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
User joe = new User("Joe", 30);
• Scala version
case class User(name: String, var age: Int)
val joe = User("Joe", 30)
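Beyond the constructor and accessors, a case class also gets generated `toString`, `equals`, `copy`, and pattern-matching support. The sketch below illustrates these with the same `User` class. The wrapper object and the `greeting` helper are hypothetical names added for the example.

```scala
object CaseClassDemo {
  case class User(name: String, var age: Int)

  val joe = User("Joe", 30)           // no `new` needed: a factory method is generated
  val sameJoe = User("Joe", 30)

  val description = joe.toString      // generated toString: "User(Joe,30)"
  val equalByValue = joe == sameJoe   // generated equals compares fields, not references
  val olderJoe = joe.copy(age = 31)   // generated copy; joe itself is unchanged

  // Pattern matching uses the generated extractor.
  def greeting(u: User): String = u match {
    case User(name, age) if age >= 18 => s"Hello, $name"
    case User(name, _)                => s"Hi, $name"
  }
}
```

A hand-written Java class would need explicit `toString`, `equals`, and `hashCode` methods to match this behavior.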
5. 47deg.com
Functional Scala
• Anonymous functions.
(a:Int,b:Int) => a+b
• Functions that take and return other functions.
• Rarely need variables or loops
• Immutable collections: Seq[T], Map[K,V], …
• Works well with concurrent or distributed systems
• Natural for functional programming
• Functional collection operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
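The points above can be sketched with the standard Scala collections. All names here (`add`, `twice`, `numbers`, and so on) are illustrative and not from the talk.

```scala
object FunctionalDemo {
  // An anonymous function bound to a name.
  val add = (a: Int, b: Int) => a + b

  // A function that takes a function and returns a new function.
  def twice(f: Int => Int): Int => Int = x => f(f(x))
  val addTwo = twice(_ + 1)

  // Immutable collections: each operation returns a new collection
  // and leaves `numbers` unchanged. No variables or loops are needed.
  val numbers = Seq(1, 2, 3, 4, 5, 6)
  val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)  // Seq(4, 16, 36)
  val sum = numbers.reduce(add)                                  // 21
  val byParity = numbers.groupBy(n => if (n % 2 == 0) "even" else "odd")
  val firstTwoDescending = numbers.sortBy(n => -n).take(2)       // Seq(6, 5)
  val words = Seq("big data", "spark").flatMap(_.split(" "))     // Seq("big", "data", "spark")
}
```

Because nothing is mutated, these operations can be applied from several threads, or to partitions on several machines, without locking.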
6. 47deg.com
Scala Availability and Support
• Open Source
• Typesafe provides support. Co-founded by Martin
Odersky, who designed Scala.
• IDEs: IntelliJ IDEA and Eclipse
• Libraries: lots now and more every day
• ScalaNLP - Epic (natural language processing)
• Major Scala users: LinkedIn, Twitter, Goldman Sachs,
Coursera, Angie's List, Whitepages
• Major systems written in Scala: Spark, Kafka
7. 47deg.com
Typesafe Scala Components
• Scala Compiler (includes REPL)
• Scala Standard Libraries
• SBT - Scala Build Tool
• Play - scalable web applications
• Scala.js - compiles Scala to JavaScript
• Akka - for parallel and distributed computation
• Spray - high performance asynchronous TCP/HTTP library
• Spark - Typesafe also supports Spark
• Slick - for SQL database access
• ConductR - Scala deployment/devops tool
• Reactive Monitoring (Beta)
8. 47deg.com
Why Spark?
• Support for not only batch but also (near) real-time
• Fast - keeps data in memory as much as possible
• Often 10x to 100x faster than Hadoop MapReduce
• A clean easy-to-use API
• A richer set of functional operations than just map and
reduce
• A foundation for a wide set of integrated data
applications
• Can recover from failures - recompute or (optional)
replication
• Scales to very large data sets and reduces processing time
9. 47deg.com
Spark RDDs
• RDD[T] - Resilient Distributed Dataset
• typed (must be serializable)
• immutable
• ordered
• can be processed in parallel
• lazy evaluation - permits more global optimizations
• Rich set of functional operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, …
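A minimal word-count sketch showing typed, immutable RDDs and lazy evaluation. It assumes Spark is on the classpath and runs in local mode. The application name, the sample input, and the `wordCounts` helper are hypothetical and not from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  // Counts words using RDD transformations and returns the result on the driver.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val rdd = sc.parallelize(lines)        // RDD[String]

    // Transformations are lazy: these lines only describe the computation.
    val counts = rdd
      .flatMap(_.split(" "))               // RDD[String], one element per word
      .map(word => (word, 1))              // RDD[(String, Int)]
      .reduceByKey(_ + _)                  // sums the counts for each word

    // collect() is an action: it triggers execution and returns an Array.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark in-process on all available cores (assumed for this sketch).
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      wordCounts(sc, Seq("scala and spark", "spark and big data"))
        .foreach { case (word, n) => println(s"$word: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

The transformation chain reads almost the same as the plain Scala collection version, which is part of why the two fit together well.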
10. 47deg.com
Spark Components
• Spark Core
• Scalable multi-node cluster
• Failure detection and recovery
• RDDs and functional operations
• MLlib - for machine learning
• linear regression, SVMs, clustering, collaborative
filtering, dimension reduction
• more on the way!
• GraphX - for graph computation
• Streaming - for near real-time
• DataFrames - for SQL and JSON
11. 47deg.com
Spark Availability and Support
• Open Source - top level Apache project
• Over 750 contributors from over 200 organizations
• Can process multiple petabytes on clusters of over
8000 nodes
• Databricks: Matei Zaharia, who wrote the original Spark,
is a founder and CTO
• Packages (more every day)
• Zeppelin - Scala notebooks
• Cassandra, Kafka connectors
12. 47deg.com
Clusters and Scalability
• Scala Akka clusters (process distribution, micro services)
• message passing
• remote Actors
• Spark clusters (data distribution)
• local
• Standalone (optionally with ZooKeeper)
• Apache Mesos
• Hadoop YARN
• All of the above can run on Amazon and Google clouds
13. 47deg.com
Why Scala for Spark?
• Why not Python, R, or Java for Spark?
• Spark is written in Scala
• The Scala source code is an important part of Spark's documentation
• Spark is best extended in Scala
• The primary API for Spark is Scala
• The functional features of Scala and Spark are a
natural fit and easiest to use in Scala
• If you want to build scalable, high-performance
production code based on Spark: R by itself is too
specialized, Python is too slow, and Java is tedious to
write and maintain