9. WHAT IS SPARK?
• In-memory data analytics cluster computing
framework (up to 100x faster than Hadoop)
• Comes with over 80 distributed operations for
grouping, filtering, etc.
• Runs standalone or on Hadoop and Mesos,
and in the future on TaskCluster (right, Jonas?)
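A minimal sketch of what a couple of those distributed operations look like, assuming the Spark 1.x Scala API and a local master (names like `ops-sketch` are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperationsSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all cores of one machine; no cluster required
    val sc = new SparkContext(
      new SparkConf().setAppName("ops-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10)    // distribute a local collection
    val evens   = numbers.filter(_ % 2 == 0) // distributed filtering
    val byMod   = evens.groupBy(_ % 3)       // distributed grouping

    println(byMod.collect().toMap)
    sc.stop()
  }
}
```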
10. WHY DO WE CARE?
• In-memory caching
• Interactive command line interface for EDA (think R command line)
• Comes with higher-level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup;
doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
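As a sketch of that interactive, single-machine workflow: inside `spark-shell` a SparkContext named `sc` is already defined; here it is created explicitly (with a hypothetical app name) so the snippet is self-contained:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("eda-sketch").setMaster("local[*]"))

    // The classic interactive one-liner: word counts over a tiny dataset
    val lines  = sc.parallelize(Seq("spark is fast", "spark is fun"))
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```

In the actual shell you would type the transformations line by line and inspect results as you go, much like an R session.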
11. WHY DO WE REALLY CARE?
The easier we make it to get answers,
the more questions we will ask
13. HOW DOES IT WORK?
• User creates Resilient Distributed Datasets (RDDs),
transforms and executes them
• RDD operations are compiled into a DAG of operations
• DAG is compiled into stages
• A stage is executed in parallel as a series of tasks
[Diagram: an RDD is a parallel dataset with partitions]
[Diagram: a stage is a set of tasks that can run in parallel]
• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution:
1. Fetch input
2. Execute the task
3. Output result
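The pipeline above can be observed from the shell: `toDebugString` prints an RDD's lineage (the DAG), and a shuffle operation such as `reduceByKey` is what introduces a stage boundary. A sketch, assuming the Spark 1.x Scala API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

    val words  = sc.parallelize(Seq("spark", "hadoop", "spark"))
    val counts = words.map((_, 1)).reduceByKey(_ + _) // shuffle => new stage

    // Transformations only build the DAG; nothing has executed yet.
    println(counts.toDebugString)

    // The action triggers the scheduler: stages, then parallel tasks.
    println(counts.collect().toMap)
    sc.stop()
  }
}
```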
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the
server once it’s started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon EC2.
• Connect to the machine via SSH
• Clone the starter template:
1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git
2. cd mozilla-telemetry-spark && source aws/setup.sh
3. sbt console
• Open http://bit.ly/1wBHHDH