3. TELEMETRY PINGS
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs (a trimmed example follows below)
• The backend validates pings and stores them on S3
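A hypothetical, heavily trimmed ping blob for illustration; real pings carry far more fields, but 'info'/'OS' is the field the map-reduce example on the next slide reads:

ping = {
    "info": {
        "OS": "Linux"   # operating system of the submitting client
    }
}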
5. TELEMETRY MAP-REDUCE
import json

# Map/reduce job for the telemetry-server framework: map() is called once
# per ping, reduce() once per emitted key.
def map(k, d, v, cx):
    j = json.loads(v)       # each ping value is a JSON blob
    os = j['info']['OS']    # the client's operating system
    cx.write(os, 1)         # emit (OS, 1) for counting

def reduce(k, v, cx):
    cx.write(k, sum(v))     # total number of pings per OS
• Processes pings from S3 using a map-reduce framework written in Python (shown above)
• https://github.com/mozilla/telemetry-server
6. SHORTCOMINGS
• Not distributed, limited to a single machine
• Doesn’t support chains of map/reduce ops
• Doesn’t support SQL-like queries
• Batch oriented
9. WHAT IS SPARK?
• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop)
• Comes with over 80 distributed operations for grouping, filtering, etc. (see the sketch after this list)
• Runs standalone or on Hadoop and Mesos, and on TaskCluster in the future (right, Jonas?)
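A minimal PySpark sketch of such operations; this assumes a local pyspark installation, and the data is illustrative:

from pyspark import SparkContext

sc = SparkContext("local[*]", "demo")            # run locally on all cores
rdd = sc.parallelize(["Linux", "Darwin", "Linux", "WINNT"])
counts = (rdd.filter(lambda os: os != "Darwin")  # distributed filtering
             .map(lambda os: (os, 1))            # distributed mapping
             .reduceByKey(lambda a, b: a + b))   # distributed grouping
print(counts.collect())                          # e.g. [('Linux', 2), ('WINNT', 1)]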
10. WHY DO WE CARE?
• In-memory caching (sketched after this list)
• Interactive command-line interface for EDA (think R command line)
• Comes with higher-level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
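A sketch of the caching point, as one might type it into the interactive shell; the file name is hypothetical, and sc is the SparkContext from the sketch above:

pings = sc.textFile("pings.json")  # one JSON blob per line (hypothetical file)
pings.cache()                      # mark the RDD to be kept in memory
pings.count()                      # 1st action: reads from disk, then caches
pings.count()                      # 2nd action: served from memory, much faster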
11. WHY DO WE REALLY CARE?
The easier we make it to get answers,
the more questions we will ask
13. HOW DOES IT WORK?
• The user creates Resilient Distributed Datasets (RDDs), transforms them, and executes actions on them
• RDD operations are compiled to a DAG of operators
• The DAG is compiled into stages
• A stage is executed in parallel as a series of tasks (see the word-count sketch below)
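A minimal sketch of that flow in PySpark, using word count (file names are hypothetical; sc is the SparkContext from the earlier sketch). The narrow read/map operators are pipelined into one stage; the shuffle introduced by reduceByKey starts a new one:

lines = sc.textFile("input.txt")                        # RDD[String]
words = lines.map(lambda l: l.split())                  # RDD[Array[String]] (narrow op)
pairs = words.flatMap(lambda ws: [(w, 1) for w in ws])  # RDD[(String, Int)] (narrow op)
counts = pairs.reduceByKey(lambda a, b: a + b)          # shuffle: a new stage begins here
counts.saveAsTextFile("output")                         # action: triggers DAG execution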
14. RDD
A parallel dataset with partitions
[Diagram: a table of observations (rows) against variables Var A, Var B and Var C (columns), split row-wise into partitions]
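A small sketch of partitions in PySpark (values illustrative):

rdd = sc.parallelize(range(8), 4)  # 8 observations split into 4 partitions
print(rdd.getNumPartitions())      # 4
print(rdd.glom().collect())        # [[0, 1], [2, 3], [4, 5], [6, 7]]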
17. STAGE
Set of tasks that can run in parallel
[Diagram: a job reads input into an RDD[String], two map steps produce an RDD[Array[String]] and an RDD[(String, Int)], and a shuffle writes the output; the read and map operators over partitions P1-P4 are pipelined into parallel tasks T1-T4, labelled Stage 1]
18. STAGE
Set of tasks that can run in parallel
[Diagram: the same operator DAG split at the shuffle boundary into Stage 1 and Stage 2]
19. STAGE
Set of tasks that can run in parallel
• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution (see the lineage sketch below):
1. Fetch input
2. Execute
3. Output result
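One hedged way to see the stage boundary for yourself, assuming the word-count counts RDD from the earlier sketch:

# The lineage shows the shuffle boundary; each indented group of RDDs
# corresponds to one stage, executed as a set of parallel tasks.
print(counts.toDebugString())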
21. HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it’s started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
22. HANDS-ON
• Connect to the machine through SSH
• Clone the starter template and start the Spark console:
1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git
2. cd mozilla-telemetry-spark && source aws/setup.sh
3. sbt console
• Open http://bit.ly/1wBHHDH