Spark meets Telemetry

SPARK MEETS TELEMETRY
Mozlandia 2014
Roberto Agostino Vitillo

TELEMETRY PINGS
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs
• The backend validates the pings and stores them on S3

TELEMETRY MAP-REDUCE
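import json
def map(k, d, v, cx):
    # v is the raw ping, a JSON string
    j = json.loads(v)
    os = j['info']['OS']
    # emit one count per ping, keyed by operating system
    cx.write(os, 1)
def reduce(k, v, cx):
    # v holds the counts emitted for key k
    cx.write(k, sum(v))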
• Processes pings from S3 using a map-reduce framework written in Python
• https://github.com/mozilla/telemetry-server
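
To make the data flow of the job above concrete, here is a minimal local sketch that runs the same two functions over a few hand-written pings. The `LocalContext` class, the `run_job` driver and the sample pings are hypothetical and only mimic the `cx.write` call seen on the slide; they are not the telemetry-server implementation. The functions are named `map_os` and `reduce_os` here only to avoid shadowing Python's built-in `map`.

```python
import json
from collections import defaultdict


class LocalContext:
    """Hypothetical stand-in for the framework's context object (cx)."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


def map_os(k, d, v, cx):
    # Same body as the slide's map: v is a JSON ping, k and d are unused.
    j = json.loads(v)
    cx.write(j['info']['OS'], 1)


def reduce_os(k, v, cx):
    # Same body as the slide's reduce: sum the counts for one key.
    cx.write(k, sum(v))


def run_job(pings):
    """Map every ping, group the emitted values by key, reduce each group."""
    map_cx = LocalContext()
    for i, ping in enumerate(pings):
        map_os(str(i), None, ping, map_cx)

    groups = defaultdict(list)
    for key, value in map_cx.records:
        groups[key].append(value)

    reduce_cx = LocalContext()
    for key, values in groups.items():
        reduce_os(key, values, reduce_cx)
    return dict(reduce_cx.records)


# Made-up pings that carry only the field the job reads.
sample_pings = [
    json.dumps({"info": {"OS": "WINNT"}}),
    json.dumps({"info": {"OS": "Darwin"}}),
    json.dumps({"info": {"OS": "WINNT"}}),
]
os_counts = run_job(sample_pings)
print(os_counts)
```

Everything here happens in one process, which is the single-machine model the next slide lists as a shortcoming.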

SHORTCOMINGS
• Not distributed, limited to a single machine
• Doesn’t support chains of map/reduce ops
• Doesn’t support SQL-like queries
• Batch oriented

Figure: Spark sets a new record in large-scale sorting (source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html)

WHAT IS SPARK?
• In-memory cluster computing framework for data analytics (up to 100x faster than Hadoop MapReduce on in-memory workloads)
• Comes with over 80 distributed operations for grouping, filtering, etc.
• Runs standalone, on Hadoop or on Mesos, and on TaskCluster in the future (right, Jonas?)

WHY DO WE CARE?
• In-memory caching
• Interactive command line interface for exploratory data analysis (think R command line)
• Comes with higher-level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Java and Python APIs are available, with Clojure and R through third-party bindings

WHY DO WE REALLY CARE?
The easier we make it to get answers, the more questions we will ask

HOW DOES IT WORK?
• The user creates Resilient Distributed Datasets (RDDs), transforms them and triggers their execution
• RDD operations are compiled to a DAG of operators
• The DAG is compiled into stages
• Each stage is executed in parallel as a series of tasks
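
The "create, transform, execute" pattern in the first bullet can be illustrated without Spark. The sketch below is a toy model in plain Python: `LazyDataset` is a hypothetical class, not Spark's RDD, and it only shows that transformations record a plan while nothing runs until a result is requested.

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset (not Spark's RDD class)."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the operation; do not touch the data yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def plan(self):
        # The recorded chain of operators, in execution order.
        return ["read"] + [name for name, _ in self._ops]

    def collect(self):
        # Only now is the recorded chain applied to every item.
        out = []
        for item in self._source:
            keep = True
            for name, fn in self._ops:
                if name == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
planned = ds.plan()
ran_before_collect = len(calls)
result = ds.collect()
print(planned, ran_before_collect, result)
```

Because the whole chain is known before anything executes, a scheduler is free to reorganise it into stages, which is what the following slides describe.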

RDD
A parallel dataset with partitions
[Figure: a table with columns Var A, Var B and Var C; its observation rows are grouped into partitions]

DAG
Logical graph of RDD operations
sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
[Figure: read → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)], drawn over partitions P1 to P4]
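
For readers who do not know Scala, the pipeline above can be modelled step by step in plain Python. This is a sketch of what each operator computes, not Spark code: the input lines are made up, `reduce_by_key` is a hypothetical helper, and the CRC32-based partitioning is an assumption chosen to keep the example deterministic (Spark uses its own partitioner).

```python
import zlib

# Made-up contents of the "input" file, one record per line.
lines = ["a,1", "b,2", "a,3", "c,4", "b,5"]

# First map: split each line into fields (RDD[Array[String]] on the slide).
fields = [line.split(",") for line in lines]

# Second map: build (key, value) pairs (RDD[(String, Int)] on the slide).
pairs = [(f[0], int(f[1])) for f in fields]


def reduce_by_key(pairs, fn, num_partitions):
    """Combine values per key, spreading the keys over num_partitions buckets."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        # Assumption: a stable hash decides which output partition owns a key.
        idx = zlib.crc32(key.encode("utf-8")) % num_partitions
        bucket = partitions[idx]
        bucket[key] = fn(bucket[key], value) if key in bucket else value
    return partitions


# Counterpart of reduceByKey(_ + _, 3): sum per key into 3 partitions.
partitions = reduce_by_key(pairs, lambda a, b: a + b, 3)
totals = {k: v for part in partitions for k, v in part.items()}
print(totals)
```

All values for one key must end up in the same output partition, which is why this step needs data from every input partition.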

STAGE
[Figure: the same DAG over partitions P1 to P4, split into Stage 1 (read, map, map) and Stage 2 (reduceByKey)]

STAGE
Set of tasks that can run in parallel
[Figure: Stage 1 runs tasks T1 to T4, one per partition P1 to P4; each task reads its input, applies the two maps and writes its shuffle output]
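
The idea of a stage as a set of independent tasks, one per partition, can be sketched with Python's standard thread pool. The partition contents and the `run_task` and `run_stage` helpers are hypothetical, and threads merely stand in for the workers of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up input, already divided into four partitions.
partitions = {
    "P1": ["a,1", "b,2"],
    "P2": ["a,3"],
    "P3": ["c,4", "b,5"],
    "P4": ["c,6"],
}


def run_task(lines):
    """One task: read a partition, apply both maps, emit key/value pairs."""
    fields = (line.split(",") for line in lines)
    return [(f[0], int(f[1])) for f in fields]


def run_stage(partitions, workers=4):
    """Run one task per partition; no task depends on another's output."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of its inputs.
        outputs = list(pool.map(run_task, partitions.values()))
    return dict(zip(partitions.keys(), outputs))


stage_output = run_stage(partitions)
print(stage_output)
```

The tasks can run side by side because each one only touches its own partition; combining values across partitions is left to the next stage.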

TASK
• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution
  1. Fetch input
  2. Execute
  3. Output result
[Figure: task 1 to task 4]
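
The three execution steps listed above can be sketched as a small message round trip. This is not how Spark encodes its tasks: as a simplifying assumption, the sketch serialises a task as a JSON message that names a registered function, and `OPERATIONS`, `serialise_task` and `execute_task` are hypothetical names.

```python
import json


def parse_line(line):
    key, value = line.split(",")
    return [key, int(value)]


# Assumption: workers know a fixed registry of operations by name.
OPERATIONS = {"parse_line": parse_line}


def serialise_task(task_id, op_name, records):
    """Driver side: turn a task into bytes-on-the-wire (here a JSON string)."""
    return json.dumps({"id": task_id, "op": op_name, "input": records})


def execute_task(message):
    """Worker side: deserialise the task and run the three steps."""
    task = json.loads(message)
    records = task["input"]                      # 1. fetch input
    op = OPERATIONS[task["op"]]
    results = [op(r) for r in records]           # 2. execute
    return json.dumps({"id": task["id"], "output": results})  # 3. output result


message = serialise_task(1, "parse_line", ["a,1", "b,2"])
reply = json.loads(execute_task(message))
print(reply)
```

Here the input travels inside the message for brevity; a worker could just as well receive a reference to its partition and fetch the data itself.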

HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it has started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.

HANDS-ON
• Connect to the machine through SSH
• Clone the starter template and start the console:
  1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git
  2. cd mozilla-telemetry-spark && source aws/setup.sh
  3. sbt console
• Open http://bit.ly/1wBHHDH
