SPARK MEETS TELEMETRY 
Mozlandia 2014 
Roberto Agostino Vitillo
TELEMETRY PINGS
TELEMETRY PINGS 
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs (a trimmed example below)
• The backend validates and stores pings on S3
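
For illustration only, a heavily trimmed, hypothetical ping blob; real pings carry many more fields, but the map-reduce example two slides down only touches info.OS:

{
  "info": { "OS": "WINNT", "appName": "Firefox" },
  "histograms": { "...": "..." }
}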
TELEMETRY PINGS
TELEMETRY MAP-REDUCE
• Processes pings from S3 using a map-reduce framework written in Python
• https://github.com/mozilla/telemetry-server

import json

def map(k, d, v, cx):
    # k: record key, d: dimensions, v: raw ping (a JSON string), cx: context for emitting pairs
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)  # emit (OS, 1) for each session ping

def reduce(k, v, cx):
    # v: the list of counts emitted for key k
    cx.write(k, sum(v))
SHORTCOMINGS
• Not distributed, limited to a single machine
• Doesn’t support chains of map/reduce ops
• Doesn’t support SQL-like queries
• Batch oriented
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
WHAT IS SPARK?
• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop)
• Comes with over 80 distributed operations for grouping, filtering etc. (sketched below)
• Runs standalone or on Hadoop, Mesos and, in the future, TaskCluster (right, Jonas?)
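
For contrast with the Python job above, a hedged Scala sketch of the same per-OS session count as a chain of Spark RDD operations; the S3 path is a placeholder, and json4s (which ships with Spark) is assumed for parsing:

import org.json4s._
import org.json4s.native.JsonMethods.parse

// Hypothetical input path: one JSON ping per line.
val pings = sc.textFile("s3://telemetry-placeholder/pings")

val counts = pings
  .map(line => parse(line) \ "info" \ "OS")     // extract the OS field
  .collect { case JString(os) => (os, 1) }      // keep well-formed pings, emit (OS, 1)
  .reduceByKey(_ + _)                           // sum the counts per OS

counts.collect().foreach(println)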
WHY DO WE CARE?
• In-memory caching (see the sketch below)
• Interactive command line interface for EDA (think R command line)
• Comes with higher-level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
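
A minimal sketch of the caching plus interactive-EDA loop in the Spark shell (the sample file name is hypothetical, and the string match is a crude stand-in for real parsing):

val pings = sc.textFile("sample-pings.json").cache()     // hypothetical local sample

pings.count()                                            // first action populates the in-memory cache
pings.filter(_.contains("\"OS\":\"WINNT\"")).count()     // later queries read from memory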
WHY DO WE REALLY CARE? 
The easier we make it to get answers, 
the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK?
• The user creates Resilient Distributed Datasets (RDDs), transforms and executes them
• RDD operations are compiled to a DAG of operators (inspectable as shown below)
• The DAG is compiled into stages
• A stage is executed in parallel as a series of tasks
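
The compiled operator graph can be inspected directly; for example, using the pipeline from the DAG slide that follows:

val rdd = sc.textFile("input")
  .map(line => line.split(","))
  .map(parts => (parts(0), parts(1).toInt))
  .reduceByKey(_ + _, 3)

println(rdd.toDebugString)   // prints the RDD lineage that the scheduler turns into stages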
RDD
A parallel dataset with partitions
[diagram: a table of observations (rows) by variables Var A–Var C (columns), split row-wise into partitions]
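
A small sketch making the partitions explicit:

val rdd = sc.parallelize(1 to 1000, 4)                 // ask for 4 partitions
println(rdd.partitions.length)                         // => 4
rdd.mapPartitions(it => Iterator(it.size)).collect()   // => Array(250, 250, 250, 250): rows per partition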
DAG
Logical graph of RDD operations

sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
[diagram: read → map → map → reduceByKey, producing RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)] over partitions P1–P4]

STAGE
[diagram: the same DAG split at the shuffle boundary into Stage 1 and Stage 2]
STAGE
[diagram: within Stage 1, each partition P1–P4 runs read → map → map as a single task T1–T4; the shuffle writes Stage 1’s output and becomes Stage 2’s input]
STAGE
Set of tasks that can run in parallel
[diagram: Stage 1 and Stage 2, each drawn as a group of parallel tasks]
STAGE
Set of tasks that can run in parallel
• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers (see the sketch below)
• Task execution: 1. Fetch input 2. Execute 3. Output result
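
Because closures travel with the tasks, everything a closure captures must be serialisable; a hedged sketch of the classic pitfall (the class and values are hypothetical):

class Multiplier(factor: Int) {                 // note: does not extend Serializable
  def times(x: Int) = x * factor
}

val m = new Multiplier(2)
sc.parallelize(1 to 10).map(x => m.times(x))    // fails: org.apache.spark.SparkException: Task not serializable

val factor = 2                                  // capture a plain value instead
sc.parallelize(1 to 10).map(_ * factor)         // ships fine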
HANDS-ON
HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it’s started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
HANDS-ON
• Connect to the machine through ssh
• Clone the starter template:
1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git
2. cd mozilla-telemetry-spark && source aws/setup.sh
3. sbt console
• Open http://bit.ly/1wBHHDH
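
Once the console is up, a quick smoke test (assuming the template exposes a SparkContext named sc, as the stock Spark shell does):

sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()   // should return 500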
TUTORIAL