SPARK MEETS TELEMETRY 
Mozlandia 2014 
Roberto Agostino Vitillo, Staff Data Engineer at Mozilla
TELEMETRY PINGS 
• If Telemetry is enabled, a ping is generated for each session 
• Pings are sent to our backend infrastructure as JSON blobs 
• Backend validates and stores pings on S3
TELEMETRY MAP-REDUCE 

import json 

def map(k, d, v, cx): 
    # Parse the ping and emit a count of 1 for its OS
    j = json.loads(v) 
    os = j['info']['OS'] 
    cx.write(os, 1) 

def reduce(k, v, cx): 
    # Sum the per-OS counts
    cx.write(k, sum(v)) 

• Processes pings from S3 using a map-reduce framework written in Python 
• https://github.com/mozilla/telemetry-server
SHORTCOMINGS 
• Not distributed, limited to a single machine 
• Doesn’t support chains of map/reduce ops 
• Doesn’t support SQL-like queries 
• Batch oriented
Spark meets Telemetry
[Figure: Spark officially sets a new record in large-scale sorting; source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html]
WHAT IS SPARK? 
• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop) 
• Comes with over 80 distributed operations for grouping, filtering, etc. 
• Runs standalone or on Hadoop, Mesos and, in the future, TaskCluster (right Jonas?)
WHY DO WE CARE? 
• In-memory caching 
• Interactive command line interface for EDA (think R command line; see the sketch below) 
• Comes with higher-level libraries for machine learning and graph processing 
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS 
• Scala, Python, Clojure and R APIs are available
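
A few lines in the interactive shell are enough to cache a dataset in memory and run distributed filters and aggregations. The following is only a sketch: sc is the SparkContext the shell provides, while sessions.txt and its "os,duration" line format are made up for illustration.

// Hypothetical interactive session; sessions.txt is a stand-in dataset.
val sessions = sc.textFile("sessions.txt")
val parsed   = sessions.map(_.split(","))
                       .map(fields => (fields(0), fields(1).toInt))
parsed.cache()                                  // keep the parsed RDD in memory for reuse
val longOnes = parsed.filter(_._2 > 3600)       // distributed filter
val perOS    = parsed.reduceByKey(_ + _)        // distributed grouping/aggregation
println(longOnes.count())
perOS.collect().foreach(println)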
WHY DO WE REALLY CARE? 
The easier we make it to get answers, the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK? 
• The user creates Resilient Distributed Datasets (RDDs), then transforms and executes them (see the sketch below) 
• RDD operations are compiled to a DAG of operators 
• The DAG is compiled into stages 
• A stage is executed in parallel as a series of tasks
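
A minimal sketch of that workflow (the input path is hypothetical): transformations only build up the DAG, and the final action triggers the stages and tasks described below.

// Transformations are lazy; the DAG is only executed when an action runs.
val lines  = sc.textFile("input")                 // create an RDD (hypothetical path)
val counts = lines.flatMap(_.split(" "))          // transformation
                  .map(word => (word, 1))         // transformation
                  .reduceByKey(_ + _)             // transformation (introduces a shuffle)
val result = counts.collect()                     // action: triggers stage/task execution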
RDD 
A parallel dataset with partitions 
[Figure: a table of observations (rows) and variables Var A, Var B, Var C (columns), split row-wise into partitions]
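
As a quick sketch of the idea, an RDD built from a local collection can be split into an explicit number of partitions:

// Sketch: 100 observations spread across 4 partitions.
val rdd = sc.parallelize(1 to 100, 4)
println(rdd.partitions.length)                                        // 4
println(rdd.mapPartitions(it => Iterator(it.size)).collect().toList)  // elements per partition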
DAG 
Logical graph of RDD operations 
sc.textFile("input") 
.map(line => line.split(",")) 
.map(line => (line(0), line(1).toInt)) 
.reduceByKey(_ + _, 3) 
[Figure: the DAG for the code above — read → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)], with the input split into partitions P1–P4]
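
The lineage behind such a chain can be inspected directly in the shell; toDebugString prints the chain of RDDs and the shuffle that will split it into stages (a sketch, using the same hypothetical input as above).

val counts = sc.textFile("input")
               .map(line => line.split(","))
               .map(line => (line(0), line(1).toInt))
               .reduceByKey(_ + _, 3)
// Prints the lineage: the shuffled RDD at the top, the mapped RDDs and the file-based RDD below it.
println(counts.toDebugString)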
STAGE 
[Figure: the same DAG split into stages — read and the two maps form Stage 1; reduceByKey forms Stage 2, with the stage boundary at the shuffle]
STAGE 
Set of tasks that can run in parallel 
[Figure: Stage 1 in detail — each of the tasks T1–T4 reads one input partition P1–P4, applies the two map operations, and writes shuffle output]
STAGE 
Set of tasks that can run in parallel 
• Tasks are the fundamental unit of work 
• Tasks are serialised and shipped to workers (see the sketch below) 
• Task execution: 
  1. Fetch input 
  2. Execute 
  3. Output result
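
Roughly, each stage launches one task per partition of the RDD it computes, so the partitioning chosen when the data is read (and the number of shuffle partitions) determines the available parallelism. A sketch, with made-up partition counts:

// Stage 1 runs one task per input partition; Stage 2 one task per shuffle partition.
val counts = sc.textFile("input", 4)                 // ask for at least 4 input partitions
               .map(line => line.split(","))
               .map(line => (line(0), line(1).toInt))
               .reduceByKey(_ + _, 3)                 // 3 partitions after the shuffle
println(counts.partitions.length)                     // 3
counts.count()                                        // action: ~4 tasks in stage 1, 3 in stage 2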
HANDS-ON 
1. Visit telemetry-dash.mozilla.org and sign in using Persona. 
2. Click “Launch an ad-hoc analysis worker”. 
3. Upload your SSH public key (this allows you to log in to the server once it’s started up). 
4. Click “Submit”. 
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
HANDS-ON 
• Connect to the machine through SSH 
• Clone the starter template: 
  1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git 
  2. cd mozilla-telemetry-spark && source aws/setup.sh 
  3. sbt console (a first analysis here might look like the sketch below) 
• Open http://bit.ly/1wBHHDH
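
As a purely hypothetical first analysis, mirroring the earlier Python map-reduce job, one could count sessions per OS. The sketch below does not use the starter template’s actual API: it assumes the pings have been saved locally as newline-delimited JSON and pulls the OS field out with a simple regex, just for illustration.

// Hypothetical sketch: "pings.json" and the regex-based field extraction are stand-ins,
// not the mozilla-telemetry-spark API.
val osField = """"OS"\s*:\s*"([^"]+)"""".r
val pings   = sc.textFile("pings.json")
val perOS   = pings.flatMap(line => osField.findFirstMatchIn(line).map(_.group(1)))
                   .map(os => (os, 1))
                   .reduceByKey(_ + _)
perOS.collect().foreach(println)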
TUTORIAL