SlideShare a Scribd company logo
SPARK MEETS TELEMETRY 
Mozlandia 2014 
Roberto Agostino Vitillo
TELEMETRY PINGS
TELEMETRY PINGS 
• If Telemetry is enabled, a ping is generated for 
each session 
• Pings are sent to our backend infrastructure as 
json blobs 
• Backend validates and stores pings on S3
TELEMETRY PINGS
TELEMETRY MAP-REDUCE 
import json 
def map(k, d, v, cx): 
j = json.loads(v) 
os = j['info']['OS'] 
cx.write(os, 1) 
def reduce(k, v, cx): 
cx.write(k, sum(v)) 
• Processes pings from S3 using a map reduce 
framework written in Python 
• https://github.com/mozilla/telemetry-server
SHORTCOMINGS 
• Not distributed, limited to a single machine 
• Doesn’t support chains of map/reduce ops 
• Doesn’t support SQL-like queries 
• Batch oriented
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
WHAT IS SPARK? 
• In-memory data analytics cluster computing 
framework (up to 100x faster than Hadoop) 
• Comes with over 80 distributed operations for 
grouping, filtering etc. 
• Runs standalone or on Hadoop, Mesos and 
TaskCluster in the future (right Jonas?)
WHY DO WE CARE? 
• In memory caching 
• Interactive command line interface for EDA (think R command line) 
• Comes with higher level libraries for machine learning and graph 
processing 
• Works beautifully on a single machine without tedious setup; 
doesn’t depend on Hadoop/HDFS 
• Scala, Python, Clojure and R APIs are available
WHY DO WE REALLY CARE? 
The easier we make it to get answers, 
the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK? 
• User creates Resilient Distributed Datasets (RDDs), 
transforms and executes them 
• RDD operations are compiled to a DAG of 
operators 
• DAG is compiled into stages 
• A stage is executed in parallel as a series of tasks
RDD 
A parallel dataset with partitions 
Var A Var B Var C 
observation 
observation 
observation 
observation 
Partition 
Partition
DAG 
Logical graph of RDD operations 
sc.textFile("input") 
.map(line => line.split(",")) 
.map(line => (line(0), line(1).toInt)) 
.reduceByKey(_ + _, 3) 
RDD[String] RDD[Array[String]] RDD[(String, Int)] 
RDD[(String, Int)] 
map map reduceByKey 
read 
P1 
P2 
P3 
P4
RDD[String] RDD[Array[String]] RDD[(String, Int)] 
RDD[(String, Int)] 
map map reduceByKey 
read 
STAGE 
Stage 1 Stage 2 
P1 
P2 
P3 
P4
Stage 1 
map map 
STAGE 
shuffle 
RDD[String] RDD[Array[String]] RDD[(String, Int)] 
read input output 
read 
map 
map 
shuffle 
P1 
P2 
P3 
P4 
T1 
T2 
T3 
T4 
Set of tasks that can run in parallel 
Stage 1
STAGE 
Set of tasks that can run in parallel 
Stage 1 Stage 2
STAGE 
Set of tasks that can run in parallel 
• Tasks are the fundamental unit of work 
• Tasks are serialised and shipped to workers 
• Task execution 
1. Fetch input 
2. Execute 
3. Output result 
task 1 
task 2 
task 3 
task 4
HANDS-ON
HANDS-ON 
1. Visit telemetry-dash.mozilla.org and sign in using Persona. 
2. Click “Launch an ad-hoc analysis worker”. 
3. Upload your SSH public key (this allows you to log in to the 
server once it’s started up). 
4. Click “Submit” 
5. A Ubuntu machine will be started up on Amazon’s EC2 
infrastructure.
HANDS-ON 
• Connect to the machine through ssh 
• Clone the starter template: 
1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git 
2. cd mozilla-telemetry-spark && source aws/setup.sh 
3. sbt console 
• Open http://bit.ly/1wBHHDH
TUTORIAL
Spark meets Telemetry

More Related Content

What's hot

Where in the world is Franz Kafka? | Will LaForest, Confluent
Where in the world is Franz Kafka? | Will LaForest, ConfluentWhere in the world is Franz Kafka? | Will LaForest, Confluent
Where in the world is Franz Kafka? | Will LaForest, Confluent
HostedbyConfluent
 
S2
S2S2
CArcMOOC 03.04 - Gate-level design
CArcMOOC 03.04 - Gate-level designCArcMOOC 03.04 - Gate-level design
CArcMOOC 03.04 - Gate-level design
Alessandro Bogliolo
 
Pgrouting_foss4guk_ross_mcdonald
Pgrouting_foss4guk_ross_mcdonaldPgrouting_foss4guk_ross_mcdonald
Pgrouting_foss4guk_ross_mcdonald
Ross McDonald
 
MapReduce with Hadoop
MapReduce with HadoopMapReduce with Hadoop
MapReduce with Hadoop
Vitalie Scurtu
 
A parallel gpu version of the traveling salesman problem slides
A parallel gpu version of the traveling salesman problem slidesA parallel gpu version of the traveling salesman problem slides
A parallel gpu version of the traveling salesman problem slides
Vimukthi Wickramasinghe
 
Doom in SpaceX
Doom in SpaceXDoom in SpaceX
Doom in SpaceX
Martin Dvorak
 
Flink meetup
Flink meetupFlink meetup
Flink meetup
Frank McSherry
 
scikit-cuda
scikit-cudascikit-cuda
scikit-cuda
Lev Givon
 
3.5 equivalence of pushdown automata and cfl
3.5 equivalence of pushdown automata and cfl3.5 equivalence of pushdown automata and cfl
3.5 equivalence of pushdown automata and cfl
Sampath Kumar S
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
Martin Dvorak
 
Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0
Simon Elliston Ball
 
Three Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataThree Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big Data
Dynamical Software, Inc.
 
Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器
vito jeng
 
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Skills Matter
 
Open Source Routing Machine - FOSS4G 2016 Bonn
Open Source Routing Machine - FOSS4G 2016 BonnOpen Source Routing Machine - FOSS4G 2016 Bonn
Open Source Routing Machine - FOSS4G 2016 Bonn
Johan
 
Graph database
Graph databaseGraph database
Graph database
Achintya Kumar
 
State of OSRM - SOTM 2016
State of OSRM - SOTM 2016State of OSRM - SOTM 2016
State of OSRM - SOTM 2016
Johan
 
Nips2016 mlgkernel
Nips2016 mlgkernelNips2016 mlgkernel
Nips2016 mlgkernel
Daigo HIROOKA
 
Automating AWS Infrastructure Provisioning Using Concourse and Terraform
Automating AWS Infrastructure Provisioning Using Concourse and TerraformAutomating AWS Infrastructure Provisioning Using Concourse and Terraform
Automating AWS Infrastructure Provisioning Using Concourse and Terraform
Cesar Rodriguez
 

What's hot (20)

Where in the world is Franz Kafka? | Will LaForest, Confluent
Where in the world is Franz Kafka? | Will LaForest, ConfluentWhere in the world is Franz Kafka? | Will LaForest, Confluent
Where in the world is Franz Kafka? | Will LaForest, Confluent
 
S2
S2S2
S2
 
CArcMOOC 03.04 - Gate-level design
CArcMOOC 03.04 - Gate-level designCArcMOOC 03.04 - Gate-level design
CArcMOOC 03.04 - Gate-level design
 
Pgrouting_foss4guk_ross_mcdonald
Pgrouting_foss4guk_ross_mcdonaldPgrouting_foss4guk_ross_mcdonald
Pgrouting_foss4guk_ross_mcdonald
 
MapReduce with Hadoop
MapReduce with HadoopMapReduce with Hadoop
MapReduce with Hadoop
 
A parallel gpu version of the traveling salesman problem slides
A parallel gpu version of the traveling salesman problem slidesA parallel gpu version of the traveling salesman problem slides
A parallel gpu version of the traveling salesman problem slides
 
Doom in SpaceX
Doom in SpaceXDoom in SpaceX
Doom in SpaceX
 
Flink meetup
Flink meetupFlink meetup
Flink meetup
 
scikit-cuda
scikit-cudascikit-cuda
scikit-cuda
 
3.5 equivalence of pushdown automata and cfl
3.5 equivalence of pushdown automata and cfl3.5 equivalence of pushdown automata and cfl
3.5 equivalence of pushdown automata and cfl
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0
 
Three Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataThree Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big Data
 
Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器Quill - 一個 Scala 的資料庫存取利器
Quill - 一個 Scala 的資料庫存取利器
 
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
 
Open Source Routing Machine - FOSS4G 2016 Bonn
Open Source Routing Machine - FOSS4G 2016 BonnOpen Source Routing Machine - FOSS4G 2016 Bonn
Open Source Routing Machine - FOSS4G 2016 Bonn
 
Graph database
Graph databaseGraph database
Graph database
 
State of OSRM - SOTM 2016
State of OSRM - SOTM 2016State of OSRM - SOTM 2016
State of OSRM - SOTM 2016
 
Nips2016 mlgkernel
Nips2016 mlgkernelNips2016 mlgkernel
Nips2016 mlgkernel
 
Automating AWS Infrastructure Provisioning Using Concourse and Terraform
Automating AWS Infrastructure Provisioning Using Concourse and TerraformAutomating AWS Infrastructure Provisioning Using Concourse and Terraform
Automating AWS Infrastructure Provisioning Using Concourse and Terraform
 

Similar to Spark meets Telemetry

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
RBootcam Day 2
RBootcam Day 2RBootcam Day 2
RBootcam Day 2
Olga Scrivner
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Deathstar
DeathstarDeathstar
Deathstar
armstrtw
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
Vibrant Technologies & Computers
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop Presentation
Norberto Leite
 
Shrug2017 arcpy data_and_you
Shrug2017 arcpy data_and_youShrug2017 arcpy data_and_you
Shrug2017 arcpy data_and_you
SHRUG GIS
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
Javier Arrieta
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
Prakash Chockalingam
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
tirlukachaitanya
 
State of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open SourceState of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open Source
OSCON Byrum
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
NVIDIA Japan
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
PingCAP
 

Similar to Spark meets Telemetry (20)

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
RBootcam Day 2
RBootcam Day 2RBootcam Day 2
RBootcam Day 2
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Deathstar
DeathstarDeathstar
Deathstar
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
Barcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop PresentationBarcelona MUG MongoDB + Hadoop Presentation
Barcelona MUG MongoDB + Hadoop Presentation
 
Shrug2017 arcpy data_and_you
Shrug2017 arcpy data_and_youShrug2017 arcpy data_and_you
Shrug2017 arcpy data_and_you
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
 
State of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open SourceState of the Art Web Mapping with Open Source
State of the Art Web Mapping with Open Source
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
 

More from Roberto Agostino Vitillo

Telemetry Onboarding
Telemetry OnboardingTelemetry Onboarding
Telemetry Onboarding
Roberto Agostino Vitillo
 
Growing a Data Pipeline for Analytics
Growing a Data Pipeline for AnalyticsGrowing a Data Pipeline for Analytics
Growing a Data Pipeline for Analytics
Roberto Agostino Vitillo
 
Telemetry Datasets
Telemetry DatasetsTelemetry Datasets
Telemetry Datasets
Roberto Agostino Vitillo
 
Growing a SQL Query
Growing a SQL QueryGrowing a SQL Query
Growing a SQL Query
Roberto Agostino Vitillo
 
Telemetry Onboarding
Telemetry OnboardingTelemetry Onboarding
Telemetry Onboarding
Roberto Agostino Vitillo
 
All you need to know about Statistics
All you need to know about StatisticsAll you need to know about Statistics
All you need to know about Statistics
Roberto Agostino Vitillo
 
Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
Roberto Agostino Vitillo
 
Sharing C++ objects in Linux
Sharing C++ objects in LinuxSharing C++ objects in Linux
Sharing C++ objects in Linux
Roberto Agostino Vitillo
 
Performance tools developments
Performance tools developmentsPerformance tools developments
Performance tools developments
Roberto Agostino Vitillo
 
Exploiting vectorization with ISPC
Exploiting vectorization with ISPCExploiting vectorization with ISPC
Exploiting vectorization with ISPC
Roberto Agostino Vitillo
 
GOoDA tutorial
GOoDA tutorialGOoDA tutorial
GOoDA tutorial
Roberto Agostino Vitillo
 
Callgraph analysis
Callgraph analysisCallgraph analysis
Callgraph analysis
Roberto Agostino Vitillo
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
Roberto Agostino Vitillo
 
Inter-process communication on steroids
Inter-process communication on steroidsInter-process communication on steroids
Inter-process communication on steroids
Roberto Agostino Vitillo
 

More from Roberto Agostino Vitillo (14)

Telemetry Onboarding
Telemetry OnboardingTelemetry Onboarding
Telemetry Onboarding
 
Growing a Data Pipeline for Analytics
Growing a Data Pipeline for AnalyticsGrowing a Data Pipeline for Analytics
Growing a Data Pipeline for Analytics
 
Telemetry Datasets
Telemetry DatasetsTelemetry Datasets
Telemetry Datasets
 
Growing a SQL Query
Growing a SQL QueryGrowing a SQL Query
Growing a SQL Query
 
Telemetry Onboarding
Telemetry OnboardingTelemetry Onboarding
Telemetry Onboarding
 
All you need to know about Statistics
All you need to know about StatisticsAll you need to know about Statistics
All you need to know about Statistics
 
Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
 
Sharing C++ objects in Linux
Sharing C++ objects in LinuxSharing C++ objects in Linux
Sharing C++ objects in Linux
 
Performance tools developments
Performance tools developmentsPerformance tools developments
Performance tools developments
 
Exploiting vectorization with ISPC
Exploiting vectorization with ISPCExploiting vectorization with ISPC
Exploiting vectorization with ISPC
 
GOoDA tutorial
GOoDA tutorialGOoDA tutorial
GOoDA tutorial
 
Callgraph analysis
Callgraph analysisCallgraph analysis
Callgraph analysis
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Inter-process communication on steroids
Inter-process communication on steroidsInter-process communication on steroids
Inter-process communication on steroids
 

Recently uploaded

Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 

Recently uploaded (20)

Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 

Spark meets Telemetry

  • 1. SPARK MEETS TELEMETRY Mozlandia 2014 Roberto Agostino Vitillo
  • 3. TELEMETRY PINGS • If Telemetry is enabled, a ping is generated for each session • Pings are sent to our backend infrastructure as json blobs • Backend validates and stores pings on S3
  • 5. TELEMETRY MAP-REDUCE import json def map(k, d, v, cx): j = json.loads(v) os = j['info']['OS'] cx.write(os, 1) def reduce(k, v, cx): cx.write(k, sum(v)) • Processes pings from S3 using a map reduce framework written in Python • https://github.com/mozilla/telemetry-server
  • 6. SHORTCOMINGS • Not distributed, limited to a single machine • Doesn’t support chains of map/reduce ops • Doesn’t support SQL-like queries • Batch oriented
  • 7.
  • 9. WHAT IS SPARK? • In-memory data analytics cluster computing framework (up to 100x faster than Hadoop) • Comes with over 80 distributed operations for grouping, filtering etc. • Runs standalone or on Hadoop, Mesos and TaskCluster in the future (right Jonas?)
  • 10. WHY DO WE CARE? • In memory caching • Interactive command line interface for EDA (think R command line) • Comes with higher level libraries for machine learning and graph processing • Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS • Scala, Python, Clojure and R APIs are available
  • 11. WHY DO WE REALLY CARE? The easier we make it to get answers, the more questions we will ask
  • 13. HOW DOES IT WORK? • User creates Resilient Distributed Datasets (RDDs), transforms and executes them • RDD operations are compiled to a DAG of operators • DAG is compiled into stages • A stage is executed in parallel as a series of tasks
  • 14. RDD A parallel dataset with partitions Var A Var B Var C observation observation observation observation Partition Partition
  • 15. DAG Logical graph of RDD operations sc.textFile("input") .map(line => line.split(",")) .map(line => (line(0), line(1).toInt)) .reduceByKey(_ + _, 3) RDD[String] RDD[Array[String]] RDD[(String, Int)] RDD[(String, Int)] map map reduceByKey read P1 P2 P3 P4
  • 16. RDD[String] RDD[Array[String]] RDD[(String, Int)] RDD[(String, Int)] map map reduceByKey read STAGE Stage 1 Stage 2 P1 P2 P3 P4
  • 17. Stage 1 map map STAGE shuffle RDD[String] RDD[Array[String]] RDD[(String, Int)] read input output read map map shuffle P1 P2 P3 P4 T1 T2 T3 T4 Set of tasks that can run in parallel Stage 1
  • 18. STAGE Set of tasks that can run in parallel Stage 1 Stage 2
  • 19. STAGE Set of tasks that can run in parallel • Tasks are the fundamental unit of work • Tasks are serialised and shipped to workers • Task execution 1. Fetch input 2. Execute 3. Output result task 1 task 2 task 3 task 4
  • 21. HANDS-ON 1. Visit telemetry-dash.mozilla.org and sign in using Persona. 2. Click “Launch an ad-hoc analysis worker”. 3. Upload your SSH public key (this allows you to log in to the server once it’s started up). 4. Click “Submit” 5. A Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
  • 22. HANDS-ON • Connect to the machine through ssh • Clone the starter template: 1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git 2. cd mozilla-telemetry-spark && source aws/setup.sh 3. sbt console • Open http://bit.ly/1wBHHDH