Spark meets Telemetry

  1. SPARK MEETS TELEMETRY (Mozlandia 2014, Roberto Agostino Vitillo)
  2. TELEMETRY PINGS
  3. TELEMETRY PINGS • If Telemetry is enabled, a ping is generated for each session • Pings are sent to our backend infrastructure as JSON blobs • The backend validates the pings and stores them on S3
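
     To make the shape concrete, a minimal sketch of such a JSON blob. Only the
     info.OS field is attested here (the map-reduce example on slide 5 reads it);
     appName and reason are hypothetical placeholders.

        {
          "info": {
            "OS": "Linux",
            "appName": "Firefox",
            "reason": "saved-session"
          }
        }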
  4. TELEMETRY PINGS
  5. TELEMETRY MAP-REDUCE • Processes pings from S3 using a map-reduce framework written in Python • https://github.com/mozilla/telemetry-server

        import json

        # Map phase: called once per ping; emits (OS, 1) for each session.
        def map(k, d, v, cx):
            j = json.loads(v)
            os = j['info']['OS']
            cx.write(os, 1)

        # Reduce phase: sums the per-OS counts.
        def reduce(k, v, cx):
            cx.write(k, sum(v))
  6. SHORTCOMINGS • Not distributed: limited to a single machine • Doesn’t support chains of map/reduce ops • Doesn’t support SQL-like queries • Batch oriented
  7. (figure: Spark officially sets a new record in large-scale sorting) source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  8. WHAT IS SPARK? • In-memory data analytics cluster computing framework (up to 100x faster than Hadoop) • Comes with over 80 distributed operations for grouping, filtering, etc. • Runs standalone or on Hadoop and Mesos, and on TaskCluster in the future (right, Jonas?)
  9. WHY DO WE CARE? • In-memory caching • Interactive command line interface for EDA (think of the R command line) • Comes with higher-level libraries for machine learning and graph processing • Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS • Scala, Python, Clojure and R APIs are available
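
     As a sketch of how little setup a single-machine session needs (the master
     URL and app name below are arbitrary choices, not taken from the talk):

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on older Spark

        // Use all local cores; no cluster or HDFS required.
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("telemetry-sketch"))

        // A toy aggregation, analogous to counting pings per OS.
        val counts = sc.parallelize(Seq("Linux", "Darwin", "Linux"))
          .map(os => (os, 1))
          .reduceByKey(_ + _)
          .collect()  // Array((Darwin,1), (Linux,2)), in some order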
  10. WHY DO WE REALLY CARE? • The easier we make it to get answers, the more questions we will ask
  11. MASHUP DEMO
  12. HOW DOES IT WORK? • The user creates Resilient Distributed Datasets (RDDs), transforms them, and executes them • RDD operations are compiled to a DAG of operators • The DAG is compiled into stages • A stage is executed in parallel as a series of tasks
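
     A minimal illustration of that pipeline, assuming a SparkContext sc as in
     the earlier sketch: transformations only extend the DAG, and nothing runs
     until an action is called.

        // Transformations are lazy: these lines only build the DAG of operators.
        val lines  = sc.textFile("input")                  // RDD[String]
        val pairs  = lines.map(l => (l.split(",")(0), 1))  // RDD[(String, Int)]
        val counts = pairs.reduceByKey(_ + _)              // RDD[(String, Int)]

        // An action compiles the DAG into stages and runs them as parallel tasks.
        counts.count()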
  13. RDD • A parallel dataset split into partitions (figure: a table of observations over variables A, B, and C, divided row-wise into partitions)
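
     Partitioning can be observed directly; a small sketch, again assuming a
     SparkContext sc:

        // Spread a local collection over 4 partitions.
        val rdd = sc.parallelize(1 to 100, numSlices = 4)
        println(rdd.partitions.size)  // 4: the dataset is split into 4 chunks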
  14. DAG • Logical graph of RDD operations:

        sc.textFile("input")
          .map(line => line.split(","))
          .map(line => (line(0), line(1).toInt))
          .reduceByKey(_ + _, 3)

     (figure: the lineage read → map → map → reduceByKey over partitions P1-P4, producing RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)])
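
     The DAG behind such a chain can be inspected from the shell: toDebugString
     prints the lineage of RDDs behind the final result, making the shuffle that
     reduceByKey introduces visible. A sketch, assuming a SparkContext sc:

        val counts = sc.textFile("input")
          .map(line => line.split(","))
          .map(line => (line(0), line(1).toInt))
          .reduceByKey(_ + _, 3)

        // Prints the chain of RDDs behind `counts`, one entry per operator.
        println(counts.toDebugString)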
  15. STAGE (figure: the same DAG cut into stages: Stage 1 covers read, map, map; Stage 2 covers reduceByKey)
  16. STAGE • A set of tasks that can run in parallel (figure: Stage 1 runs tasks T1-T4, one per partition P1-P4; each task reads its input, applies both map operators, and writes shuffle output)
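
     A stage’s parallelism follows from its partition count: one task per
     partition. In the running example, reduceByKey was given 3 output
     partitions, so its stage would run 3 tasks. A sketch, assuming a
     SparkContext sc:

        // 3 output partitions => 3 parallel tasks in the reduce stage.
        val reduced = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .reduceByKey(_ + _, 3)
        println(reduced.partitions.size)  // 3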
  17. STAGE • A set of tasks that can run in parallel (figure: Stage 1 and Stage 2, with the shuffle between them)
  18. STAGE • A set of tasks that can run in parallel • Tasks are the fundamental unit of work • Tasks are serialised and shipped to workers • Task execution: 1. fetch input, 2. execute, 3. output result
  19. HANDS-ON
  20. HANDS-ON 1. Visit telemetry-dash.mozilla.org and sign in using Persona. 2. Click “Launch an ad-hoc analysis worker”. 3. Upload your SSH public key (this allows you to log in to the server once it’s started up). 4. Click “Submit”. 5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
  21. HANDS-ON • Connect to the machine through SSH • Clone the starter template: 1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git 2. cd mozilla-telemetry-spark && source aws/setup.sh 3. sbt console • Open http://bit.ly/1wBHHDH
  22. TUTORIAL
