Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
FIRE IN THE SKY
AN INTRODUCTION TO MONITORING APACHE SPARK
IN THE CLOUD
Michael McCune - msm@redhat.com
1
WHY MONITOR SPARK?
2
WHAT ABOUT LOGS?
spark-submit --master=spark://cluster:7077 app.py
17/10/13 13:33:03 INFO BlockManager: BlockManager stopp...
LET'S TALK METRICS
4
BASIC CONFIGURATION
metrics.properties
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm...
METRICS SERVLET
GET /metrics/master/json/
{
"gauges": {
"jvm.PS-MarkSweep.count": {
"value": 1
},
"jvm.PS-MarkSweep.time":...
METRICS SERVLET
GET /metrics/applications/json/
GET /metrics/json/
7
CONSOLE SINK
metrics.properties
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=5
*.s...
CONSOLE SINK
10/12/17 4:59:54 PM ======================================================
-- Gauges ------------------------...
SLF4J SINK
metrics.properties
*.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink
*.sink.slf4j.period=5
*.sink.slf4...
SLF4J SINK
17/10/14 11:47:22 INFO Master: I have been elected leader! New state: ALIVE
17/10/14 11:47:25 INFO metrics: typ...
CSV SINK
metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
*.sink.csv.unit=sec...
CSV SINK
[mike@opb-ultra] /tmp/spark/csv/master
$ ls
application.PythonPi.1507910958999.cores.csv
application.PythonPi.150...
CSV SINK
jvm.PS-MarkSweep.count.csv
t,value
1507910949,1
1507910950,1
1507910951,1
1507910952,1
1507910953,1
1507910954,1
...
JMX SINK
metrics.properties
spark-defaults.conf
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
spark.driver.extraJ...
JMX SINK
jconsole
16
JMX SINK
Jolokia Java Agent
spark-class -javaagent:jolokia-jvm-agent.jar=port=19150,host=127.0.0.1
17
JMX SINK
GET /jolokia/
{
"request": {
"type": "version"
},
"status": 200,
"timestamp": 1507946690,
"value": {
"agent": "1....
JMX SINK
GET /jolokia/read/metrics:name=jvm.PS-MarkSweep.count
{
"request": {
"mbean": "metrics:name=jvm.PS-MarkSweep.coun...
GRAPHITE SINK
metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=127...
GRAPHITE SINK
21
GRAPHITE SINK
/render?target=jvm.heap.usage&from=-3minutes
22
GRAPHITE SINK
/metrics/index.json
[
"jvm.PS-MarkSweep.count",
"jvm.PS-MarkSweep.time",
"jvm.PS-Scavenge.count",
"jvm.PS-Sc...
GRAPHITE SINK
/render?target=jvm.PS-MarkSweep.count&format=json
[
{
"datapoints": [
[
null,
1507947840
],
[
1.0,
150794790...
MUCH MORE TO DISCOVER
app-*-*.driver.BlockManager.memory.*
app-*-*.driver.CodeGenerator.compilationTime.*
app-*-*.driver.C...
26
WHAT IS CLOUD NATIVE?
Containerized
Dynamically orchestrated
Microservice oriented
cncf.io/about/faq
27
TRADITIONAL ORCHESTRATION
28 . 1
TRADITIONAL ORCHESTRATION
28 . 2
CLOUD NATIVE ORCHESTRATION
29 . 1
CLOUD NATIVE ORCHESTRATION
29 . 2
CLOUD MONITORING
30
SCRAPING
31 . 1
SCRAPING
rhuss/jolokia
prometheus/jmx_exporter
fabric8io/agent-bond
31 . 2
BROADCAST
32 . 1
BROADCAST
erikerlandson/spark-kafka-sink
32 . 2
HYBRID
33 . 1
HYBRID
33 . 2
HYBRID
radanalyticsio/scorpion-stare
33 . 3
CHALLENGES
34 . 1
CHALLENGES
34 . 2
CHALLENGES
34 . 3
CHALLENGES
34 . 4
WHAT WILL YOU BUILD?
35
THANK YOU
STAY IN TOUCH
@FOSSjunkie
CHECK OUT OUR PROJECTS
https://radanalytics.io
36
Upcoming SlideShare
Loading in …5
×

Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with Michael McCune

811 views

Published on

Writing intelligent cloud native applications is hard enough when things go well, but what happens when there are performance and debugging issues that arise during production? Inspecting the logs is a good start, but what if the logs don’t show the whole picture? Now you have to go deeper, examining the live performance metrics that are generated by Spark, or even deploying specialized microservices to monitor and act upon that data. Spark provides several built-in sinks for exposing metrics data about the internal state of its executors and drivers, but getting at that information when your cluster is in the cloud can be a time consuming and arduous process. In this presentation, Michael McCune will walk through the options available for gaining access to the metrics data even when a Spark cluster lives in a cloud native containerized environment. Attendees will see demonstrations of techniques that will help them to integrate a full-fledged metrics story into their deployments. Michael will also discuss the pain points and challenges around publishing this data outside of the cloud and explain how to overcome them. In this talk you will learn about: Deploying metrics sinks as microservices, Common configuration options, and Accessing metrics data through a variety of mechanisms.

Published in: Data & Analytics
  • Be the first to comment

Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud with Michael McCune

  1. 1. FIRE IN THE SKY AN INTRODUCTION TO MONITORING APACHE SPARK IN THE CLOUD Michael McCune - msm@redhat.com 1
  2. 2. WHY MONITOR SPARK? 2
  3. 3. WHAT ABOUT LOGS? spark-submit --master=spark://cluster:7077 app.py 17/10/13 13:33:03 INFO BlockManager: BlockManager stopped 17/10/13 13:33:03 INFO BlockManagerMaster: BlockManagerMaster stopped 17/10/13 13:33:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/10/13 13:33:03 INFO SparkContext: Successfully stopped SparkContext 10.0.1.113 - - [13/Oct/2017 13:33:04] "GET /sparkpi?scale=1000 HTTP/1.1" 200 - Traceback (most recent call last): File "/home/mike/workspace/github/radanalyticsio/tutorial-sparkpi-python-flask/app.py", line 41, in app.run(host='0.0.0.0', port=port) File "/home/mike/.venvs/sparkpi/lib/python2.7/site-packages/flask/app.py", line 841, in run run_simple(host, port, self, **options) File "/home/mike/.venvs/sparkpi/lib/python2.7/site-packages/werkzeug/serving.py", line 739, in run_simple inner() File "/home/mike/.venvs/sparkpi/lib/python2.7/site-packages/werkzeug/serving.py", line 702, in inner srv.serve_forever() File "/home/mike/.venvs/sparkpi/lib/python2.7/site-packages/werkzeug/serving.py", line 539, in serve_forever HTTPServer.serve_forever(self) File "/usr/lib64/python2.7/SocketServer.py", line 231, in serve_forever poll_interval) File "/usr/lib64/python2.7/SocketServer.py", line 150, in _eintr_retry return func(*args) File "/home/mike/opt/other-spark/python/lib/pyspark.zip/pyspark/context.py", line 236, in signal_handler File "/home/mike/opt/other-spark/python/lib/pyspark.zip/pyspark/context.py", line 962, in cancelAllJobs AttributeError: 'NoneType' object has no attribute 'sc' 17/10/13 13:37:30 INFO ShutdownHookManager: Shutdown hook called 17/10/13 13:37:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-c9989888-5190-4d70-9d20-5960ede510e4/pysp 17/10/13 13:37:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-c9989888-5190-4d70-9d20-5960ede510e4 3
  4. 4. LET'S TALK METRICS 4
  5. 5. BASIC CONFIGURATION metrics.properties master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource 5
  6. 6. METRICS SERVLET GET /metrics/master/json/ { "gauges": { "jvm.PS-MarkSweep.count": { "value": 1 }, "jvm.PS-MarkSweep.time": { "value": 21 }, "jvm.PS-Scavenge.count": { "value": 3 }, "jvm.PS-Scavenge.time": { "value": 18 }, "jvm.heap.committed": { "value": 235405312 6
  7. 7. METRICS SERVLET GET /metrics/applications/json/ GET /metrics/json/ 7
  8. 8. CONSOLE SINK metrics.properties *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink *.sink.console.period=5 *.sink.console.unit=seconds 8
  9. 9. CONSOLE SINK 10/12/17 4:59:54 PM ====================================================== -- Gauges ---------------------------------------------------------------- jvm.PS-MarkSweep.count value = 1 jvm.PS-MarkSweep.time value = 23 jvm.PS-Scavenge.count value = 4 jvm.PS-Scavenge.time value = 31 jvm.heap.committed value = 198705152 jvm.heap.init value = 253755392 jvm.heap.max value = 954728448 jvm.heap.usage 9
  10. 10. SLF4J SINK metrics.properties *.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink *.sink.slf4j.period=5 *.sink.slf4j.unit=seconds 10
  11. 11. SLF4J SINK 17/10/14 11:47:22 INFO Master: I have been elected leader! New state: ALIVE 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.PS-MarkSweep.count, value=1 17/10/14 11:47:25 INFO metrics: type=COUNTER, name=HiveExternalCatalog.fileCacheHits, count=0 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.PS-MarkSweep.time, value=20 17/10/14 11:47:25 INFO metrics: type=COUNTER, name=HiveExternalCatalog.filesDiscovered, count=0 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.PS-Scavenge.count, value=3 17/10/14 11:47:25 INFO metrics: type=COUNTER, name=HiveExternalCatalog.hiveClientCalls, count=0 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.PS-Scavenge.time, value=17 17/10/14 11:47:25 INFO metrics: type=COUNTER, name=HiveExternalCatalog.parallelListingJobCount, count=0 17/10/14 11:47:25 INFO metrics: type=COUNTER, name=HiveExternalCatalog.partitionsFetched, count=0 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.heap.committed, value=171442176 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.heap.init, value=253755392 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.heap.max, value=954728448 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.heap.usage, value=0.04117022393366454 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.heap.used, value=39306384 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.non-heap.committed, value=34996224 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.non-heap.init, value=2555904 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.non-heap.max, value=-1 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.non-heap.usage, value=-3.4380496E7 17/10/14 11:47:25 INFO metrics: type=HISTOGRAM, name=CodeGenerator.compilationTime, count=0, min=0, max 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.non-heap.used, value=34381136 17/10/14 11:47:25 INFO metrics: type=HISTOGRAM, name=CodeGenerator.generatedClassSize, count=0, min=0, 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.pools.Code-Cache.committed, value=5111808 17/10/14 11:47:25 INFO metrics: type=HISTOGRAM, name=CodeGenerator.generatedMethodSize, count=0, min=0, 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.pools.Code-Cache.init, value=2555904 17/10/14 11:47:25 INFO metrics: type=HISTOGRAM, name=CodeGenerator.sourceCodeSize, count=0, min=0, max= 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.pools.Code-Cache.max, value=251658240 17/10/14 11:47:25 INFO metrics: type=GAUGE, name=jvm.pools.Code-Cache.usage, value=0.019977823893229166 11
  12. 12. CSV SINK metrics.properties *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink *.sink.csv.period=1 *.sink.csv.unit=seconds *.sink.csv.directory=/tmp/spark/csv/master/ 12
  13. 13. CSV SINK [mike@opb-ultra] /tmp/spark/csv/master $ ls application.PythonPi.1507910958999.cores.csv application.PythonPi.1507910958999.runtime_ms.csv application.PythonPi.1507910958999.status.csv CodeGenerator.compilationTime.csv CodeGenerator.generatedClassSize.csv CodeGenerator.generatedMethodSize.csv CodeGenerator.sourceCodeSize.csv HiveExternalCatalog.fileCacheHits.csv HiveExternalCatalog.filesDiscovered.csv HiveExternalCatalog.hiveClientCalls.csv HiveExternalCatalog.parallelListingJobCount.csv HiveExternalCatalog.partitionsFetched.csv jvm.heap.committed.csv jvm.heap.init.csv jvm.heap.max.csv jvm.heap.usage.csv jvm.heap.used.csv jvm.non-heap.committed.csv 13
  14. 14. CSV SINK jvm.PS-MarkSweep.count.csv t,value 1507910949,1 1507910950,1 1507910951,1 1507910952,1 1507910953,1 1507910954,1 1507910955,1 1507910956,1 1507910957,1 1507910958,1 1507910959,1 1507910960,1 1507910961,1 1507910962,1 1507910963,1 1507910964,1 14
  15. 15. JMX SINK metrics.properties spark-defaults.conf *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink spark.driver.extraJavaOptions -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=19150 -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.rmi.port=19151 15
  16. 16. JMX SINK jconsole 16
  17. 17. JMX SINK Jolokia Java Agent spark-class -javaagent:jolokia-jvm-agent.jar=port=19150,host=127.0.0.1 17
  18. 18. JMX SINK GET /jolokia/ { "request": { "type": "version" }, "status": 200, "timestamp": 1507946690, "value": { "agent": "1.3.7", "config": { "agentContext": "/jolokia", "agentId": "10.0.1.113-7868-4926097b-jvm", "agentType": "jvm", "debug": "false", "debugMaxEntries": "100", "discoveryEnabled": "true", "historyMaxEntries": "10", "maxCollectionSize": "0", "maxDepth": "15", "maxObjects": "0" }, "info": { "product": "jetty", "vendor": "Mortbay", "version": "6.1.26" }, "protocol": "7.2" } } 18
  19. 19. JMX SINK GET /jolokia/read/metrics:name=jvm.PS-MarkSweep.count { "request": { "mbean": "metrics:name=jvm.PS-MarkSweep.count", "type": "read" }, "status": 200, "timestamp": 1507946213, "value": { "Value": 1 } } 19
  20. 20. GRAPHITE SINK metrics.properties *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=127.0.0.1 *.sink.graphite.port=2003 *.sink.graphite.period=1 *.sink.graphite.unit=seconds 20
  21. 21. GRAPHITE SINK 21
  22. 22. GRAPHITE SINK /render?target=jvm.heap.usage&from=-3minutes 22
  23. 23. GRAPHITE SINK /metrics/index.json [ "jvm.PS-MarkSweep.count", "jvm.PS-MarkSweep.time", "jvm.PS-Scavenge.count", "jvm.PS-Scavenge.time", "jvm.heap.committed", "jvm.heap.init", "jvm.heap.max", "jvm.heap.usage", "jvm.heap.used", "jvm.non-heap.committed", "jvm.non-heap.init", "jvm.non-heap.max", "jvm.non-heap.usage", "jvm.non-heap.used", "jvm.pools.Code-Cache.committed", 23
  24. 24. GRAPHITE SINK /render?target=jvm.PS-MarkSweep.count&format=json [ { "datapoints": [ [ null, 1507947840 ], [ 1.0, 1507947900 ], [ 1.0, 1507947960 ] ], "target": "jvm.PS-MarkSweep.count" 24
  25. 25. MUCH MORE TO DISCOVER app-*-*.driver.BlockManager.memory.* app-*-*.driver.CodeGenerator.compilationTime.* app-*-*.driver.CodeGenerator.generatedClassSize.* app-*-*.driver.CodeGenerator.generatedMethodSize.* app-*-*.driver.CodeGenerator.sourceCodeSize.* app-*-*.driver.DAGScheduler.job.* app-*-*.driver.DAGScheduler.messageProcessingTime.* app-*-*.driver.DAGScheduler.stage.* application.PythonPi.*.cores application.PythonPi.*.runtime_ms jvm.total.committed jvm.total.init jvm.total.max jvm.total.used worker.coresFree worker.coresUsed worker.executors worker.memFree_MB 25
  26. 26. 26
  27. 27. WHAT IS CLOUD NATIVE? Containerized Dynamically orchestrated Microservice oriented cncf.io/about/faq 27
  28. 28. TRADITIONAL ORCHESTRATION 28 . 1
  29. 29. TRADITIONAL ORCHESTRATION 28 . 2
  30. 30. CLOUD NATIVE ORCHESTRATION 29 . 1
  31. 31. CLOUD NATIVE ORCHESTRATION 29 . 2
  32. 32. CLOUD MONITORING 30
  33. 33. SCRAPING 31 . 1
  34. 34. SCRAPING rhuss/jolokia prometheus/jmx_exporter fabric8io/agent-bond 31 . 2
  35. 35. BROADCAST 32 . 1
  36. 36. BROADCAST erikerlandson/spark-kafka-sink 32 . 2
  37. 37. HYBRID 33 . 1
  38. 38. HYBRID 33 . 2
  39. 39. HYBRID radanalyticsio/scorpion-stare 33 . 3
  40. 40. CHALLENGES 34 . 1
  41. 41. CHALLENGES 34 . 2
  42. 42. CHALLENGES 34 . 3
  43. 43. CHALLENGES 34 . 4
  44. 44. WHAT WILL YOU BUILD? 35
  45. 45. THANK YOU STAY IN TOUCH @FOSSjunkie CHECK OUT OUR PROJECTS https://radanalytics.io 36

×