CIRCUIT 2015 - Monitoring AEM

Mike Chan - ICF Interactive
This session discusses availability and performance monitoring for AEM.

Published in: Technology

  1. CIRCUIT – An Adobe Developer Event Presented by ICF Interactive
     Monitoring AEM - Going above and beyond CPU, Disk, and Memory
     Michael Chan, ICFI Interactive
  2. Introduction
     Who Am I
     •  Michael Chan, Systems Engineer & Architect for ICFI Interactive Managed Services
     •  Former Java & C developer
     •  With past experience in
        –  Unix security
        –  Network monitoring
        –  Systems (network, storage, server) integration
        –  Ecommerce
     •  Primary responsibilities at ICFI (among others)
        –  Build out systems infrastructure, including systems automation, logging, and monitoring
        –  Enable engineers to quickly assess and respond to systems issues
  3. Purpose of session
     The session will cover:
     •  An introduction to systems monitoring concepts
     •  Practical ideas and examples on how to monitor your website and AEM stack
     •  Using data to make correlations for root-cause analysis
     The session will not cover:
     •  Which monitoring software to use
     •  How to implement x or y feature in your monitoring software
     •  What alerting strategies you should use
  4. Goals of systems monitoring
     •  Maintain site availability
        –  Can users access the site?
     •  Identify performance issues
        –  Are users waiting too long?
     •  Troubleshoot problems
        –  How do I identify root cause?
     •  Identify long-term trends
        –  Is the application slowing down?
        –  Do we need faster hardware?
  5. Monitoring tools out there (not exhaustive)
     Open Source (free!)
     •  Nagios
     •  Icinga
     •  Zabbix
     SaaS
     •  Application-performance focused
        –  AppDynamics
        –  New Relic
     •  Boundary
     •  Datadog
  6. Monitoring software considerations
     What I have found most important
     •  Easy to use
        –  Has a convenient GUI
        –  Easy to add servers, applications
     •  Easy to view and interpret data
        –  Need to be able to view data and quickly make correlations
     •  Extensible
        –  Easy to customize, e.g. monitor Publisher listening on port 4506 instead of 4503
        –  Support for plugins and especially custom scripts, necessary for application-specific monitoring
     •  Other considerations
        –  Can the setup configs be version controlled in Git?
        –  Is there an API for the monitoring system, to create/modify configs?
     Tip: everyone’s needs are different; use what makes sense for you!
  7. Basic monitoring – CPU, network, disk
     Good questions to ask when monitoring these
     •  CPU Load Average
        –  What percentage of CPU is the application utilizing?
        –  Is there surplus CPU capacity left?
     •  Network Statistics
        –  e.g. Bytes in/out, Packets in/out
        –  How much traffic are our servers receiving?
        –  Do any network spikes correlate with slower application performance?
     •  Disk (IOPS, throughput)
        –  How much is the application utilizing the disk?
        –  Is the application hitting any Disk I/O thresholds?
     Tip: benchmark your Network and Disk I/O thresholds to discover your hardware limitations.
     Note: AEM may hit CPU limits well before CPU load reaches 100%. Threads often wait on another thread’s operation to complete, and until that thread finishes the rest are waiting or blocked. Slowness can therefore begin at 50%-75% CPU utilization.
  8. Simple web monitoring – must-haves
     HTTP code check
        mc-macbook-2:~ mc$ curl -I http://www.citytechinc.com/us/en.html
        HTTP/1.1 200 OK
     Content-based checks
        mc-macbook-2:~ mc$ curl -s http://www.citytechinc.com/us/en.html | grep 'CITYTECH, Inc. all rights reserved'
        © 2015 CITYTECH, Inc. all rights reserved
     Content-based checks with timeout
        mc-macbook-2:~ mc$ curl --max-time 30 -s http://www.citytechinc.com/us/en.html | grep 'CITYTECH, Inc. all rights reserved'
        © 2015 CITYTECH, Inc. all rights reserved
     Response time check
        mc-macbook-2:~ mc$ time curl -s http://www.citytechinc.com/us/en.html | grep 'CITYTECH, Inc. all rights reserved' >/dev/null 2>&1
        real 0m0.195s
        user 0m0.007s
        sys 0m0.006s
     Tip: install content-based checks on each Publish and Dispatcher instance. That way you can quickly detect which instance has a failure.
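
The status, content, timeout, and response-time checks on this slide can be folded into one probe. The following is a minimal sketch in Python using only the standard library; the function names `evaluate_check` and `probe` and the 5-second response budget are illustrative assumptions, not part of the talk.

```python
import time
import urllib.error
import urllib.request


def evaluate_check(status, body, elapsed, expected_text, max_seconds):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if status != 200:
        problems.append(f"unexpected HTTP status {status}")
    if expected_text not in body:
        problems.append("expected text not found in page body")
    if elapsed > max_seconds:
        problems.append(f"slow response: {elapsed:.3f}s > {max_seconds}s")
    return problems


def probe(url, expected_text, max_seconds=5.0, timeout=30.0):
    """Fetch url once and evaluate status, content, and response time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status, body = err.code, ""
    except OSError as err:
        # Covers URLError, timeouts, and refused connections.
        return [f"request failed: {err}"]
    elapsed = time.monotonic() - start
    return evaluate_check(status, body, elapsed, expected_text, max_seconds)
```

Running `probe` against each Publish and Dispatcher instance separately, as the tip suggests, shows which instance is failing rather than only that the site is.
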
  9. Simple web monitoring – Apache performance stats
     Apache, mod_status module
     •  Provides performance statistics
     •  Note: the status path, e.g. /server-status, should not be reachable from the public internet
        root@Client Prod CQ Disp 1a i-a678d2db:~# curl -s http://localhost/server-status | html2text | more
        ****** Apache Server Status for localhost ******
        Server Version: Apache/2.2.15 (Unix) Communique/4.1.2 mod_ssl/2.2.15 OpenSSL/1.0.1e-fips
        Server Built: Jul 18 2014 02:31:29
        ====================================================================
        Current Time: Wednesday, 29-Jul-2015 01:37:00 GMT
        Restart Time: Sunday, 26-Jul-2015 03:39:30 GMT
        Parent Server Generation: 4
        Server uptime: 2 days 21 hours 57 minutes 30 seconds
        Total accesses: 3430869 - Total Traffic: 114.6 GB
        CPU Usage: u43.79 s19.41 cu0 cs0 - .0251% CPU load
        13.6 requests/sec - 477.1 kB/second - 35.0 kB/request
        41 requests currently being processed, 21 idle workers
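
For scripted collection, mod_status also offers a machine-readable variant at `/server-status?auto`, which prints `Key: value` lines such as `BusyWorkers` and `IdleWorkers`. The sketch below parses that text and computes worker utilization; it assumes the `?auto` output has already been fetched (for example with curl), and the function names are hypothetical.

```python
def parse_status_auto(text):
    """Turn 'Key: value' lines from server-status?auto into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            stats[key.strip()] = value.strip()
    return stats


def busy_ratio(stats):
    """Fraction of Apache workers currently handling a request."""
    busy = int(stats["BusyWorkers"])
    idle = int(stats["IdleWorkers"])
    total = busy + idle
    return busy / total if total else 0.0
```

With the figures from the slide (41 busy, 21 idle) roughly two thirds of the workers are busy, which is the kind of value worth graphing over time.
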
  10. Web monitoring – STM / RUM – nice to have
      Synthetic Transaction Monitoring (STM)
      •  Also known as active monitoring: website monitoring that is done using web browser emulation or scripted recordings of web transactions.
      •  Examples
         –  Selenium
         –  Neustar
         –  Keynote
      •  Advantages
         –  Repeatable process
            •  e.g. can ensure that the process of “login, add product to shopping cart, checkout” works between code releases
         –  Can be used as a control
         –  Cheap
      •  Disadvantages
         –  Monitors only what you decided to test against
         –  Not as thorough as RUM
  11. Web monitoring – RUM / STM – nice to have
      Real User Monitoring (RUM)
      •  A passive monitoring technology that records all user interaction with a website, or a client interacting with a server or cloud-based application.
      •  Examples
         –  Google Analytics
         –  New Relic
         –  Keynote
         –  Many, many more
      •  Advantages
         –  Real-user “testing” data
         –  Monitoring for issues as they occur
         –  Identifies browser-related issues
      •  Disadvantages
         –  Expensive
         –  Too much information (information overload)
  12. Adobe WEM monitoring – basic checks for Author, Publisher
      Ports to monitor (are they accessible?)
      •  Author – 4502
      •  Publisher – 4503
      Suggested pages to monitor
      •  Sling login page - /system/sling/cqform/defaultlogin.html
         –  Should always work!
         –  Response times almost always the same
         –  If the Sling login page is up but, for example, the homepage is not, that can indicate a content or code-related issue
            curl -s http://localhost:4503/system/sling/cqform/defaultlogin.html | grep QUICKSTART_HOMEPAGE
            <!-- QUICKSTART_HOMEPAGE - (string used for readyness detection, do not remove) -->
      •  Homepage, important landing pages
         –  If the Publisher hosts multiple farms and host-specific Sling mappings are used, you may need to pass a Host header:
            curl -H "Host: www.citytechinc.com" http://localhost:4503/us/en.html
         –  The above example is another reason why a customizable monitoring solution is needed
            •  Nagios has a check_http plugin that supports sending host headers with requests
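
The Host-header check above translates directly to a scripted probe. This is a sketch using Python's standard `urllib.request`; the helper names `build_publisher_request` and `page_contains` are made up for illustration, and the default `localhost:4503` matches the Publisher port listed on the slide.

```python
import urllib.request


def build_publisher_request(path, site_host, publisher="localhost:4503"):
    """Build a request to one Publisher that carries the public site's Host header."""
    req = urllib.request.Request(f"http://{publisher}{path}")
    # An explicit Host header lets host-specific Sling mappings resolve the page.
    req.add_header("Host", site_host)
    return req


def page_contains(req, marker, timeout=30.0):
    """Send the request and report whether marker appears in the response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return marker in resp.read().decode("utf-8", errors="replace")
```

For example, `page_contains(build_publisher_request("/us/en.html", "www.citytechinc.com"), "all rights reserved")` would mirror the curl command on the slide, run once per Publisher.
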
  13. Adobe WEM monitoring – error.log, critical errors
      Files to monitor
      •  error.log, keywords (AEM 5.5, 5.6, although some may still be applicable to 6.x)
         –  Critical errors
            •  OutOfMemoryMonitor CQ shutting down
            •  StackOverflowError
            •  Maximum threads reached
            •  Java OutOfMemoryErrors, e.g.
               –  java.lang.OutOfMemoryError: unable to create new native thread
            •  too many open files
         –  Non-critical errors (error count is useful)
            •  RecursionTooDeepException
            •  Failed to mmap tar file / java.lang.OutOfMemoryError: Map failed
  14. Adobe WEM monitoring – error.log, repository related
      Files to monitor
      •  error.log, repository-related keywords
         –  Critical errors
            •  tar files read-only
         –  Non-critical errors (error count is useful, with alarm set when threshold is exceeded)
            •  failed to retrieve state of(.+)node
            •  failed to retrieve state of intermediary node
            •  Failed to read bundle
            •  Repository error during page import
            •  Unable to create version
            •  lucene(.+)Unknown(.+)node
            •  lucene(.+)query result node
      Tip: When encountering important repository errors, make sure to update your monitoring software to detect them!
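
The keyword lists on the two error.log slides lend themselves to a small scanner: critical keywords raise an alert on the first hit, non-critical ones are counted so an alarm can fire when a threshold is exceeded. The sketch below uses a subset of the keywords from the slides; the function name `scan_error_log` and the pattern grouping are illustrative choices.

```python
import re
from collections import Counter

# Subset of the critical keywords listed on the slides.
CRITICAL = [re.compile(p) for p in (
    r"OutOfMemoryMonitor CQ shutting down",
    r"StackOverflowError",
    r"Maximum threads reached",
    r"unable to create new native thread",
    r"too many open files",
    r"tar files read-only",
)]

# Subset of the non-critical keywords; only their counts matter.
NON_CRITICAL = [re.compile(p) for p in (
    r"RecursionTooDeepException",
    r"failed to retrieve state of(.+)node",
    r"Failed to read bundle",
    r"Repository error during page import",
    r"Unable to create version",
)]


def scan_error_log(lines):
    """Return (critical_lines, counts) where counts is keyed by pattern text."""
    critical = []
    counts = Counter()
    for line in lines:
        if any(p.search(line) for p in CRITICAL):
            critical.append(line.rstrip("\n"))
            continue
        for p in NON_CRITICAL:
            if p.search(line):
                counts[p.pattern] += 1
    return critical, counts
```

A monitoring wrapper would feed it only the lines written since the previous run and compare each count against its own threshold.
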
  15. Adobe WEM monitoring – error.log
  16. Adobe WEM monitoring – access.log
      Files to monitor
      •  access.log
         –  HTTP code frequency, e.g.
            –  200 Success
            –  302 Redirect
            –  403 Forbidden
            –  404 Not Found
            –  500 Internal Server Error
         # tail access.log
         127.0.0.1 - anonymous 04/Aug/2015:20:30:54 +0000 "GET /us/en.html HTTP/1.1" 200 22572 "-" "-"
         127.0.0.1 - anonymous 04/Aug/2015:20:30:55 +0000 "GET /content/citytech/global/en.html HTTP/1.1" 200 22598 "-" "curl agent, CTMSP monitoring"
      Tip: graph these stats in order to correlate trends and spot possible page issues or anomalies (RUM and ELK do this excellently)
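
Counting HTTP codes from access.log takes only a few lines. The sketch below assumes the log layout shown in the sample lines on this slide (quoted request line followed by the status code); `status_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the quoted request line and captures the status code after it.
STATUS_RE = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3}) ')


def status_counts(lines):
    """Count how often each HTTP status code appears in access.log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Emitting these counts once per interval gives the per-code series that the tip recommends graphing.
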
  17. Adobe WEM monitoring – access.log, cont.
  18. Adobe WEM monitoring – access.log, cont.
      Files to monitor
      •  access.log
         –  Cache-busting requests
            •  Contains query strings, e.g.
               –  http://www.citytechinc.com/us/en.html?hi=test
            •  Extensionless, e.g.
               –  GET /athletes/athletes.34360.html/career
         –  Extensions
            •  .js, .css
            •  Images - .bmp, .jpg, .jpeg, .png
      Tip: calculate the percentage of cache-busting requests over time as a baseline to compare against.
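
The baseline percentage from the tip can be computed from the same access.log lines. The sketch below applies a simple heuristic drawn from the two examples on the slide: a request counts as cache-busting if it carries a query string or if its last path segment has no extension. The helper names are hypothetical, and a real Dispatcher configuration may cache or bypass other request shapes.

```python
import re

# Captures the request target from the quoted request line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def is_cache_busting(target):
    """Heuristic: query string present, or last path segment has no extension."""
    path, sep, _query = target.partition("?")
    if sep:
        return True
    last_segment = path.rsplit("/", 1)[-1]
    return "." not in last_segment


def cache_busting_percent(lines):
    """Percentage of GET/HEAD requests in lines that look cache-busting."""
    total = busting = 0
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        total += 1
        if is_cache_busting(match.group(1)):
            busting += 1
    return 100.0 * busting / total if total else 0.0
```
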
  19. Adobe WEM monitoring – access.log, cont.
  20. Adobe WEM monitoring – request.log
      Files to monitor
      •  request.log
         –  Looks like
            root@Citytech Prod CQ Pub 1c i-cbdd5ba9:/var/log/cq5# tail request.log
            26/Jul/2015:01:10:23 +0000 [2774404] -> GET /content/citytech/global/en.html HTTP/1.1
            26/Jul/2015:01:10:23 +0000 [2774404] <- 200 text/html 229ms
            26/Jul/2015:01:10:25 +0000 [2774405] -> GET /system/sling/cqform/defaultlogin.html HTTP/1.1
            26/Jul/2015:01:10:25 +0000 [2774405] <- 200 text/html 3ms
            26/Jul/2015:01:10:28 +0000 [2774407] -> GET /us/en.html HTTP/1.1
            26/Jul/2015:01:10:28 +0000 [2774407] <- 200 text/html 222ms
         –  Can obtain list of response times with rlog.jar
            java -jar /opt/adobe-cq5.6.1/publish/crx-quickstart/opt/helpers/rlog.jar -n 50 -xdev /var/log/cq5/request.log
      Tip: create a top 100 list of the slowest page requests over 5-minute intervals in order to spot poorly performing pages
  21. Adobe WEM monitoring – request.log
      07/24/2015 01:50:27 PM ------------- Fri Jul 24 18:50:26 GMT 2015 --------------
      *Info * Parsed 1135 requests.
      *Info * Time for parsing: 72ms
      *Info * Time for sorting: 3ms
      *Info * Total Memory: 110mb
      *Info * Free Memory: 109mb
      *Info * Used Memory: 1mb
      ------------------------------------------------------
      7165ms 24/Jul/2015:18:48:16 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3Aproduct%2Fulocks%2Fshimano text/html
      7020ms 24/Jul/2015:18:46:54 +0000 200 GET /en/home/products/show_products.html?tag=trek-americas%3Abrand%2Ftrek&tag=trek-americas%3Aproduct%2Fulocks%2FnonLocking text/html
      6643ms 24/Jul/2015:18:46:54 +0000 200 GET /en/home/style/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3AstyleCollection%2Fcamelot text/html
      6001ms 24/Jul/2015:18:46:54 +0000 200 GET /en/home/style/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3AstyleCollection%2Fbrookshire text/html
      4979ms 24/Jul/2015:18:46:53 +0000 200 GET /en/home/products/show_products.html?tag=trek-americas%3Abrand%2Ftrek&tag=trek-americas%3Aproduct%2Fhandlesets%2FtwoSidesKeyed text/html
      4074ms 24/Jul/2015:18:48:16 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3Aproduct%2Fknobs%2FnonLocking text/html
      3357ms 24/Jul/2015:18:46:39 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3Aproduct%2Fshifters&tag=bikes-us%3Aproduct%2Fshifters%2FoneSideKeyed text/html
      1031ms 24/Jul/2015:18:46:02 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us:brand/trek&tag=bikes-us:product/shifters&tag=bikes-us:product/shifters/oneSideKeyed text/html
      925ms 24/Jul/2015:18:46:44 +0000 200 GET /content/bikes-us/en/home/search.html?searchQuery=user+and+alarm+programming text/html
      818ms 24/Jul/2015:18:46:36 +0000 200 GET /en/home/style/design-guides/style-evolution-2014.html text/html
      528ms 24/Jul/2015:18:49:38 +0000 200 GET /en/home/products/F51ACCFFF.html?bck=@@bikes-us:brand/trek@@bikes-us:product/ulocks/keyedLock@@bikes-us:product/ulocks text/html
      456ms 24/Jul/2015:18:47:00 +0000 200 GET /content/dam/bikes-us/product-images/F10%20%28F75%29/F10ACC622ADD.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
      361ms 24/Jul/2015:18:49:38 +0000 200 GET /content/bikes-us/en/home/search.html?searchQuery=AL+SERIES text/html
      306ms 24/Jul/2015:18:47:01 +0000 200 GET /content/dam/bikes-us/product-images/F10%20%28F75%29/F10BRW625GRW.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
      292ms 24/Jul/2015:18:49:42 +0000 200 GET /en/home/faq.html?id=42 text/html
      272ms 24/Jul/2015:18:47:01 +0000 200 GET /content/dam/bikes-us/product-images/BE469NX/BE469NXCEN626.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
      244ms 24/Jul/2015:18:47:27 +0000 200 GET /content/bikes-us/en/home.html text/html
      236ms 24/Jul/2015:18:47:02 +0000 200 GET /content/dam/bikes-us/product-images/F10%20%28F75%29/F10PLY716GRW.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
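
Where rlog.jar is not available, a similar "slowest requests" list can be built by pairing the `->` and `<-` lines of request.log on their request id. This is a sketch under the assumption that the log follows the layout shown on the request.log slide; `slowest_requests` is a hypothetical name and this is not how rlog.jar itself is implemented.

```python
import heapq
import re

# "[id] -> METHOD /path HTTP/1.1" marks the start of a request.
REQ_RE = re.compile(r"\[(\d+)\] -> (\S+) (\S+)")
# "[id] <- STATUS content/type 229ms" marks its completion.
RESP_RE = re.compile(r"\[(\d+)\] <- (\d{3}) .*?(\d+)ms\s*$")


def slowest_requests(lines, n=100):
    """Return up to n (milliseconds, path) tuples, slowest first."""
    pending = {}   # request id -> path, for requests still in flight
    finished = []  # (milliseconds, path) for completed requests
    for line in lines:
        match = REQ_RE.search(line)
        if match:
            pending[match.group(1)] = match.group(3)
            continue
        match = RESP_RE.search(line)
        if match and match.group(1) in pending:
            finished.append((int(match.group(3)), pending.pop(match.group(1))))
    return heapq.nlargest(n, finished)
```

Feeding it only the lines from the last five minutes yields the interval-based top list suggested in the tip.
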
  22. Adobe WEM monitoring – thread count
      WEM request thread count
      •  Why is this important?
         –  The default maximum number of request threads is 200
         –  Hitting the maximum can indicate a spike in traffic or application slowness
      •  How do I view?
         System console: http://i-cbdd5ba9.citytech-prod.ctmsp.com:4503/system/console/status-Threads
         Thread #768010/10.87.66.63 [1437866326422] <closed> [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
         Thread #2228/127.0.0.1 [1437868798545] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
         Thread #2196/127.0.0.1 [1437868798613] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
         Thread #768030/127.0.0.1 [1437868798881] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
         Thread #937774/127.0.0.1 [1437868798896] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
         Thread #767978/127.0.0.1 [1437868798909] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
         Thread #767927/127.0.0.1 [1437868802472] <closed> [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
         Thread #767940/64.6.160.57 [1437868802408] GET /system/console/status-Threads HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
         Thread #767982/64.6.160.57 [1437868802410] <parse> [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
         Thread #87/ActivityServiceImpl [priority=5, alive=true, daemon=true, interrupted=false, loader=java.net.URLClassLoader@42472d48]
         Thread #109/Adobe Granite Offloading job cloner queue processor [priority=5, alive=true, daemon=true, interrupted=false, loader=java.net.URLClassLoader@42472d48]
      •  Obtaining request thread count (easy curl command)
         curl -s -u 'admin:_insert_password_here' http://localhost:4503/system/console/status-Threads | grep -E 'GET|POST' | wc -l
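
The same count can be taken from the status-Threads text inside a monitoring script, with a warning once usage approaches the 200-thread default mentioned above. This sketch assumes one thread per line, as in the dump on the slide; the function name and the 80% warning ratio are illustrative choices, not AEM settings.

```python
import re

# A request thread shows "] GET /path HTTP/..." or "] POST /path HTTP/...".
REQUEST_THREAD_RE = re.compile(r"\] (?:GET|POST) \S+ HTTP/")


def count_request_threads(dump, max_threads=200, warn_ratio=0.8):
    """Return (active_request_threads, near_limit) for a status-Threads dump."""
    active = sum(
        1 for line in dump.splitlines() if REQUEST_THREAD_RE.search(line)
    )
    return active, active >= max_threads * warn_ratio
```

Note that the request fetching the status page is itself counted, as it is with the grep approach.
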
  23. Adobe WEM monitoring – JCR queries
      Slow Queries, Popular Queries
      •  AEM built-in
      •  Displays top 15 slowest JCR queries
      •  Example:
         /usr/bin/java -jar /usr/local/bin/cmdline-jmxclient-0.10.3.jar - localhost:12345 com.adobe.granite:type=QueryStat SlowQueries
  24. Adobe WEM monitoring – JCR queries
      ------------- 07/26/2015 01:44:33 +0000 org.archive.jmx.Client SlowQueries: --------------
      creationTime: Sun Jul 26 01:40:06 GMT 2015
      duration: 2788ms
      language: xpath
      occurrenceCount: 1
      position: 1
      statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/shifters' and jcr:content/@cq:tags = 'trek-americas:product/shifters/nonLocking' and jcr:content/@cq:tags = 'trek-americas:brand/trek'))]

      creationTime: Sun Jul 26 01:36:34 GMT 2015
      duration: 1766ms
      language: xpath
      occurrenceCount: 8729
      position: 2
      statement: /jcr:root/var/eventing/jobs//element(*, slingevent:Job)[jcr:contains(., '/com/day/cq/replication/job') and not(@slingevent:finished)]

      creationTime: Sun Jul 26 01:40:33 GMT 2015
      duration: 809ms
      language: xpath
      occurrenceCount: 1
      position: 3
      statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/shifters' and jcr:content/@cq:tags = 'trek-americas:product/shifters/nonLocking' and jcr:content/@cq:tags = 'trek-americas:brand/trek'))]

      creationTime: Sun Jul 26 01:41:15 GMT 2015
      duration: 790ms
      language: xpath
      occurrenceCount: 1
      position: 4
      statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:brand/trek' and jcr:content/@cq:tags = 'trek-americas:product/ulocks' and jcr:content/@cq:tags = 'trek-americas:product/ulocks/titanium'))]

      creationTime: Sun Jul 26 01:40:05 GMT 2015
      duration: 782ms
      language: xpath
      occurrenceCount: 1
      position: 5
      statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/levers' and jcr:content/@cq:tags = 'trek-americas:electric/zwave' and jcr:content/@cq:tags = 'trek-americas:product/ulocks/titanium'))] order by jcr:content/content-par/productdetail/@releasedate descending

      Tip: By default the slow query statistic covers all queries since AEM startup. The counter can be reset, however, if you want, for example, 10-minute “summaries” of the slowest queries.
  25. Adobe WEM monitoring – misc.
      Other possible things to monitor
      •  Running workflows
      •  Bundle status - installed, active
      •  Replication queues - total, blocked
      Data for all of the above can be obtained via curl!
  26. JVM monitoring – heap usage
      Heap usage
      •  Useful for viewing AEM memory usage and GC issues
      •  Can be obtained via JMX
         –  Example using the free cmdline-jmxclient.jar tool:
            # java -jar /usr/local/bin/cmdline-jmxclient.jar - i-d4bb64dd.ct-prod.ctmsp.com:12345 'java.lang:name=PS Old Gen,type=MemoryPool' Usage
            07/26/2015 20:12:20 +0000 org.archive.jmx.Client Usage:
            committed: 4462215168
            init: 894828544
            max: 14316601344
            used: 4158743792
      •  Also viewable via the jmap command
            # jmap -heap 31470
            Attaching to process ID 31470, please wait...
            Debugger attached successfully.
            Server compiler detected.
            JVM version is 20.5-b03
            using thread-local object allocation.
            Parallel GC with 1 thread(s)
            - additional output trimmed -
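
The `Usage` output above is easy to turn into a single percentage for graphing or alerting. The sketch below parses the four fields printed by the JMX client on this slide and computes used-over-max; the function names are hypothetical, and the check for a non-positive `max` reflects that a memory pool may report its maximum as undefined.

```python
import re

# Picks out the four MemoryUsage fields wherever they appear in the output.
USAGE_RE = re.compile(r"\b(committed|init|max|used):\s*(-?\d+)")


def parse_usage(output):
    """Extract the MemoryUsage fields (in bytes) from JMX client output."""
    return {key: int(value) for key, value in USAGE_RE.findall(output)}


def used_percent(usage):
    """Percentage of the pool's maximum size currently in use."""
    if usage["max"] <= 0:
        raise ValueError("pool reports no defined maximum")
    return 100.0 * usage["used"] / usage["max"]
```

For the sample on the slide this comes to about 29% of the old generation in use.
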
  27. JVM monitoring – heap usage
  28. JVM monitoring – heap usage issues
  29. JVM Monitoring – GC pause times
      Why monitor JVM pause times?
      •  These are “stop-the-world” events where the application is unresponsive due to JVM garbage collection
      •  Sometimes JVM garbage collection is not successful, so GCs run constantly because memory cannot be freed; this incurs serious CPU usage
      •  Should be monitored since it can be a performance hit
      How to monitor?
      •  Pause times can be added to stdout via JVM options
            -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
            2015-07-27T18:50:30.212+0000: [Full GC [PSYoungGen: 98121K->0K(6107264K)] [ParOldGen: 6144935K->1561525K(6291456K)] 6243056K->1561525K(12398720K) [PSPermGen: 193509K->193465K(193600K)], 5.7558230 secs] [Times: user=22.98 sys=0.00, real=5.75 secs]
            2015-07-27T18:50:42.432+0000: [GC [PSYoungGen: 5916288K->81734K(5998080K)] 7477813K->1643259K(12289536K), 0.1018320 secs] [Times: user=0.52 sys=0.00, real=0.10 secs]
      •  Pause times also can be added via:
            -XX:+PrintGCApplicationStoppedTime
            Total time for which application threads were stopped: 0.0001780 seconds
            Total time for which application threads were stopped: 0.0001920 seconds
      Tip: Even if you don’t have time to enable monitoring via JMX, at least print GC output to a log file for later analysis when AEM is slowing down!
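
Once GC output is being logged, long pauses can be pulled out automatically. The sketch below assumes the `-XX:+PrintGCDetails -XX:+PrintGCDateStamps` line layout shown on this slide and reads the wall-clock `real=` value; the name `long_pauses` and the one-second threshold are illustrative, and other collectors or newer JVMs format their logs differently.

```python
import re

# Timestamp, collection type, and wall-clock ("real") pause in seconds.
GC_RE = re.compile(r"^(\S+): \[(Full GC|GC).*real=(\d+\.\d+) secs\]")


def long_pauses(lines, threshold=1.0):
    """Return (timestamp, kind, seconds) for pauses at or above threshold."""
    hits = []
    for line in lines:
        match = GC_RE.search(line)
        if match and float(match.group(3)) >= threshold:
            hits.append((match.group(1), match.group(2), float(match.group(3))))
    return hits
```

Applied to the two sample lines above, only the 5.75-second Full GC would be reported.
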
  30. JVM Monitoring – GC pause times
  31. Summary
      •  Monitor the homepage and all landing pages, on every individual Publisher and Dispatcher
      •  Use AEM logs and tools to provide info on AEM status and performance (access/error/request logs, rlog.jar, thread status, slow queries page), and customize your monitoring to record this data
      •  Use JMX and verbose GC logging to record JVM heap usage and GC pause times
  32. References
      •  http://smartbear.com/articles/what-is-real-user-monitoring/
      •  https://en.wikipedia.org/wiki/Synthetic_monitoring
      •  https://docs.adobe.com/docs/en/cq/5-6-1/deploying/performance.html
      Contact Info: Michael Chan, michael.chan@icfi.com
