11. Fixing PRD
▸ What item is breaking PRD?
▸ Why doesn’t it let Sitecore start up?
▸ WinDbg: Also part of Debugging Tools for Windows
▸ Can load dump files
▸ Supports managed applications (SOS)
▸ Common commands, sequence and
SOS initialization: https://bit.ly/2EGIiN2
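For orientation, a typical SOS session on a .NET Framework dump boils down to a short command sequence like the following (the thread number and object address are placeholders, not values from this talk):

```
$$ Load SOS from the same folder as the loaded CLR (.NET Framework)
.loadby sos clr
$$ List managed threads and any exceptions attached to them
!threads
$$ Switch to thread 114, then dump its managed stack with parameters
~114s
!clrstack -p
$$ Inspect a managed object at a given address
!do <address>
```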
13. Event Queue
▸ Event not marked as done until processing completed (on web)
select * from dbo.EventQueue where InstanceData like '%d5729bfe-e486-4a37-8e0d-d1af8738a6aa%'
delete from dbo.EventQueue where InstanceData like '%d5729bfe-e486-4a37-8e0d-d1af8738a6aa%'
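Besides searching InstanceData, it can help to check how far each instance has processed the queue. Sitecore keeps a per-instance stamp in the Properties table; a hedged sketch (the key format can vary between versions):

```sql
-- Each instance records the stamp of the last processed event under
-- an EQStamp_<InstanceName> key in the Properties table.
select [Key], [Value] from dbo.Properties where [Key] like 'EQStamp%'
```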
15. Summary
▸ Data structure expected as tree, but received graph
▸ Recursion not handled in code
▸ StackOverflowException
▸ Event not marked as completed
17. Memory Leak
▸ Noticed by Operations / Monitoring
▸ Related to Index Rebuilds
▸ Got worse over time (more items)
18. Gathering hang dumps
▸ Using adplus in hang mode
▸ Two separate dumps: before and after
.\adplus.exe -p 11452 -o c:\temp\crash-analysis\memory-leak -hang
24. Further Research & Sitecore Support
▸ Service resolved in computed index fields
▸ https://github.com/aspnet/DependencyInjection/issues/456
▸ Answer from Sitecore Support:
The thing is that some of the Sitecore functionality depends on the
current version of the "Microsoft.Extensions.DependencyInjection.dll" file.
Unfortunately, replacing the mentioned file with a newer version was not
deeply tested, so it could cause some breaking changes and unexpected
behavior.
So, I kindly ask you consider upgrading your Sitecore instance to Sitecore
9.1.
26. RabbitMQ processing stuck
▸ Customer noticed that items are missing
▸ Middleware logs looked fine
▸ Messages were still in RabbitMQ
▸ Nothing in Sitecore logs
27. Gathering some crash dumps
▸ At least 2, some minutes apart
.\adplus.exe -p 34084 -hang -o C:\Temp\crash-analysis\import-stuck
34. Summary
▸ Query did not use exact match
▹ Tab interpreted as space by Solr
▸ Input data not sanitized
▸ Method stuck in endless loop
▸ Implemented sanitization in Middleware and Sitecore
▸ Re-coded unique lookup logic
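The sanitization can be sketched like this (Python for illustration; the real implementation lived in the middleware and in Sitecore, and the function name is an assumption):

```python
import re

def sanitize_title(raw):
    # Collapse tabs, newlines and repeated spaces into a single space,
    # so the stored value and the Solr query term can no longer differ
    # only by whitespace.
    return re.sub(r"\s+", " ", raw or "").strip()

print(sanitize_title("Annual\tReport  2019"))  # → "Annual Report 2019"
```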
David Szöke
Application Engineer at Unic in Switzerland
This talk is about debugging techniques
Because there is always the possibility that you missed some edge case while developing, testing or in the automated tests, which could break your environment
I’ve prepared 3 real-world cases which I thought might be interesting to you
We are going to look at how we approached these issues, what tools we used and how we tracked down the root cause
Let’s start with the first one
After executing the command, go and grab a coffee. Depending on the size of your application and type of error, this might take some minutes
-pmn: defines the process name to be monitored. ADPlus will keep watching and attach as soon as a process with this name starts
We are also using the -FullOnFirst argument, which generates a full dump even if only a first-chance exception occurred.
-crash: Since the application is in fact crashing, we provide the -crash argument. Another option would be -hang, which we will cover in a later case
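Put together, an invocation for this scenario looks roughly like this (the process name and output path are placeholders, not the talk's actual command):

```
.\adplus.exe -crash -pmn w3wp.exe -FullOnFirst -o C:\Temp\crash-analysis
```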
After the dump has been generated, you will end up with something like in the screenshot
From here, we have several options: We can either directly go to hardcore mode and spin up windbg, or we can first run some automated analysis on the dump. Let’s start with the automated analysis first.
Microsoft provides a set of diagnostics tools called DebugDiag. Next to the analyzer, there is also a utility for gathering crash dumps. The nice thing is that it also provides a GUI, so if you didn’t like the previous CLI approach, you could use DebugDiag for that instead.
So let’s spin up DebugDiag Analysis and select our freshly generated crash dumps
DebugDiag will generate a report which can be displayed in the browser
In the .NET Threads Summary we can see that there is a StackOverflowException in thread 114
An alternative to DebugDiag is PostMortem. It’s a simple tool I wrote a few months ago to make crash dump analysis somewhat easier. You can clone it from my GitHub account.
Like DebugDiag, it will generate a report which you can display in the browser
It will not only tell us that there is a StackOverflowException, but also display the stack trace of the thread the exception happened in
Looking at the stack trace, we can see that there is clearly something going on in the ReferencedItemComputedField.GetRootId method
There was in fact a recursive call
User requirements:
Imported items can reference each other
We expected a tree, but got a graph
Fixed by implementing a circuit breaker: a list of visited items that is passed along with the recursion
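The fix can be sketched as follows (Python for illustration; the actual code was C# in ReferencedItemComputedField.GetRootId, and all names here are assumptions):

```python
# Hypothetical sketch of the circuit breaker: carry a set of visited
# item IDs through the recursion, so a cyclic reference graph cannot
# recurse until the stack overflows.
def get_root_id(item_id, references, visited=None):
    """references maps an item ID to the ID it points at (or None)."""
    if visited is None:
        visited = set()
    if item_id in visited:
        # Seen before: the "tree" is actually a graph -- stop here.
        return item_id
    visited.add(item_id)
    parent = references.get(item_id)
    return item_id if parent is None else get_root_id(parent, references, visited)

# A cycle of the kind that previously caused the StackOverflowException:
refs = {"a": "b", "b": "c", "c": "a"}
print(get_root_id("a", refs))  # → "a": the cycle is detected and recursion stops
```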
So we were able to fix the bug, but the system was still down and had to go up as soon as possible. Therefore, we had to find the conflicting item. Due to the fact that we had over 100’000 content items in 42 different language versions, we couldn’t just click through each item, but had to come up with a different solution.
Let’s enter windbg
It’s a multipurpose debugger by Microsoft. It was originally made for unmanaged applications, but Microsoft also provides a debugging extension called «Son Of Strike» allowing you to debug managed applications as well.
It’s more or less a CLI and has a pretty steep learning curve, but the more you use it, the more rewarding it’s going to get. Especially when you are able to track down and prove an issue.
So for starters, I created a short markdown file which lists some common commands and a way to initialize the SOS extensions.
With WinDbg open, we opened the dump of the StackOverflowException
WinDbg will automatically set you to the current thread where the exception occurred
We can now run the !clrstack -p command to display the stack trace, including the parameters
Input is highlighted in blue, important addresses are in yellow
Remote event only removed when processing complete
We were able to correlate the memory leaks to index rebuilds through monitoring tools and Sitecore logs
Unlike a crash dump, a hang dump is taken on demand from a running process: nothing has to crash first, we simply snapshot the current state
Switch to second dump
«Largest Size» is the memory the objects of a type occupy themselves; «Largest Retained Size» additionally counts everything that is kept alive only through those objects
It usually helps to compare these graphs with a dump of a vanilla Sitecore instance
<Explain similar retention>
Talk about _transientDisposables
We were able to confirm that the SampleService was in fact registered as transient
And indeed, the class also implements IDisposable.
And surely enough, it also allocates a byte array
We’ve also checked the references to the class and were able to confirm that it was resolved in a computed field
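The mechanism can be illustrated with a minimal container sketch (Python for illustration; the real container is Microsoft.Extensions.DependencyInjection, and the buffer size is an assumption):

```python
# Minimal illustration of the leak: a container that, like the affected
# DI version, keeps every transient disposable it creates so it can
# dispose them when the root provider is disposed -- which never
# happens while the site is running.
class Container:
    def __init__(self):
        self._transient_disposables = []  # grows for the provider's lifetime

    def resolve(self, factory):
        instance = factory()
        if hasattr(instance, "dispose"):
            # Tracked for later disposal, so never released while the
            # root provider is alive.
            self._transient_disposables.append(instance)
        return instance

class SampleService:
    def __init__(self):
        self.buffer = bytearray(1024 * 1024)  # like the byte[] on the slide

    def dispose(self):
        self.buffer = b""

container = Container()
for _ in range(100):  # e.g. resolutions during an index rebuild
    container.resolve(SampleService)
# 100 services (and their buffers) are still referenced:
print(len(container._transient_disposables))  # → 100
```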
The original problem was happening in an import job. I’ve ported the problem to an event handler, just for demo purposes and easier implementation
Again in hang mode
Note that there is also a -r parameter which can be used to take several dumps at a defined interval
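For example, to take three hang dumps 120 seconds apart (PID and path as in the earlier command; the counts are placeholders):

```
.\adplus.exe -hang -p 34084 -o C:\Temp\crash-analysis\import-stuck -r 3 120
```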
Run with multiple dumps in «CrashHangMode»
So if these were really Solr queries, they should show up in the search logs, right?
Next thing we did was to enable the search logs. By default, we had them disabled in the solution, as they were generating just too much data
Requirements
Ensure title is unique across all items
Use it while developing
Not much time investment is needed to get a good understanding of the tool