Sherlock Homepage - A detective story about running large web services - NDC Oslo

Maarten Balliauw, Developer Advocate
Sherlock Homepage
A detective story about running large web services.
Maarten Balliauw
@maartenballiauw
Site unavailable!
Primary website location unavailable!
No problem: traffic manager in front – phew!
Secondary location unavailable!
Website down…
Initial investigation & monitoring showed:
Primary & secondary website location instances all up
Machines available individually
Not through traffic manager and Azure load balancer
The cause…
Custom Azure load balancer probe
<LoadBalancerProbes>
  <LoadBalancerProbe name="HTTP" path="/api/status" protocol="http" port="80" />
</LoadBalancerProbes>
Implementation (StatusService.cs)
return new HttpStatusCodeWithBodyResult(
    AvailabilityStatusCode(galleryServiceAvailable),
    String.Format(CultureInfo.InvariantCulture,
        StatusMessageFormat,
        AvailabilityMessage(galleryServiceAvailable),
        AvailabilityMessage(sqlAzureAvailable),
        AvailabilityMessage(storageAvailable),
        AvailabilityMessage(searchServiceAvailable),
        AvailabilityMessage(metricsServiceAvailable),
        HostMachine.Name));
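If a probe's status code depends on a dependency that all instances share, every instance fails the probe at the same moment and the load balancer takes them all out of rotation. One possible mitigation is to let only critical dependencies affect the status code. The sketch below illustrates that idea; the class, method, and dependency names are hypothetical and not taken from StatusService.cs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: names are illustrative, not from StatusService.cs.
public static class ProbeStatus
{
    // Returns the HTTP status code the load balancer probe would see.
    // Only dependencies listed as critical can take the instance out of rotation;
    // a failing non-critical dependency is reported in the body but still yields 200.
    public static int StatusCodeFor(IDictionary<string, bool> available, ISet<string> critical)
    {
        bool criticalDown = available.Any(kv => critical.Contains(kv.Key) && !kv.Value);
        return criticalDown ? 503 : 200;
    }
}
```

Which dependencies count as critical is a design decision: the narrower the set, the less likely one slow shared service removes every instance at once.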
How did we find the issue?
Quote from “Mindhunter” (written by an FBI profiler):
You have to be able to re-create the crime scene in your head. You need
to know as much as you can about the victim so that you can imagine
how she might have reacted. You have to be able to put yourself in her
place as the attacker threatens her with a gun or a knife, a rock, his fists,
or whatever.
You have to be able to feel her fear as he approaches her. You have to be
able to feel her pain. You have to try to imagine what she was going
through when he tortured her. You have to understand what it’s like to
scream in terror and agony, realizing that it won’t help, that it won’t get
him to stop. You have to know what it was like.
http://highscalability.com/blog/2015/7/30/how-debugging-is-like-hunting-serial-killers.html
How did we find the issue?
Debugging requires a particular sympathy for the machine. You must be able to run
the machine and networks of machines in your mind while simulating what-ifs based
on mere wisps of insight.
Knowing the system you are working on – even by similarity
Empathy for what is going on in that system
A hunch based on prior experience / insights
Who am I?
Maarten Balliauw
Antwerp, Belgium
Software Engineer, Microsoft
Founder, MyGet
AZUG
Focus on web
ASP.NET MVC, Azure, SignalR, ...
Former MVP Azure & ASPInsider
Big passion: Azure
http://blog.maartenballiauw.be
@maartenballiauw
Shameless self promotion: Pro NuGet - http://amzn.to/pronuget2
History and Context
A bit of history
Site serves dependencies for .NET developers worldwide
On average good for ~8,000,000 to ~10,000,000 requests per day (on compute)
Built 4 years ago on top of a SQL database and OData services
Monolithic – site + OData service are the same app
Improvements over the years
Some rough times Q2 2015
Architecture overview (Q2 2015)
Front end servers (2 regions)
MVC + WCF OData
Search Service
Lucene based Web API
Search Service
(secondary region)
Lucene based Web API
Azure Storage
Lucene index
Download counts
Azure SQL Database
Metadata
Download counts
Jobs VMs
Create index from database
Create stats
Create download count reports
Did we solve the crime?
One of the services caused this…
SQL database?
Storage?
Search?
Metrics service?
return new HttpStatusCodeWithBodyResult(
    AvailabilityStatusCode(galleryServiceAvailable),
    String.Format(CultureInfo.InvariantCulture,
        StatusMessageFormat,
        AvailabilityMessage(galleryServiceAvailable),
        AvailabilityMessage(sqlAzureAvailable),
        AvailabilityMessage(storageAvailable),
        AvailabilityMessage(searchServiceAvailable),
        AvailabilityMessage(metricsServiceAvailable),
        HostMachine.Name));
Log spelunking
Check SQL database logs – we found we had none (fixed now)
Storage – storage statistics seemed stable
Search – no real pointers to issues there
Metrics service – very lightweight and has been stable for months
Start looking around at the crime scene!
IIS logs, event viewer on web servers
Profiling on web servers
Profiling the website
demo
It could have been search...
No real evidence though.
Profiling the search service
demo
Turns out it was search!
Profiling the search service revealed some things!
SearcherManager.cs checks Lucene index freshness on Get() – MaybeReopen()
StartReopen() blocks access to the index until finished
Part of HTTP request pipeline – blocking request handling
This was fixed by getting these calls out of the HTTP request path.
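Moving a refresh out of the request path can be sketched as follows: a timer reloads the resource in the background, and request threads only read the most recent snapshot. This is a minimal illustration, not the actual SearcherManager change; `BackgroundRefreshed<T>` and its members are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not the actual SearcherManager code.
public sealed class BackgroundRefreshed<T> : IDisposable where T : class
{
    private readonly Func<T> _load;
    private readonly Timer _timer;
    private T _current;

    public BackgroundRefreshed(Func<T> load, TimeSpan interval)
    {
        _load = load;
        _current = load(); // initial load happens once, up front
        _timer = new Timer(_ =>
        {
            // An exception escaping a timer callback would be unhandled,
            // so keep serving the previous snapshot (and log in real code).
            try { Refresh(); } catch (Exception) { }
        }, null, interval, interval);
    }

    // Request threads only read the latest snapshot; they never wait for a reload.
    public T Current
    {
        get { return Volatile.Read(ref _current); }
    }

    // Runs on a timer thread; publishes the new snapshot once it is fully loaded.
    public void Refresh()
    {
        T fresh = _load();
        Volatile.Write(ref _current, fresh);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

A slow reload only delays freshness, not requests. If a reload can take longer than the interval, overlapping refreshes would need to be guarded against as well.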
The suspect no longer had code available – used our informants
www.jetbrains.com/dotpeek
Search had some other flaws…
Actually also visible in the dotTrace snapshot we just saw: high GC!
Memory profiling the search service revealed some things! (I lost the actual traces.)
Search had some other flaws…
High memory traffic on reading download count # for search ranking
The source: DownloadLookup.cs#L18
Fixed by:
Reusing objects (instead of new)
JsonStreamReader instead of JObject.Parse(theWorld)
In the meanwhile…
Added additional monitoring
Added additional tracing
Started looking into using AppInsights for better insights into application behavior
Events happening on the website
Requests
Exceptions
Execution and dependency times (basic but continuous profiling)
Internal Server (T)Error during package restore
What we were seeing…
On the V2-based feeds:
500 Internal Server Error during package restore
Response time goes up while # of requests goes down
EventVwr on servers: lots of IIS crashes
Lots of crash dumps on web servers
So IIS crashes… Could it be?
HTTP.SYS tells us when things go wrong
D:\Windows\System32\LogFiles\HTTPERR
2015-07-31 01:46:34 - 60810 - 80 HTTP/1.1 GET /api/v2/FindPackagesById()?id='...' - 1273337584
Connection_Abandoned_By_ReqQueue 86cd3cb1-729c-425c-898f-b15b0330bc38
Connection_Abandoned_By_ReqQueue
“A worker process from the application pool has quit unexpectedly or orphaned a pending
request by closing its handle. Specific to Windows Vista and Windows Server 2008.”
Gift from the gods: crash dumps!
C:\Resources\Directory\31edcaa5186f…...DiagnosticStore\WAD0104\CrashDumps
Analyzing a crash dump
demo
An Exception crashes IIS?
Time to crank up the search engine queries!
Found a similar issue: unobserved task exceptions causing IIS to crash
// If metrics service is specified we post the data to it asynchronously.
if (_config != null && _config.MetricsServiceUri != null)
{
    // Disable warning about not awaiting async calls
    // because we are _intentionally_ not awaiting this.
#pragma warning disable 4014
    Task.Run(() => PostDownloadStatistics(id, version, …));
#pragma warning restore 4014
}
TaskScheduler.UnobservedTaskException += (object sender, UnobservedTaskExceptionEventArgs excArgs) =>
{
    // ... log it ...
    excArgs.SetObserved();
};
Tasks and fire-and-forget are evil!
An unobserved task exception can bring down the entire process
Handle unobserved task exceptions!
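Besides the global UnobservedTaskException handler, each fire-and-forget call can observe its own exception. The sketch below attaches a continuation that runs only when the task faults; `FireAndForget.Forget` and the `log` callback are hypothetical names, not part of the original code base.

```csharp
using System;
using System.Threading.Tasks;

public static class FireAndForget
{
    // Hypothetical helper: lets a task run un-awaited, but always observes its exception.
    // The returned continuation is only useful for diagnostics; it ends up cancelled
    // when the original task completes without faulting.
    public static Task Forget(this Task task, Action<Exception> log)
    {
        return task.ContinueWith(
            t => log(t.Exception.GetBaseException()), // reading t.Exception marks it as observed
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);
    }
}
```

With this in place, the intentionally un-awaited call would end in `.Forget(ex => /* log it */)`, so a failure in the metrics post is logged instead of surfacing later as an unobserved exception.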
High response times on the web server
What we were seeing…
High response times on the servers
Resulting in higher than normal CPU usage on the servers
Azure would often auto-scale additional instances
Profiling the web application still showed wait times with no obvious cause…
Eating donuts
Research
Reading and searching on what could be the cause of these issues
http://stackoverflow.com/questions/12304691/why-are-iis-threads-so-precious-as-compared-to-regular-clr-threads
http://www.monitis.com/blog/2012/06/11/improving-asp-net-performance-part3-threading/
https://msdn.microsoft.com/en-us/library/ms998549.aspx
http://blogs.msdn.com/b/tmarq/archive/2007/07/21/asp-net-thread-usage-on-iis-7-0-and-6-0.aspx
https://support.microsoft.com/en-us/kb/821268
Consider minIoThreads and minWorkerThreads for Burst Load
If your application experiences burst loads where there are prolonged periods of inactivity between the burst
loads, the thread pool may not have enough time to reach the optimal level of threads. A burst load occurs
when a large number of users connect to your application suddenly and at the same time. The
minIoThreads and minWorkerThreads settings enable you to configure a minimum number of worker
threads and I/O threads for load conditions.
The result
GUESS WHERE WE DID THE TWEAK… COMPARED TO LAST WEEK…
Making it permanent
Part of the cloud service startup script
# Increase the number of available IIS threads for high performance applications
# Uses the recommended values from http://msdn.microsoft.com/en-us/library/ms998549.aspx#scalenetchapt06_topic8
# Assumes running on two cores (medium instance on Azure)
&$appcmd set config /commit:MACHINE -section:processModel -maxWorkerThreads:100
&$appcmd set config /commit:MACHINE -section:processModel -minWorkerThreads:50
&$appcmd set config /commit:MACHINE -section:processModel -minIoThreads:50
&$appcmd set config /commit:MACHINE -section:processModel -maxIoThreads:100
# Adjust the maximum number of connections per core for all IP addresses
&$appcmd set config /commit:MACHINE -section:connectionManagement /+["address='*',maxconnection='240'"]
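For applications that cannot change machine-level configuration, the CLR thread pool minimums can also be inspected and raised from code. This is a sketch of an alternative, not what the startup script does; `ThreadPoolTuning` is a hypothetical name. Note that the processModel settings are per CPU, while `ThreadPool.SetMinThreads` takes process-wide totals.

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuning
{
    // Raises the CLR thread pool minimums if they are below the requested totals.
    // Returns false when the runtime rejects the values (for example, above the maximum).
    public static bool EnsureMinThreads(int minWorker, int minIo)
    {
        int worker, io;
        ThreadPool.GetMinThreads(out worker, out io);
        return ThreadPool.SetMinThreads(Math.Max(worker, minWorker), Math.Max(io, minIo));
    }
}
```

To approximate the per-core value of 50 used in the script, a caller could pass `50 * Environment.ProcessorCount` for both arguments at application startup.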
Package restore timeouts
What we were seeing…
On the V2-based feeds:
Package restore timeouts coming from the WCF OData service
Occurs every 7-15 hours, fixes itself ~15 minutes later
Extreme load times on Get(Id=,Version=) – probably the cause of these timeouts
No easy way to reproduce…
Happening only on production
Observation after RDP-ing in: 100% CPU when it happens
No way to profile continuously – AppInsights did show us the entry point
Donut time again
The thing we recently changed was minIOThreads and throughput
The slow code path is FindPackagesById()
Makes HTTP calls to search service
What could this setting and HTTP calls have in common…
HttpClient, Async and multithreading
Interesting article benchmarking HttpClient in async and multithreading scenarios
Async + HttpClient are not limited in terms of concurrency
Many CPUs and threads? Many HttpClients and requests
Many HttpClients and requests? Many TCP ports used on machine
Many TCP ports used? TCP port depletion
But aren’t ports reclaimed?
240 seconds TIME_WAIT (4 minutes)
Users also use up TCP ports
“As far as HTTP requests are concerned, a limit should always be set to
ServicePointManager.DefaultConnectionLimit. The limit should be large enough to allow a
good level of parallelism, but low enough to prevent performance and reliability problems (from the
exhaustion of ephemeral ports).”
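A rough back-of-the-envelope bound shows why TIME_WAIT matters. If every closed outbound connection holds its port for the full TIME_WAIT period, the sustainable rate of new connections is the number of ephemeral ports divided by that period. The sketch assumes the default Windows dynamic port range of 49152 to 65535 (16,384 ports); `PortBudget` is a hypothetical name.

```csharp
using System;

public static class PortBudget
{
    // Upper bound on new outbound connections per second before ephemeral ports
    // run out, assuming every closed connection lingers for the full TIME_WAIT.
    public static double MaxNewConnectionsPerSecond(int ephemeralPorts, int timeWaitSeconds)
    {
        return (double)ephemeralPorts / timeWaitSeconds;
    }
}
```

With 16,384 ports and a 240-second TIME_WAIT, the bound is about 68 new connections per second for the whole machine, which a busy server opening a connection per request can exceed.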
Limiting HttpClient async concurrency
Set ServicePointManager properties on startup
Nagling – “bundle traffic in properly stuffed TCP packets”
Expect100Continue – “only send out traffic if server says 100 Continue”
Both optimizations also disabled
// Tune ServicePointManager
// (based on http://social.technet.microsoft.com/Forums/en-US/windowsazuredata/thread/d84ba34b-b0e0-4961-a167-bbe7618beb83
// and https://msdn.microsoft.com/en-us/library/system.net.servicepointmanager.aspx)
ServicePointManager.DefaultConnectionLimit = 500;
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.Expect100Continue = false;
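A connection limit can be complemented by capping concurrency in the application itself. The sketch below uses a SemaphoreSlim to bound how many asynchronous operations run at once; this is a different technique from the ServicePointManager settings above, and `AsyncThrottle` is a hypothetical name.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: caps how many async operations are in flight at once.
public sealed class AsyncThrottle
{
    private readonly SemaphoreSlim _gate;

    public AsyncThrottle(int maxConcurrency)
    {
        _gate = new SemaphoreSlim(maxConcurrency, maxConcurrency);
    }

    public async Task<T> RunAsync<T>(Func<Task<T>> operation)
    {
        // Wait asynchronously for a free slot, so no thread is blocked while queued.
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            return await operation().ConfigureAwait(false);
        }
        finally
        {
            _gate.Release(); // free the slot even when the operation throws
        }
    }
}
```

A caller would wrap each outbound call, for example `throttle.RunAsync(() => client.GetStringAsync(url))` with a single shared HttpClient named `client` (both names assumed here).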
Some charts…
Memory pressure
What we were seeing…
Massive memory usage! Even when changing VM sizes.
100% of memory on a Medium Azure instance
100% of memory on a Large Azure instance
100% of memory on a X-Large Azure instance
What is eating this memory?
Memory profiling!
On the server?
Try to reproduce it?
Decided on the latter
Reproducing production traffic
demo
.NET Memory Management 101
Memory Allocation
.NET runtime reserves a region of address space for every new process: the managed heap
Objects are allocated in the heap
Allocating memory is fast: it’s just advancing a pointer
Some unmanaged memory is also consumed (not GC-ed)
.NET CLR, Dynamic libraries, Graphics buffer, …
Memory Release or “Garbage Collection” (GC)
GC releases objects no longer in use by examining application roots
GC builds a graph that contains all the objects that are reachable from these roots.
Object unreachable? GC removes the object from the heap, releasing memory
After the object is removed, GC compacts reachable objects in memory.
Generations
Managed heap divided in segments: generation 0, 1 and 2
New objects go into Gen 0
Gen 0 full? Perform GC and promote all reachable objects to Gen 1. This is typically pretty fast.
Gen 1 full? Perform GC on Gen 1 and Gen 0. Promote all reachable objects to Gen 2.
Gen 2 full? Perform full GC (2, 1, 0). If not enough memory for new allocations, throws OutOfMemoryException
Full GC has performance impact since all objects in managed heap are verified.
Generation 0: short-lived objects (e.g. local variables)
Generation 1: in-between objects
Generation 2: long-lived objects (e.g. the app’s main form)
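Promotion between generations can be observed directly with `GC.GetGeneration`. The sketch below keeps one object alive across two forced collections and records its generation each time; `GenerationDemo` is a hypothetical name, and forcing collections like this is for demonstration only.

```csharp
using System;

public static class GenerationDemo
{
    // Returns the generation of one surviving object at allocation time
    // and after each of two forced collections.
    public static int[] Observe()
    {
        var survivor = new object();
        int atBirth = GC.GetGeneration(survivor);
        GC.Collect();
        int afterFirst = GC.GetGeneration(survivor);
        GC.Collect();
        int afterSecond = GC.GetGeneration(survivor);
        GC.KeepAlive(survivor); // keep the object reachable until this point
        return new[] { atBirth, afterFirst, afterSecond };
    }
}
```

The object starts in Gen 0 and is typically promoted one generation per collection it survives, never beyond `GC.MaxGeneration`.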
Large Object Heap
Large objects (85,000 bytes or more) are stored in a separate segment of the managed heap: the Large Object Heap (LOH)
Objects in LOH collected only during full garbage collection
Surviving objects in LOH are not compacted (by default). This means that LOH becomes fragmented over time.
Fragmentation can cause OutOfMemoryException
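Both LOH properties can be checked from code. The runtime reports LOH objects as belonging to the highest generation, and since .NET 4.5.1 a one-off LOH compaction can be requested through `GCSettings`. `LargeObjectHeapDemo` is a hypothetical name; the sketch is illustrative rather than a tuning recommendation.

```csharp
using System;
using System.Runtime;

public static class LargeObjectHeapDemo
{
    // An array this large is allocated on the LOH, which the runtime
    // reports as the highest generation.
    public static int GenerationOfLargeArray()
    {
        var buffer = new byte[85000];
        int generation = GC.GetGeneration(buffer);
        GC.KeepAlive(buffer);
        return generation;
    }

    // Asks the runtime to compact the LOH during the next blocking full collection,
    // then triggers that collection.
    public static void CompactLargeObjectHeapOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();
    }
}
```

Compaction of the LOH is expensive, so it is usually better to reduce large allocations (for example by reusing buffers) than to compact regularly.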
The .NET garbage collector
Simulates “infinite memory” by removing objects no longer needed
When does it run? Vague… But usually:
Out of memory condition – when the system fails to allocate or re-allocate memory
After some significant allocation – if X memory is allocated since previous GC
Failure of allocating some native resources – internal to .NET
Profiler – when triggered from profiler API
Forced – when calling methods on System.GC
Application moves to background
GC is not guaranteed to run
http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047586.aspx
http://blogs.msdn.com/b/abhinaba/archive/2008/04/29/when-does-the-net-compact-framework-garbage-collector-run.aspx
Analyzing memory usage
demo
So our DI container? NInject?
Our memory profiling confirms it. The retained EntitiesContext also retains entities and SQL connections.
Spelunking the NInject source code, we found the GarbageCollectionCachePruner responsible for releasing objects.
Runs every 30 seconds (timer)
Releases objects only if GC happened in that time
GC is not guaranteed to run, so NInject potentially never releases objects
Known, old bug.
https://groups.google.com/forum/#!topic/ninject/PQNMIsQhCvE
http://stackoverflow.com/questions/16775362/ninject-caching-object-that-should-be-disposed-memoryleak
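The failure mode can be illustrated with a deliberately simplified pruner that releases tracked instances only when it notices that a garbage collection ran since its last check. This is not Ninject's GarbageCollectionCachePruner; `GcTriggeredPruner` and its members are hypothetical and only mimic the dependency on GC activity.

```csharp
using System;
using System.Collections.Generic;

// Illustration only; this is not Ninject's GarbageCollectionCachePruner.
public sealed class GcTriggeredPruner
{
    private readonly List<IDisposable> _tracked = new List<IDisposable>();
    private int _lastSeenCollections = GC.CollectionCount(0);

    public void Track(IDisposable instance)
    {
        _tracked.Add(instance);
    }

    public int TrackedCount
    {
        get { return _tracked.Count; }
    }

    // Meant to be called on a timer. Does nothing unless a collection ran since
    // the last call, so tracked instances stay alive while the GC stays idle.
    public bool PruneIfCollected()
    {
        int collections = GC.CollectionCount(0);
        if (collections == _lastSeenCollections) return false;

        _lastSeenCollections = collections;
        foreach (var instance in _tracked) instance.Dispose();
        _tracked.Clear();
        return true;
    }
}
```

As long as no collection happens between timer ticks, `PruneIfCollected` keeps returning false and everything tracked, including whatever those instances reference, remains in memory.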
Replacing our DI container (Autofac)
Perform replacement
Run same analysis on new codebase
and verify objects are freed
Once deployed:
Immediate drop in response times
Memory usage now stable at ~4 GB
Conclusion
Debugging requires a particular sympathy for the machine. You must be
able to run the machine and networks of machines in your mind while
simulating what-ifs based on mere wisps of insight.
Bugs hide. They blend in. They can pass for "normal" which makes them tough to find.
One bug off the streets doesn’t mean all of them are gone. Sometimes removing one exposes another.
Know your system, know your tools, know your options. Look for evidence.
Profilers (performance and memory), dump files, AppInsights and others
Dive in. It builds experience and makes solving the next crime scene easier.
https://msdn.microsoft.com/en-us/library/ee817663.aspx
Thank you!
http://blog.maartenballiauw.be
@maartenballiauw
http://amzn.to/pronuget2
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
Maarten Balliauw1.2K views
ConFoo Montreal - Approaches for application request throttling by Maarten Balliauw
ConFoo Montreal - Approaches for application request throttlingConFoo Montreal - Approaches for application request throttling
ConFoo Montreal - Approaches for application request throttling
Maarten Balliauw1.2K views
Microservices for building an IDE – The innards of JetBrains Rider - TechDays... by Maarten Balliauw
Microservices for building an IDE – The innards of JetBrains Rider - TechDays...Microservices for building an IDE – The innards of JetBrains Rider - TechDays...
Microservices for building an IDE – The innards of JetBrains Rider - TechDays...
Maarten Balliauw10.5K views
JetBrains Day Seoul - Exploring .NET’s memory management – a trip down memory... by Maarten Balliauw
JetBrains Day Seoul - Exploring .NET’s memory management – a trip down memory...JetBrains Day Seoul - Exploring .NET’s memory management – a trip down memory...
JetBrains Day Seoul - Exploring .NET’s memory management – a trip down memory...
Maarten Balliauw1.1K views
DotNetFest - Let’s refresh our memory! Memory management in .NET by Maarten Balliauw
DotNetFest - Let’s refresh our memory! Memory management in .NETDotNetFest - Let’s refresh our memory! Memory management in .NET
DotNetFest - Let’s refresh our memory! Memory management in .NET
Maarten Balliauw480 views
VISUG - Approaches for application request throttling by Maarten Balliauw
VISUG - Approaches for application request throttlingVISUG - Approaches for application request throttling
VISUG - Approaches for application request throttling
Maarten Balliauw817 views
ConFoo - Exploring .NET’s memory management – a trip down memory lane by Maarten Balliauw
ConFoo - Exploring .NET’s memory management – a trip down memory laneConFoo - Exploring .NET’s memory management – a trip down memory lane
ConFoo - Exploring .NET’s memory management – a trip down memory lane
Maarten Balliauw523 views

Recently uploaded

What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueShapeBlue
191 views23 slides
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlueShapeBlue
75 views23 slides
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...ShapeBlue
69 views29 slides
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...ShapeBlue
105 views15 slides
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...ShapeBlue
97 views28 slides
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...ShapeBlue
114 views12 slides

Recently uploaded(20)

What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue191 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue75 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue69 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue105 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue97 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue114 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty54 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue56 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 views
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue74 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue58 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu287 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue110 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE67 views
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue138 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue121 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 views

Sherlock Homepage - A detective story about running large web services - NDC Oslo

  • 1. Sherlock Homepage A detective story about running large web services. Maarten Balliauw @maartenballiauw
  • 3. Site unavailable! Primary website location unavailable! No problem: traffic manager in front – pfew! Secondary location unavailable! Website down… Initial investigation & monitoring showed: Primary & secondary website location instances all up Machines available individually Not through traffic manager and Azure load balancer
  • 4. The cause… Custom Azure load balancer probe Implementation (StatusService.cs) <LoadBalancerProbes> <LoadBalancerProbe name="HTTP" path="/api/status" protocol="http" port="80" /> </LoadBalancerProbes> return new HttpStatusCodeWithBodyResult(AvailabilityStatusCode(galleryServiceAvailable), String.Format(CultureInfo.InvariantCulture, StatusMessageFormat, AvailabilityMessage(galleryServiceAvailable), AvailabilityMessage(sqlAzureAvailable), AvailabilityMessage(storageAvailable), AvailabilityMessage(searchServiceAvailable), AvailabilityMessage(metricsServiceAvailable), HostMachine.Name));
  • 5. How did we find the issue? Quote from “Mind Hunter” (written by an FBI profiler): You have to be able to re-create the crime scene in your head. You need to know as much as you can about the victim so that you can imagine how she might have reacted. You have to be able to put yourself in her place as the attacker threatens her with a gun or a knife, a rock, his fists, or whatever. You have to be able to feel her fear as he approaches her. You have to be able to feel her pain. You have to try to imagine what she was going through when he tortured her. You have to understand what it’s like to scream in terror and agony, realizing that it won’t help, that it won’t get him to stop. You have to know what it was like. http://highscalability.com/blog/2015/7/30/how-debugging-is-like-hunting-serial-killers.html
  • 6. How did we find the issue? Debugging requires a particular sympathy for the machine. You must be able to run the machine and networks of machines in your mind while simulating what-ifs based on mere wisps of insight. Knowing the system you are working on – even by similarity Empathy for what is going on in that system A hunch based on prior experience / insights
  • 7. Sherlock Homepage A detective story about running large web services. Maarten Balliauw @maartenballiauw
  • 8. Who am I? Maarten Balliauw Antwerp, Belgium Software Engineer, Microsoft Founder, MyGet AZUG Focus on web ASP.NET MVC, Azure, SignalR, ... Former MVP Azure & ASPInsider Big passion: Azure http://blog.maartenballiauw.be @maartenballiauw Shameless self promotion: Pro NuGet - http://amzn.to/pronuget2
  • 10. A bit of history Site serves dependencies for .NET developers worldwide On average good for ~8.000.000 to ~10.000.000 requests per day (on compute) Built 4 years ago on top of a SQL database and OData services Monolithic – site + OData service are the same app Improvements over the years Some rough times Q2 2015
  • 11. Architecture overview (Q2 2015) Front end servers (2 regions) MVC + WCF OData Search Service Lucene based Web API Search Service (secondary region) Lucene based Web API Azure Storage Lucene index Download counts Azure SQL Database Metadata Download counts Jobs VMs Create index from database Create stats Create download count reports
  • 12. Did we solve the crime?
  • 13. One of the services caused this… SQL database? Storage? Search? Metrics service? return new HttpStatusCodeWithBodyResult(AvailabilityStatusCode(galleryServiceAvailable), String.Format(CultureInfo.InvariantCulture, StatusMessageFormat, AvailabilityMessage(galleryServiceAvailable), AvailabilityMessage(sqlAzureAvailable), AvailabilityMessage(storageAvailable), AvailabilityMessage(searchServiceAvailable), AvailabilityMessage(metricsServiceAvailable), HostMachine.Name));
  • 14. Log spelunking Check SQL database logs – we found we had none (fixed now) Storage – storage statistics seemed stable Search – no real pointers to issues there Metrics service – very lightweight and has been stable for months Start looking around at the crime scene! IIS logs, event viewer on web servers Profiling on web servers
  • 16. It could have been search... No real evidence though.
  • 18. Turns out it was search! Profiling the search service revealed some things! SearcherManager.cs checks Lucene index freshness on Get() – MaybeReopen() StartReopen() blocks access to the index until finished Part of HTTP request pipeline – blocking request handling This was fixed by getting these calls out of the HTTP request path. The suspect no longer had code available – used our informants www.jetbrains.com/dotpeek
  • 19. Search had some other flaws… Actually also visible in the dotTrace snapshot we just saw: high GC! Memory profiling the search service revealed some things! (I lost the actual traces)
  • 20. Search had some other flaws… High memory traffic on reading download count # for search ranking The source: DownloadLookup.cs#L18 Fixed by: Reusing objects (instead of new) JsonStreamReader instead of JObject.Parse(theWorld)
  • 21. In the meanwhile… Added additional monitoring Added additional tracing Started looking into using AppInsights for better insights into application behavior Events happening on the website Requests Exceptions Execution and dependency times (basic but continuous profiling)
  • 23. What we were seeing… On the V2-based feeds: 500 Internal Server Error during package restore Response time goes up while # of requests goes down EventVwr on servers: lots of IIS crashes Lots of crash dumps on web servers
  • 24. So IIS crashes… Could it be? HTTP.SYS tells us when things go wrong D:\Windows\System32\LogFiles\HTTPERR 2015-07-31 01:46:34 - 60810 - 80 HTTP/1.1 GET /api/v2/FindPackagesById()?id='...' - 1273337584 Connection_Abandoned_By_ReqQueue 86cd3cb1-729c-425c-898f-b15b0330bc38 Connection_Abandoned_By_ReqQueue “A worker process from the application pool has quit unexpectedly or orphaned a pending request by closing its handle. Specific to Windows Vista and Windows Server 2008.” Gift from the gods: crash dumps! C:\Resources\Directory\31edcaa5186f…...DiagnosticStore\WAD0104\CrashDumps
  • 26. An Exception crashes IIS? Time to crank up the search engine queries! Found a similar issue: unobserved task exceptions causing IIS to crash // If metrics service is specified we post the data to it asynchronously. if (_config != null && _config.MetricsServiceUri != null) { // Disable warning about not awaiting async calls // because we are _intentionally_ not awaiting this. #pragma warning disable 4014 Task.Run(() => PostDownloadStatistics(id, version, …)); #pragma warning restore 4014 }
  • 27. TaskScheduler.UnobservedTaskException += (object sender, UnobservedTaskExceptionEventArgs excArgs) => { // ... log it ... excArgs.SetObserved(); }; Tasks and fire-and-forget are evil! Unobserved task can cause the entire process to give up on Exception Handle unobserved task Exceptions!
  • 29. High response times on the web server
  • 30. What we were seeing… High response times on the servers Resulting in higher than normal CPU usage on the servers Azure would often auto-scale additional instances Profiling the web application still showed wait times with no obvious cause…
  • 32. Research Reading and searching on what could be the cause of these issues http://stackoverflow.com/questions/12304691/why-are-iis-threads-so-precious-as-compared-to-regular-clr-threads http://www.monitis.com/blog/2012/06/11/improving-asp-net-performance-part3-threading/ https://msdn.microsoft.com/en-us/library/ms998549.aspx http://blogs.msdn.com/b/tmarq/archive/2007/07/21/asp-net-thread-usage-on-iis-7-0-and-6-0.aspx https://support.microsoft.com/en-us/kb/821268 Consider minIoThreads and minWorkerThreads for Burst Load If your application experiences burst loads where there are prolonged periods of inactivity between the burst loads, the thread pool may not have enough time to reach the optimal level of threads. A burst load occurs when a large number of users connect to your application suddenly and at the same time. The minIoThreads and minWorkerThreads settings enable you to configure a minimum number of worker threads and I/O threads for load conditions.
  • 33. The result GUESS WHERE WE DID THE TWEAK… COMPARED TO LAST WEEK…
  • 34. Making it permanent Part of the cloud service startup script # Increase the number of available IIS threads for high performance applications # Uses the recommended values from http://msdn.microsoft.com/en-us/library/ms998549.aspx#scalenetchapt06_topic8 # Assumes running on two cores (medium instance on Azure) &$appcmd set config /commit:MACHINE -section:processModel -maxWorkerThreads:100 &$appcmd set config /commit:MACHINE -section:processModel -minWorkerThreads:50 &$appcmd set config /commit:MACHINE -section:processModel -minIoThreads:50 &$appcmd set config /commit:MACHINE -section:processModel -maxIoThreads:100 # Adjust the maximum number of connections per core for all IP addresses &$appcmd set config /commit:MACHINE -section:connectionManagement /+["address='*',maxconnection='240'"]
  • 36. What we were seeing… On the V2-based feeds: Package restore timeouts coming from the WCF OData service Occurs every 7-15 hours, fixes itself ~15 minutes later Extreme load times on Get(Id=,Version=) – probably the cause of these timeouts
  • 37. No easy way to reproduce… Happening only on production Observation after RDP-ing in: 100% CPU when it happens No way to profile continuously – AppInsights did show us the entry point Donut time again The thing we recently changed was minIOThreads and throughput The slow code path is FindPackagesById() Makes HTTP calls to search service What could this setting and HTTP calls have in common…
  • 38. HttpClient, Async and multithreading Interesting article benchmarking HttpClient in async and multithreading scenarios Async + HttpClient are not limited in terms of concurrency Many CPUs and threads? Many HttpClients and requests Many HttpClients and requests? Many TCP ports used on machine Many TCP ports used? TCP port depletion But aren’t ports reclaimed? 240 seconds TIME_WAIT (4 minutes) Users also use up TCP ports “As far as HTTP requests are concerned, a limit should always be set to ServicePointManager.DefaultConnectionLimit. The limit should be large enough to allow a good level of parallelism, but low enough to prevent performance and reliability problems (from the exhaustion of ephemeral ports).”
  • 39. Limiting HttpClient async concurrency Set ServicePointManager properties on startup Nagling – “bundle traffic in properly stuffed TCP packets” Expect100Continue – “only send out traffic if server says 100 Continue” Both optimizations also disabled // Tune ServicePointManager // (based on http://social.technet.microsoft.com/Forums/en-US/windowsazuredata/thread/d84ba34b-b0e0-4961-a167-bbe7618beb83 and https://msdn.microsoft.com/en-us/library/system.net.servicepointmanager.aspx) ServicePointManager.DefaultConnectionLimit = 500; ServicePointManager.UseNagleAlgorithm = false; ServicePointManager.Expect100Continue = false;
  • 42. What we were seeing… Massive memory usage! Even when changing VM sizes. 100% of memory on a Medium Azure instance 100% of memory on a Large Azure instance 100% of memory on a X-Large Azure instance
  • 43. What is eating this memory? Memory profiling! On the server? Try to reproduce it? Decided on the latter
  • 45. .NET Memory Management 101 Memory Allocation .NET runtime reserves a region of address space for every new process: the managed heap Objects are allocated in the heap Allocating memory is fast, it’s just adding a pointer Some unmanaged memory is also consumed (not GC-ed) .NET CLR, Dynamic libraries, Graphics buffer, … Memory Release or “Garbage Collection” (GC) Generations Large Object Heap
  • 46. .NET Memory Management 101 Memory Allocation Memory Release or “Garbage Collection” (GC) GC releases objects no longer in use by examining application roots GC builds a graph that contains all the objects that are reachable from these roots. Object unreachable? GC removes the object from the heap, releasing memory After the object is removed, GC compacts reachable objects in memory. Generations Large Object Heap
  • 47. .NET Memory Management 101 Memory Allocation Memory Release or “Garbage Collection” (GC) Generations Managed heap divided in segments: generation 0, 1 and 2 New objects go into Gen 0 Gen 0 full? Perform GC and promote all reachable objects to Gen 1. This is typically pretty fast. Gen 1 full? Perform GC on Gen 1 and Gen 0. Promote all reachable objects to Gen 2. Gen 2 full? Perform full GC (2, 1, 0). If not enough memory for new allocations, throws OutOfMemoryException Full GC has performance impact since all objects in managed heap are verified. Large Object Heap
  • 48. .NET Memory Management 101 Memory Allocation Memory Release or “Garbage Collection” (GC) Generations Large Object Heap Generation 0 Generation 1 Generation 2 Short-lived objects (e.g. Local variables) In-between objects Long-lived objects (e.g. App’s main form)
  • 49. .NET Memory Management 101 Memory Allocation Memory Release or “Garbage Collection” (GC) Generations Large Object Heap Large objects (>85KB) stored in separate segment of managed heap: Large Object Heap (LOH) Objects in LOH collected only during full garbage collection Surviving objects in LOH are not compacted (by default). This means that LOH becomes fragmented over time. Fragmentation can cause OutOfMemoryException
  • 50. The .NET garbage collector Simulates “infinite memory” by removing objects no longer needed When does it run? Vague… But usually: Out of memory condition – when the system fails to allocate or re-allocate memory After some significant allocation – if X memory is allocated since previous GC Failure of allocating some native resources – internal to .NET Profiler – when triggered from profiler API Forced – when calling methods on System.GC Application moves to background GC is not guaranteed to run http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047586.aspx http://blogs.msdn.com/b/abhinaba/archive/2008/04/29/when-does-the-net-compact-framework-garbage-collector-run.aspx
  • 52. So our DI container? NInject? Our memory profiling confirms it. The retained EntitiesContext also retains entities and SQL connections. Spelunking the NInject source code, we found the GarbageCollectionCachePruner responsible for releasing objects. Runs every 30 seconds (timer) Releases objects only if GC happened in that time GC is not guaranteed to run, so NInject potentially never releases objects Known, old bug. https://groups.google.com/forum/#!topic/ninject/PQNMIsQhCvE http://stackoverflow.com/questions/16775362/ninject-caching-object-that-should-be-disposed-memoryleak
  • 53. Replacing our DI container (Autofac) Perform replacement Run same analysis on new codebase and verify objects are freed Once deployed: Immediate drop in response times Memory usage now stable at ~4 GB
  • 55. Conclusion Debugging requires a particular sympathy for the machine. You must be able to run the machine and networks of machines in your mind while simulating what-ifs based on mere wisps of insight. Bugs hide. They blend in. They can pass for "normal" which makes them tough to find. One bug off the streets doesn’t mean all of them are gone. Sometimes one gone exposes another. Know your system, know your tools, know your options. Look for evidence. Profilers (performance and memory), dump files, AppInsights and others Dive in. It builds experience and makes solving the next crime scene easier. https://msdn.microsoft.com/en-us/library/ee817663.aspx
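
Slides 26 and 27 describe a fire-and-forget `Task.Run` whose unobserved exception took down IIS, and recommend handling unobserved task exceptions. Besides the global `TaskScheduler.UnobservedTaskException` handler shown on slide 27, one option is to catch and log inside the background task itself, so the exception never goes unobserved. The following is a minimal sketch of that idea; `FireAndForget`, `Run` and the `log` callback are hypothetical names for illustration and do not come from the gallery code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not from the talk): runs work in the background
// without the caller awaiting it, but catches and logs any exception
// inside the task so that it never becomes an unobserved task exception.
public static class FireAndForget
{
    public static Task Run(Func<Task> work, Action<Exception> log)
    {
        return Task.Run(async () =>
        {
            try
            {
                await work().ConfigureAwait(false);
            }
            catch (Exception ex)
            {
                // The exception is observed here; the returned task
                // completes successfully instead of faulting.
                log(ex);
            }
        });
    }
}
```

A caller would pass the statistics-posting call from slide 26 as `work` and a tracing call as `log`, and can still ignore the returned task. The global handler remains useful as a safety net for tasks that do not go through such a helper.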

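Slide 18 says the search fix was to get the index freshness check and reopen out of the HTTP request path. One way to do that is to let a background timer perform the expensive reopen and have requests read whatever instance is current. The sketch below assumes that approach; `BackgroundRefreshed<T>` is a hypothetical stand-in and not the actual `SearcherManager` change.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a searcher holder (not the real SearcherManager):
// the expensive open/reopen runs on a timer, requests only read a reference.
public sealed class BackgroundRefreshed<T> : IDisposable where T : class
{
    private readonly Func<T> _open;   // expensive, e.g. reopening an index
    private readonly Timer _timer;
    private T _current;
    private int _refreshing;          // 1 while a refresh is in progress

    public BackgroundRefreshed(Func<T> open, TimeSpan interval)
    {
        _open = open;
        _current = open();            // open once so Get() never returns null
        _timer = new Timer(_ => Refresh(), null, interval, interval);
    }

    // Request path: a single volatile read that never waits for a reopen.
    public T Get()
    {
        return Volatile.Read(ref _current);
    }

    // Background path: swaps in a fresh instance; overlapping runs are skipped.
    public void Refresh()
    {
        if (Interlocked.CompareExchange(ref _refreshing, 1, 0) != 0) return;
        try
        {
            Volatile.Write(ref _current, _open());
        }
        finally
        {
            Volatile.Write(ref _refreshing, 0);
        }
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

The trade-off is that requests may see a slightly stale instance between refreshes. This sketch also does not dispose the replaced instance; a real index searcher would need reference counting or a similar scheme before it can be closed safely.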
Editor's Notes

  1. One of these services was failing – resulting in status 500 – resulting in LB removing the instance from pool – all were removed from pool…
  2. Bugs are like psychopaths in the societal machine. They hide. They blend in. They can pass for "normal" which makes them tough to find. They attack weakness causing untold damage until caught. And they will keep causing damage until caught. They are always hunting for opportunity.
  3. Yes it’s fun to work on new things. But try figuring out this stuff. Builds experience.
  4. Profiling the search service issue - search service issue * Open capture1.dtt * Notice lots of lock contention * Disable system methods - notice SearcherManager.Get() is top suspect * Unfortunately, this code was deployed without symbols and we no longer had the sources around * Decided to decompile using dotPeek and walk through code - which revealed something (next slide)
  5. Profiling the search service issue - search service issue * Open capture1.dtt * Notice lots of lock contention * Disable system methods - notice SearcherManager.Get() is top suspect * Unfortunately, this code was deployed without symbols and we no longer had the sources around * Decided to decompile using dotPeek and walk through code - which revealed something (next slide)
  6. Perhaps do a quick tour of AppInsights depending on time left.
  7. Analyzing a crash dump * Open w3wp.exe.3344.dmp - explain this can be done with hardcore debugging tools but VS2015 works just as fine * Explain dump summary - what we can see (threads, modules, last state when it all broke down) * Becomes more interesting when we debug it - "Debug with managed only" * Unhandled Exception, it says! That's probably the cause for IIS erroring out. But where does it come from... * We can try the various windows for inspecting modules, threads, call stacks, ... but not a lot to see in there. * We need DEBUGGER SYMBOLS! Unfortunately in this case the gallery was deployed without symbols. No symbols on the build server either, for this deployment. * DotPeek! Load the actual DLL, enable symbol server, start debugging again. * See the exception now shows us the (decompiled) code. Not the real code but it does give us an idea. * Unhandled exception in a task. Which crashes out IIS.
  8. Reproducing production traffic * We'll use jMeter - a tool that is perfect for replaying web server logs against other servers * Explain IIS logs come from production * Explain rconvlog tool to translate them to NCSA format * Open up jMeter and explain how it is all linked to each other
  9. Application roots: Typically, these are global and static object pointers, local variables, and CPU registers.
  10. Application roots: Typically, these are global and static object pointers, local variables, and CPU registers.
  11. Application roots: Typically, these are global and static object pointers, local variables, and CPU registers.
  12. Application roots: Typically, these are global and static object pointers, local variables, and CPU registers.
  13. Application roots: Typically, these are global and static object pointers, local variables, and CPU registers.
  14. Analyzing memory usage * Open the Workspace [2015-08-05] [13-24].dmw workspace * From the snapshots overview we already see a few interesting things: memory keeps going up, when GC'ed (using the profiler API) memory stays around * Let's open one of the snapshots - snapshot #6 - why? As it says objects have been collected so we want to see what remained in memory * From the overview: lots of objects on gen2 - meaning they survived many collections - dive in * Seems NInject.Activation.Caching.Cache is keeping a lot of bytes in memory - dive in * Lots of objects seem to be cached by NInject. Check outgoing references to see what they are. * The first ones look normal, e.g. the controller activator and so on is needed the entire time by ASP.NET MVC * At [2] we see an EntitiesContext retained. That's weird: it should be disposed of after each request. Let's see if there are more of these still in memory. * Snapshot - largest size - search entitiescontext - A LOT! Also in size relative to the snapshot size - dive in * We can see 991 instances are kept around - who's holding on to them? * Group by similar retention shows NInject for 596 of them. Wow.
  15. Yes it’s fun to work on new things. But try
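
Slide 52 and note 14 explain why the cached objects were retained: the pruner runs on a timer but releases objects only if a garbage collection happened since its last run, and a GC is not guaranteed to run. The sketch below illustrates that pattern in isolation. `GcGatedCache` is a hypothetical class written for this illustration, it is not NInject's code, and it uses `GC.CollectionCount` as an assumed way to detect that a collection took place.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a GC-gated cache (not NInject's code).
// Not thread-safe; a real pruner would be called from a timer thread.
public sealed class GcGatedCache
{
    private readonly List<object> _entries = new List<object>();
    private int _lastSeenGcCount = GC.CollectionCount(0);

    public int Count
    {
        get { return _entries.Count; }
    }

    public void Add(object instance)
    {
        _entries.Add(instance);
    }

    // Releases entries only if at least one garbage collection has
    // happened since the previous call; otherwise everything stays cached.
    public bool Prune()
    {
        int gcCount = GC.CollectionCount(0);
        if (gcCount == _lastSeenGcCount)
        {
            return false; // no GC observed, so nothing is released
        }
        _lastSeenGcCount = gcCount;
        _entries.Clear();
        return true;
    }
}
```

If no collection occurs between two calls, `Prune()` releases nothing, and the cache keeps every entry (and whatever those entries reference) alive. That matches the retention of `EntitiesContext` instances described in note 14.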