.NET Fest 2019. Leonid Molotiievskyi. .NET Core in production

Talk topic
KYIV 2019
.NET Core in production
By Leonid Molotiievskyi
.NET CONFERENCE #1 IN UKRAINE

About me
• Hands-on software architect and technology
consultant
• Good at splitting a monolith to microservices
• Built a huge enterprise financial solution from scratch
• A technical person who believes that the right people
decisions matter more than technology decisions
• Speaker and mentor

Spoilers about what we are going to talk about
Agenda
Context overview
Environment that we used to live with
Scaling
How did we scale our services?

Hell for the DevOps team
Are we solving the right problem?
Useful advice
The things that can help you to resolve
the problem
Lessons learned
How can we benefit in the
future?
Q&A
Questions and answers

Context overview
Several statements about the project

Context overview
• Financial domain
• 25+ microservices
• Team 70+ people
• 20+ environments
• Three versions in support, one in development

Solution overview: managing workflows

Scaling
How did we scale our services?

Solution?
- And a set of dummy queues appears, left over after scale-down/redeploy

Gateway: infinite redirect
- Where do we store the keys for cookies?
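
The slide only asks the question, so the following is a minimal sketch under an assumption: each gateway replica generates its own cookie-protection key, so a cookie issued by one replica is rejected by another and the user is sent back to login again and again. The code is plain Python rather than the ASP.NET Core API, and `GatewayInstance`, `sign` and `verify` are hypothetical names used only for illustration.

```python
import hashlib
import hmac


def sign(value: str, key: bytes) -> str:
    """Return the value plus an HMAC tag (a stand-in for a protected auth cookie)."""
    tag = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{tag}"


def verify(cookie: str, key: bytes) -> bool:
    """Check that the cookie was produced with the given key."""
    value, _, tag = cookie.rpartition(".")
    expected = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)


class GatewayInstance:
    """Hypothetical gateway replica that holds one cookie key."""

    def __init__(self, key: bytes):
        self.key = key

    def login(self, user: str) -> str:
        return sign(user, self.key)

    def handle(self, cookie: str) -> str:
        # An unreadable cookie looks like "not logged in": redirect to login.
        return "200 OK" if verify(cookie, self.key) else "302 /login"


# Replicas with private keys: the load balancer may route to either one.
a, b = GatewayInstance(b"key-a"), GatewayInstance(b"key-b")
cookie = a.login("alice")
print(a.handle(cookie), "|", b.handle(cookie))

# Replicas reading the key from one shared store accept each other's cookies.
shared = b"shared-key"
c, d = GatewayInstance(shared), GatewayInstance(shared)
print(d.handle(c.login("alice")))
```

Under this assumption, the fix is to keep the keys in storage that every replica can read, so the answer to the question on the slide becomes "somewhere outside the pod".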

Gateway: infinite redirect solution

Hell for the DevOps team
What technology decisions helped us to survive

Each morning…
• Dev/Staging/Prod cluster is down
• RabbitMQ/Mongo/Consul/Prometheus is not
operational
• The firefighting team is on duty

Queues are growing… - 1
• “The TTL is too small”, or is it something else?

Queues are growing… - 2
• A queue has a set of consumers
• Service A consumes the message
• Service A starts processing the message
• The consumer's health check fails due to high load on
service A, a network issue, an OOM kill, etc.
• A duplicate of the message appears in the queue
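
The slides do not show how the team dealt with the duplicates. One common mitigation, sketched here under that caveat, is to make the consumer idempotent so that a redelivered message is acknowledged without being processed twice. `IdempotentConsumer` and its in-memory set are hypothetical; with several replicas the set of seen IDs would have to live in a shared store.

```python
class IdempotentConsumer:
    """Wraps a handler and skips messages whose ID was already processed."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory; use a shared store across replicas

    def on_message(self, message_id, body):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can still be retried on redelivery.
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
consumer.on_message("m-1", "transfer 10 EUR")
consumer.on_message("m-1", "transfer 10 EUR")  # redelivered copy is ignored
print(processed)
```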

OOM Killed issue
• .NET Core 2.2 doesn’t respect Docker limits:
https://github.com/aspnet/AspNetCore/issues/3409
https://github.com/dotnet/coreclr/issues/18971
• “Server GC was designed with the assumption that
the process using Server GC is the dominant process
on the machine. By default it uses as many heaps as
there are # of processors on the machine.”
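
To see why “one heap per processor” hurts inside a container, compare the host's processor count with what the container's CPU quota actually allows. The sketch below is an illustration only and not the CLR's algorithm: `effective_cpus` is a hypothetical helper, the numbers are made up, and the two quota arguments correspond to the cgroup v1 values `cpu.cfs_quota_us` and `cpu.cfs_period_us`.

```python
import math


def effective_cpus(host_cpus: int, cfs_quota_us: int, cfs_period_us: int) -> int:
    """CPUs a container can really use under a cgroup v1 CFS quota."""
    if cfs_quota_us <= 0 or cfs_period_us <= 0:
        return host_cpus  # a quota of -1 means "no limit"
    allowed = math.ceil(cfs_quota_us / cfs_period_us)
    return max(1, min(host_cpus, allowed))


# Illustrative numbers: a 32-core worker node, a pod limited to 2 CPUs.
host = 32
limited = effective_cpus(host, cfs_quota_us=200_000, cfs_period_us=100_000)
print(f"heaps if the runtime counts host processors: {host}")
print(f"heaps if the runtime respects the container limit: {limited}")
```

A runtime that sizes itself from the first number inside a pod limited to the second reserves far more memory than the pod's limit allows, which fits the OOM kills described on this slide.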

Let’s fix the issue by upgrading to .NET Core 3.0?
https://github.com/mongodb/mongo-csharp-driver/pull/372/files

Socket file descriptor leak in HttpClient

Docker: no space left on device
level=info msg="[8] System error: write /sys/fs/cgroup/docker/01f5670fbee1f6687f58f3a943b1e1bdaec2630197fa4da1b19cc3db7e3d3883/cgroup.procs: no space left on device"

Useful advice
What can prevent nasty situations

What can help you to find them?
Configure monitoring to track:
• Memory consumption
• CPU consumption
• Number of threads on worker node
• Number of open socket descriptors per node/pod
• Connection refused errors
• Correlation Ids in logs
• Number of messages in queues
• Number of consumers for queues
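
As one example from the list above, the number of open descriptors of a process can be read on Linux by counting the entries in `/proc/<pid>/fd`. This is a minimal sketch, not the monitoring setup from the talk; `count_open_fds`, `fd_alert` and the 80% threshold are assumptions made for illustration.

```python
import os


def count_open_fds(fd_dir: str = "/proc/self/fd") -> int:
    """Count the entries in a Linux /proc/<pid>/fd directory."""
    return len(os.listdir(fd_dir))


def fd_alert(open_fds: int, soft_limit: int, ratio: float = 0.8) -> bool:
    """Return True once usage reaches the given share of the descriptor limit."""
    return open_fds >= soft_limit * ratio


if os.path.isdir("/proc/self/fd"):  # Linux only
    print("open descriptors:", count_open_fds())
print("alert at 900 of 1024:", fd_alert(900, 1024))
```

Tracked over time, a count that only ever grows is the signature of a descriptor leak like the HttpClient one mentioned earlier.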

Use the standard health check middleware

Set up the environment so that…
• Infrastructure services have an HA setup
• At least two instances of each service are deployed
• Monitoring and alerting are in place
• “Temporary data” disappears after
redeployment
• Nothing is configured manually
Lessons learned
What we got from it

Lessons learned
• The “do it as simply as possible” principle
doesn’t work. “Do it the smart way” works
• Think about application scaling from the
beginning
• Know the open issues in your target
framework
• Do not blame the DevOps team; help them
find the cause
