KYIV 2019
.Net Core in production
By Leonid Molotiievskyi
.NET CONFERENCE #1 IN UKRAINE
About me
• Hands-on software architect and technology consultant
• Good at splitting a monolith into microservices
• Built a huge enterprise financial solution from scratch
• Technical guy who believes that the right people decisions are more important than technology ones
• Speaker and mentor
Agenda
Spoilers: what we are going to talk about
Context overview
Environment that we used to live with
Scaling
How did we scale our services?
Hell for the DevOps team
Are we solving the right problem?
Useful advice
Things that can help you resolve the problem
Lessons learned
How can we benefit in the future?
Q&A
Questions and answers
Context overview
Several statements about the project
Context overview
• Financial domain
• 25+ microservices
• Team 70+ people
• 20+ environments
• Three versions in support, one in development
Solution overview: managing workflows
Scaling
How did we scale our services?
Notification service
Solution?
- And a set of dummy queues appears, left behind after scale-down/redeploy
Gateway: infinite redirect
- Where do we store keys for cookies?
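A minimal sketch of why the key location matters, written in Python purely for illustration (the talk's own stack is .NET, and the slide does not show code). It assumes a hypothetical `GatewayInstance` that signs an auth cookie with an HMAC key: when every replica generates its own key, a cookie issued by one replica fails validation on another and the user is redirected to login again; when replicas load the same key from shared storage, the cookie is accepted.

```python
import hashlib
import hmac
import os


class GatewayInstance:
    """Hypothetical gateway replica that signs and validates an auth cookie."""

    def __init__(self, key: bytes):
        self.key = key

    def _sign(self, user: str) -> str:
        return hmac.new(self.key, user.encode(), hashlib.sha256).hexdigest()

    def issue_cookie(self, user: str) -> str:
        return f"{user}.{self._sign(user)}"

    def handle(self, cookie: str) -> str:
        user, _, signature = cookie.rpartition(".")
        if hmac.compare_digest(signature, self._sign(user)):
            return "200 OK"
        return "302 redirect to login"


# Each replica generates its own key: a cookie issued by A is rejected by B.
a, b = GatewayInstance(os.urandom(32)), GatewayInstance(os.urandom(32))
per_instance = b.handle(a.issue_cookie("alice"))

# Both replicas load the same key from shared storage: the cookie is accepted.
shared_key = os.urandom(32)
c, d = GatewayInstance(shared_key), GatewayInstance(shared_key)
shared = d.handle(c.issue_cookie("alice"))

print(per_instance)
print(shared)
```

Behind a load balancer the first case repeats on every request that lands on a different replica, which is one plausible way to end up in a redirect loop.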
Gateway: infinite redirect solution
Hell for the DevOps team
What technology decisions helped us to survive
Each morning…
• The Dev/Staging/Prod cluster is down
• RabbitMQ/Mongo/Consul/Prometheus is not operational
• The firefighting team is on duty
Greedy service
Queues are growing… - 1
• “The TTL is too small”, or something else?
Queues are growing… - 2
• A queue has a set of consumers
• Service A consumes the message
• Service A starts processing the message
• The health check of the consumer fails due to high load on service A, a network issue, an OOM kill, etc.
• A duplicate message appears in the queue
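The sequence above can be sketched as a small simulation. This is an illustrative Python model, not the project's code: `run`, `crash_before_ack`, and `deduplicate` are hypothetical names, and the in-memory `seen` set stands in for whatever store a real consumer would use. It shows that a message whose first delivery is never acknowledged gets processed twice, and that a consumer tracking message ids can skip the redelivery.

```python
from collections import deque


def run(messages, crash_before_ack, deduplicate):
    """Simulate at-least-once delivery; return the ids in processing order."""
    queue = deque(messages)
    crashes = set(crash_before_ack)  # ids whose first delivery is never acked
    seen = set()
    processed = []
    while queue:
        msg_id = queue.popleft()
        if deduplicate and msg_id in seen:
            continue  # already handled: acknowledge and drop the redelivery
        processed.append(msg_id)  # the side effect happens here
        seen.add(msg_id)
        if msg_id in crashes:
            crashes.discard(msg_id)
            queue.append(msg_id)  # no ack arrived, so the broker requeues it
    return processed


without_guard = run(["m1", "m2"], crash_before_ack=["m1"], deduplicate=False)
with_guard = run(["m1", "m2"], crash_before_ack=["m1"], deduplicate=True)

print(without_guard)  # ['m1', 'm2', 'm1']
print(with_guard)     # ['m1', 'm2']
```

Whether deduplication is the right remedy depends on the service; the sketch only makes the duplicate visible.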
OOM Killed issue
• .NET Core 2.2 doesn’t respect Docker limits:
https://github.com/aspnet/AspNetCore/issues/3409
https://github.com/dotnet/coreclr/issues/18971
• “Server GC was designed with the assumption that the process using Server GC is the dominant process on the machine. By default it uses as many heaps as there are # of processors on the machine.”
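To make the mismatch concrete, here is a small arithmetic sketch in Python (illustrative only; the node size and quota values are made up). It assumes a cgroup v1 CFS limit expressed as `cpu.cfs_quota_us` / `cpu.cfs_period_us`, where a quota of -1 means no limit, and compares the CPUs a container may actually use with the processor count of the host, which is what the quoted heap-per-processor default is based on.

```python
import math


def effective_cpus(quota_us: int, period_us: int, host_cpus: int) -> int:
    """CPUs a container may actually use under a cgroup v1 CFS quota."""
    if quota_us <= 0 or period_us <= 0:
        return host_cpus  # a quota of -1 means "no limit"
    return max(1, min(host_cpus, math.ceil(quota_us / period_us)))


host_cpus = 32                                       # hypothetical worker node
limit = effective_cpus(200_000, 100_000, host_cpus)  # container limited to 2 CPUs

print(f"heaps sized for the host: {host_cpus}")
print(f"CPUs the container can use: {limit}")
```

With these example numbers a runtime that sizes itself for the host would plan for 32 heaps inside a container that is allowed 2 CPUs.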
Let’s fix the issue by upgrading to .NET Core 3.0?
https://github.com/mongodb/mongo-csharp-driver/pull/372/files
Socket file descriptor leak in HttpClient
Docker: no space left on device
level=info msg="[8] System error: write /sys/fs/cgroup/docker/01f5670fbee1f6687f58f3a943b1e1bdaec2630197fa4da1b19cc3db7e3d3883/cgroup.procs: no space left on device"
Reason:
Prometheus is down
Useful advice
What can prevent nasty situations
What can help you find them?
Configure monitoring to track:
• Memory consumption
• CPU consumption
• Number of threads on worker node
• Number of open socket descriptors per node/pod
• Connection refused errors
• Correlation Ids in logs
• Number of messages in queues
• Number of consumers for queues
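As one example of the metrics above, the number of open descriptors of a process can be read on Linux by counting the entries under `/proc/<pid>/fd`. The following Python sketch is illustrative only: `count_open_fds` and `over_threshold` are hypothetical helper names, the 0.8 ratio is an arbitrary example, and a real setup would more likely export this value through the existing monitoring stack.

```python
import os


def count_open_fds(pid="self", proc_root="/proc"):
    """Return the number of open file descriptors of a process, or None if unavailable."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        return None  # not Linux, no such process, or no permission


def over_threshold(count, limit, ratio=0.8):
    """True when the descriptor count has reached `ratio` of the allowed limit."""
    return count is not None and count >= limit * ratio


current = count_open_fds()
print("open descriptors:", current)
print("alert:", over_threshold(current, limit=1024))
```

A count that only ever grows between deployments is the kind of signal that points to a descriptor leak.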
Use the standard health check middleware
Set up the environment so that…
• Infrastructure services have an HA setup
• At least two instances of each service are deployed
• Monitoring and alerting are in place
• “Temporary data” disappears after redeployment
• Nothing is configured manually
Lessons learned
What we got from it
Lessons learned
• The “do it as simple as possible” principle doesn’t work; “do it in a smart way” does
• Think about application scaling from the beginning
• Know the open issues in your target framework
• Do not blame the DevOps team; help them find the root cause
Follow me @lmolotii on
Q&A
THANKS FOR WATCHING!!!

.NET Fest 2019. Leonid Molotiievskyi. DotNet Core in production
