.NET Fest 2019. Leonid Molotiievskyi. DotNet Core in production

In this talk I will share the experience we gained running microservices in a production K8s cluster. I will also outline the main problems our team ran into while diagnosing them and, most importantly, what we did to avoid them in the future. I will answer two questions: Why did we migrate to the cloud? Why did .NET Core 2.2 cause so many problems? This talk can save hundreds of hours for your developers and for your DevOps team, whose life can otherwise resemble a nightmare.

Published in: Education

  1. KYIV 2019. .NET Core in production. By Leonid Molotiievskyi. .NET CONFERENCE #1 IN UKRAINE
  2. About me • Hands-on software architect and technology consultant • Good at splitting a monolith into microservices • Built a huge enterprise financial solution from scratch • Technical guy who believes that the right people decisions are more important than technological ones • Speaker and mentor
  3. Agenda: spoilers about what we are going to talk about • Context overview: the environment that we used to live with • Scaling: how did we scale our services?
  4. Hell for the DevOps team: do we solve the right problem? • Useful advice: the things that can help you to resolve the problem • Lessons learned: how can we benefit in the future? • Q&A: questions and answers
  5. Context overview: several statements about the project
  6. Context overview • Financial domain • 25+ microservices • Team of 70+ people • 20+ environments • Three versions in support, one in development
  7. Solution overview: managing workflows
  8. Scaling: how did we scale our services?
  9. Notification service
  10. Solution? - And a set of dummy queues appears, left over after descale/redeploy
  11. Gateway: infinite redirect - Where do we store keys for cookies?
  12. Gateway: infinite redirect solution
  13. Hell for the DevOps team: what technology decisions helped us to survive
  14. Each morning… • Dev/Staging/Prod cluster is down • RabbitMQ/Mongo/Consul/Prometheus is not operational • The fire-fighter team is on duty
  15. Greedy service
  16. Queues are growing… - 1 • “TTL time is too small” or?
  17. Queues are growing… - 2 • A queue has a set of consumers • Service A consumes the message • Service A starts processing the message • The health check of the consumer fails due to high load on service A, a network issue, an OOM kill, etc. • A duplicate message appears in the queue
  18. OOM Killed issue • .NET Core 2.2 doesn’t respect Docker limits: https://github.com/aspnet/AspNetCore/issues/3409 https://github.com/dotnet/coreclr/issues/18971 • “Server GC was designed with the assumption that the process using Server GC is the dominant process on the machine. By default it uses as many heaps as there are # of processors on the machine.”
  19. Let’s fix the issue by upgrading to .NET Core 3.0? https://github.com/mongodb/mongo-csharp-driver/pull/372/files
  20. Socket file descriptor leak in HttpClient
  21. Docker: no space left on device level=info msg="[8] System error: write /sys/fs/cgroup/docker/01f5670fbee1f6687f58f3a943b1e1bdaec2630197fa4da1b19cc3db7e3d3883/cgroup.procs: no space left on device"
  22. Reason:
  23. Prometheus is down
  24. Useful advice: what can prevent nasty situations
  25. What can help you to find them? Configure monitoring to track: • Memory consumption • CPU consumption • Number of threads on each worker node • Number of open socket descriptors per node/pod • Connection refused errors • Correlation IDs in logs • Number of messages in queues • Number of consumers for queues
  26. Use the standard health check middleware
  27. Set up the environment so that… • Infrastructure services have an HA setup • At least two instances of each service are deployed • Monitoring and alerting are set up • “Temporary data” is sure to disappear after redeployment • Nothing is configured manually
  28. Lessons learned: what we get from it
  29. Lessons learned • The “do it as simply as possible” principle doesn’t work; “do it the smart way” does • Think about application scaling from the beginning • Know about open issues in your target framework • Do not blame the DevOps team; try to help them find out what the reason is
  30. Follow me: @lmolotii • Q&A
  31. THANKS FOR WATCHING!
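
Slide 17 describes how a message is redelivered when a consumer's health check fails in the middle of processing. The deck does not say how the team handled the resulting duplicates, so the following is only one common mitigation, sketched in Python rather than in the .NET stack the talk uses: an idempotent consumer that skips message IDs it has already processed. The class name, the `id` field, and the in-memory set are assumptions made for illustration; with several service instances the set of seen IDs would have to live in a shared store.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class IdempotentConsumer:
    """Hypothetical wrapper that runs a handler at most once per message ID."""

    handler: Callable[[dict], None]
    # Assumption: an in-memory set stands in for a shared, durable store.
    seen_ids: Set[str] = field(default_factory=set)

    def on_message(self, message: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self.seen_ids:
            # Redelivered copy: skip the work; the caller can still acknowledge it.
            return False
        self.handler(message)
        # Record the ID only after the handler succeeded, so a crash mid-processing
        # still lets the broker's redelivery be handled.
        self.seen_ids.add(message_id)
        return True
```

Recording the ID after the handler returns trades a possible second execution after a crash for never losing a message; the handler itself should therefore tolerate being re-run.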
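
Slides 20 and 25 mention a socket file descriptor leak and recommend tracking the number of open socket descriptors per node/pod. As a rough illustration of what such a metric measures, the Python sketch below counts the socket descriptors of a process by reading `/proc/<pid>/fd`. It assumes a Linux host, and the function name is hypothetical; it is not the monitoring setup used in the talk.

```python
import os
import socket


def count_socket_fds(pid: str = "self") -> int:
    """Count open socket descriptors of a process via /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink; sockets resolve to "socket:[inode]".
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and reading it.
            continue
        if target.startswith("socket:"):
            count += 1
    return count


if __name__ == "__main__":
    before = count_socket_fds()
    leaked = socket.socket()  # stays open until closed explicitly
    print("sockets opened since start:", count_socket_fds() - before)
    leaked.close()
```

A value that only ever grows between deployments is the kind of signal that points to a descriptor leak.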
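
Slides 16, 17 and 25 talk about growing queues and suggest monitoring the number of messages and the number of consumers per queue. The sketch below shows one possible alert predicate over those two metrics, in Python. The function name, the window of three samples, and the rule itself are assumptions chosen for illustration, not thresholds taken from the talk.

```python
from typing import Sequence


def queue_needs_attention(depths: Sequence[int], consumers: int, window: int = 3) -> bool:
    """Hypothetical alert rule over sampled queue depths (oldest first).

    Fires when messages are waiting and nobody consumes them, or when the
    depth has strictly increased over the last `window` sampling intervals.
    """
    if consumers == 0 and depths and depths[-1] > 0:
        return True
    recent = list(depths[-(window + 1):])
    if len(recent) < window + 1:
        # Not enough samples yet to call it a trend.
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

For example, `queue_needs_attention([10, 20, 35, 60], consumers=2)` fires because the depth grew in each of the last three intervals, while a queue that dips in between does not.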
