
OSDC 2018 | From Monolith to Microservices by Paul Puschmann_


Scaling up from two developer teams supporting a monolith to more than 20 developer teams powering a micro-service landscape is not only a matter of technical excellence but also the matter of culture and collaboration. This talk will show the positive aspects of our evolution as well as the things we learned to improve on.


  1. 1. From Monolith to Microservices Paul Puschmann OSDC 2018, Berlin
  2. 2. Our history
  3. 3. Details REWE GROUP: Turnover >54 bn | History >90 years | Employees >330,000 | Industries: Food Retail, Tourism, DIY | Shops >15,000
  4. 4. 4 What do we actually run?
  5. 5. Our history at REWE Digital
  6. 6. ● Have a good platform / software architecture ● Scale the application ● Enable fast delivery of features, accelerate the business Main Goals
  7. 7. Our history at REWE Digital (# Services / # Teams, per year): 2014: 1 / 2, 2015: 40 / 15, 2016: 100 / 28, 2017: 150 / 51, 2018: 200 / 51. This is an approximation...
  8. 8. ● Take care of our “Managed-Hosting” ● Re-automate an already existing PROD-environment with Ansible ● Keep everything running ● Support our developers ● Pager-Duty What did the Ops-people do?
  9. 9. ● Integration of new features — difficult ● Deployments every two weeks — slow ● Deployments eventually took up to 1 h — slow ● “Everything” in the monolith (plus databases) — dependency constraints Status Quo of the Monolith, 2014 / beginning of 2015
  10. 10. Wishes of the stakeholders: ● Features, features, features ● Application must not break Wishes of the developers: ● We want to code, test, deploy Wishes of Ops: ● Really!? Wishes, beginning of 2015
  11. 11. ● New features aren’t built into the monolith anymore, but as separate applications ● There are strong guidelines regarding - API - Monitoring interfaces - Logging Beginning of 2015 The Plan
  12. 12. ● While building new functionality into microservices, existing features were extracted from the monolith (scoop out) to allow faster, independent development of features ● This should soon remove all business logic (BL) from the monolith Continued… The Plan
  13. 13. The pressure of soon having a perfectly working runtime environment was quite high. Pragmatic decision: poor man’s microservices ● @devs: please package your app in a .deb-package ● we do the rest via Ansible and HAProxy Bring this to life, then move on Containers? Yes, but no. How should we manage all those new applications?
  14. 14. Our solution Containers
  15. 15. Should we use the early versions of Kubernetes or Mesosphere Marathon? No. We wanted to have an environment we were able to understand, automate and manage. So we created a custom Docker-environment with Docker & Swarm. Containers? Yes, but how? How should we manage all those containers?
  16. 16. ● Debian machines (VMs & metal) ● Docker-CE ● Docker Swarm (not swarm-mode) ● Consul ● Consul-template ● Dnsmasq ● Nginx ● Deployment with Ansible ● Secrets managed by “Ops” - sorry, no HashiCorp Vault, yet Our solution consists of…
  17. 17. We say: No. Because we created a solution we were capable of running and that fits our needs. “Best Practices” don’t work for everybody. Only downsides: • Docker Swarm isn’t good at “deploy spread” • We have no orchestration service that ensures our containers are running fine and in the right place Reinventing the wheel?
  18. 18. ● Blue / Green deployment ● Semi-automatic configuration / deployment of new databases ● Fully automated set-up of Team-Jenkins instances (even at GCP) ● A custom-built platform-service called “Slash” as a service-interface for our developers ● Use a common set of Docker base-images with daily builds Additional goodies
  19. 19. Docker-Images Automatic build-chains with nightly builds for our Docker base-images and dependent images. Example: Java base-images
  20. 20. All is fine now? What about ● Monitoring —> Check. ● Responsibility —> ? ● THE MONOLITH?
  21. 21. The Monolith… … is still in production.
  22. 22. Scaling Services
  23. 23. Our 45 teams are developing and running more than 200 services. Imagine if all of them talk to each other: Scale at service level
  26. 26. Challenges in HTTP/REST-only architectures (diagram: a gateway in front of µServices 1-5, which also call each other). Things that help: ● API guidelines ● Timeouts ● Fallbacks ● Circuit breakers ● Eventing
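The circuit breaker named in the list above can be sketched in a few lines. This is a minimal, illustrative Python version (class name and thresholds are our own, not from the talk): after a number of consecutive failures the breaker opens and calls the fallback, and after a cooldown it lets one call through again.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    use the fallback while open, retry after a cooldown."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # circuit open: don't hit the service
            self.opened_at = None          # cooldown over: half-open, try again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()
        self.failures = 0                  # success closes the circuit
        return result
```

In a real service this would wrap the HTTP call to a downstream µService; libraries like Hystrix or resilience4j implement the same idea with more care.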
  27. 27. What is Eventing?
  28. 28. ● Enable services to provide themselves with data asynchronously, before it is needed in a request ● A kind of database replication ● More performance & stability (diagram: Services A, B and C produce their own entities and consume copies of each other’s) What is the goal of Eventing?
  29. 29. Representation of something that happened in the domain (Evans) ● An event concerns: ○ one domain entity (e.g. “customer”, “shopping cart”, “order”, “delivery area”) ○ and one state change that happened to that entity (e.g. “customer registered”, “item added to shopping cart”, “order fulfilled”, “delivery area expanded”) ● Event = functional object to describe domain changes ● Event = vehicle for database replication What is a (domain-) event?
  30. 30. ● ID: Unique identifier ● Key: Which entity is affected? ● Version: Which version of this entity is this? ● Time: When did the event occur? ● Type: What kind of action happened? ● Payload: What are the details? ○ Entire entity - not deltas! Example: { "id": "4ea55fbb7c887", "key": "7ebc8eeb1f2f45", "version": 1, "time": "2018-02-22T17:05:55Z", "type": "customer-registered", "payload": { "id": "7ebc8eeb1f2f45", "version": 1, "first_name": "Paul", "last_name": "Puschmann", "e-mail": "bofh(at)" } } Technical Event
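The event envelope described on this slide is easy to build programmatically. A small Python sketch (the helper function is ours; the field names follow the slide):

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

def make_event(key, version, event_type, payload):
    """Build a domain event as on the slide: id, key, version,
    time, type, and the entire entity as payload (not a delta)."""
    return {
        "id": uuid4().hex,
        "key": key,
        "version": version,
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "type": event_type,
        "payload": payload,
    }

event = make_event(
    key="7ebc8eeb1f2f45",
    version=1,
    event_type="customer-registered",
    payload={"id": "7ebc8eeb1f2f45", "version": 1,
             "first_name": "Paul", "last_name": "Puschmann"},
)
message = json.dumps(event)  # this is what would be sent to the topic
```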
  31. 31. Example: Customer Data. The Customer Data Service publishes the full entity to the customer topic; each subscriber persists only its subset. Published payload: { "customer_uuid": "876ef6e5", "version": 3, "name": "Peter Smith", "loyalty_id": "477183877", "invoice_address": "752 High Street", "delivery_address": "67 Liverpool Street" } — the Invoice Service keeps { "customer_uuid", "version", "name", "invoice_address" }, the Loyalty Service keeps { "customer_uuid", "version", "name", "loyalty_id" }.
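The per-subscriber subsets in this example amount to a simple projection of the published payload. A sketch (function name and data are illustrative):

```python
def project(payload, fields):
    """Keep only the fields a given subscriber actually needs to persist."""
    return {k: v for k, v in payload.items() if k in fields}

customer = {"customer_uuid": "876ef6e5", "version": 3, "name": "Peter Smith",
            "loyalty_id": "477183877", "invoice_address": "752 High Street",
            "delivery_address": "67 Liverpool Street"}

# Each consumer stores its own slice of the same event:
loyalty_view = project(customer, {"customer_uuid", "version", "name", "loyalty_id"})
invoice_view = project(customer, {"customer_uuid", "version", "name", "invoice_address"})
```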
  32. 32. We chose Apache Kafka
  33. 33. ● Open-source stream processing platform written in Scala and Java ● High-throughput, low-latency platform for real-time data streams ● Originally developed at LinkedIn, open-sourced in 2011 ● Offers four APIs: Producer, Consumer, Streams, Connect ● We use Apache Kafka in a pub-sub manner. This means most of our services use the Producer and Consumer APIs “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.” (Apache Kafka website) Apache Kafka
  34. 34. ● In general, every resource a service owns should be published to a topic ● Every state change in the domain is published as an event ● Those events are sent as messages to topics ● Topics can be created when the first event arrives. Example (the same event as before): { "id": "4ea55fbb7c887", "key": "7ebc8eeb1f2f45", "version": 1, "time": "2018-02-22T17:05:55Z", "type": "customer-registered", "payload": { "id": "7ebc8eeb1f2f45", "version": 1, "first_name": "Paul", "last_name": "Puschmann", "e-mail": "bofh(at)" } } What to send as messages
  35. 35. ● Domain events are sent to topics ● Topics can have any number of subscribers ● Topics are split into partitions; ordering is only ensured inside a partition ● Each record gets a sequential ID assigned ● Partitions are distributed over the cluster ● Partitions can have a configurable number of replicas Topics and their organisation
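The properties above can be illustrated with a tiny in-memory model (this is a simulation for intuition, not the Kafka implementation): records are appended to partitions, each record gets a sequential offset, and ordering only exists within one partition.

```python
class Topic:
    """In-memory sketch of a Kafka topic: append-only partitions,
    sequential offsets per partition, no cross-partition ordering."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        log = self.partitions[partition]
        offset = len(log)            # sequential ID inside this partition only
        log.append((offset, record))
        return offset
```

Two consumers reading different partitions may see events in a different relative order, which is why related events must share a partition (see the key-based partitioner below in the producer guidelines).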
  36. 36. ● An endless stream of events is stored in the cluster ● Only the most recent version of an entity is kept ● No point-in-time access to data ● Choose a wise key for your entity and always publish updates of a single entity under that same key Log-Compaction
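Log compaction is easiest to see in a small simulation (again a sketch, not Kafka's actual algorithm): for each key, only the most recent record survives.

```python
def compact(log):
    """Simulate Kafka log compaction: keep only the latest record per key,
    in the arrival order of the surviving records."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)       # later records overwrite earlier ones
    return [(key, value) for key, (offset, value) in
            sorted(latest.items(), key=lambda kv: kv[1][0])]

log = [("cust-1", "v1"), ("cust-2", "v1"), ("cust-1", "v2")]
# After compaction, cust-1's first version is gone: only the newest per key remains.
```

This is also why the key must be chosen wisely: if the same entity is published under two different keys, compaction keeps both copies and consumers see stale data.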
  37. 37. ● Every service which owns a resource should publish those resource entities to a topic ● Use only one producer, or make sure the order of events cannot become an issue ● To enable log compaction, use a partitioner that ensures an event with the same key is always sent to the same partition ● All producers should be able to republish all entities on request (diagram: producer A persists and publishes to the topic; B, C and D subscribe) Producers
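The key-based partitioner mentioned above boils down to a deterministic hash of the key. A sketch (hash choice is ours; Kafka's default Java partitioner uses murmur2, not md5):

```python
import hashlib

def partition_for(key, num_partitions):
    """Deterministic key-based partitioner: the same entity key always
    lands in the same partition, which log compaction and per-entity
    ordering rely on. (Illustrative; Kafka's default uses murmur2.)"""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Note that changing `num_partitions` reshuffles keys to different partitions, which is one reason repartitioning a compacted topic is painful.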
  38. 38. ● The producer has to make sure that the message is delivered and committed ● Therefore we store the raw event in a database to enable retries until it is committed to the cluster ● Scheduled jobs can take care of retries and cleanup (diagram: TX 1 upserts the Entity Repo, Event Repo and Published Version Repo; TX 2 publishes to the topic and deletes the stored event) Producers
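This store-then-publish pattern is commonly known as a transactional outbox. A minimal sketch (class name, schema and the injectable `publish` callback are ours; the real setup would publish to Kafka and run `flush` from a scheduled job):

```python
import json
import sqlite3

class OutboxProducer:
    """Transactional-outbox sketch: events are first stored locally (TX 1)
    and only deleted after a confirmed publish (TX 2), so a retry job can
    re-send anything that never reached the cluster."""

    def __init__(self, publish):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, event TEXT)")
        self.publish = publish       # e.g. a Kafka send; injected for testing

    def store(self, event):
        with self.db:                # TX 1: persist the raw event
            self.db.execute("INSERT INTO outbox VALUES (?, ?)",
                            (event["id"], json.dumps(event)))

    def flush(self):
        """Scheduled retry job: publish pending events, delete on success."""
        rows = self.db.execute("SELECT id, event FROM outbox").fetchall()
        for event_id, raw in rows:
            try:
                self.publish(json.loads(raw))
            except Exception:
                continue             # keep it in the outbox for the next run
            with self.db:            # TX 2: committed -> remove from outbox
                self.db.execute("DELETE FROM outbox WHERE id = ?", (event_id,))
```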
  39. 39. ● Every service can consume any available data and should consume all data it needs to fulfill a request - having the data at request time is better than trying to get it from another service ● The consumer has to process events idempotently: an event could be consumed more than once, since the infrastructure ensures at-least-once delivery ● Consumers have to take care of deployment specialties like blue/green ● Consumers should be able to re-consume from the beginning, for instance when more data is needed ● Consumers should persist only the data they really need Consumers
  40. 40. ● The consumer is responsible for a manual commit only after successful processing of the event. “Successful” can mean: ○ the needed data from the event is saved in the service’s data store, or ○ the event can’t be processed and is stored in a private error queue / table (diagram: the consumer subscribes to the topic, upserts the Entity Repo and Processed Version Repo, and inserts into the Error Repo) Consumers
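The consumer rules from the last two slides (idempotent processing, a processed-version table, a private error table, commit only after handling) can be sketched together. This is our own illustrative model, not the talk's actual code:

```python
class IdempotentConsumer:
    """Consumer sketch: upsert the entity, remember the highest processed
    version per key, park unprocessable events in an error table, and
    only then 'commit' the offset (manual commit after processing)."""

    def __init__(self):
        self.entities = {}            # Entity Repo
        self.processed_version = {}   # Processed Version Repo
        self.errors = []              # Error Repo (private error table)
        self.committed_offset = -1

    def handle(self, offset, event):
        try:
            key, version = event["key"], event["version"]
            if version <= self.processed_version.get(key, 0):
                pass                  # duplicate or stale: at-least-once delivery
            else:
                self.entities[key] = event["payload"]
                self.processed_version[key] = version
        except KeyError:
            self.errors.append(event) # can't process: store it and move on
        self.committed_offset = offset  # commit only after the event is handled
```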
  41. 41. Kafka & Ops: Eventing benefits for Operation
  42. 42. Pros: • Each service has its own database: this supports migrations, query tuning and database usage • Topic replication: replicating topics to different brokers supports second datacenters or migrations to different environments • Asynchrony: services don’t need to do synchronous calls to share their data with other services. Cons: • Another super-important service: Kafka is the hub of your business data; take care of it • Redundancy of data: your databases will store the same data, or subsets of it, more than once • Asynchrony: a consumer may lag behind on some topics, which can lead to inconsistencies, e.g. in the frontend Kafka & Ops: Eventing benefits for Operation
  43. 43. Kafka & Ops: Eventing benefits for Operation By using the concept of “Kafka-mirrors”, you can push selected topics to a different Kafka cluster (one-way). This way you can easily set up services as consumers in a different datacenter. For producers, you’d shut down the producer, switch the direction of the “Kafka-mirror” and then start the producer “on the other side”. Optionally: delete the topic, then create and fill it anew. Kafka can help you migrate between providers / datacenters. —> NICE!
  44. 44. Different things
  45. 45. • We run Logstash on each VM to read, filter and ship log events, to have consistency in log-mangling per VM (for infrastructure services / daemons). • Each application (microservice, rest of the monolith) is required to write log events in a defined JSON structure. • Our central “logindexers” only have to do one thing: read from Redis and deliver to Elasticsearch as fast as possible. What do we do differently? (or at least what we think we do differently)
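Forcing every application to emit a defined JSON structure can be done at the logging-framework level. A Python sketch (field names and the service name are illustrative, not REWE's actual schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every log event as one JSON object per line, so downstream
    shippers (Logstash, logindexers) never have to parse free-form text."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "customer-service",   # hypothetical service name
            "message": record.getMessage(),
        })

# Wire it up once per application:
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("app").addHandler(handler)
```

With this in place the central indexers can treat every line as structured data and ship it straight to Elasticsearch.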
  46. 46. Helpful things
  47. 47. ● Strong Architecture-Guild: - Eventing-Guide - API-Guide - … and many more ● Active Communication of changes & constraints ● Monthly / Bi-monthly Bootcamps for (new) colleagues What helped us most?
  48. 48. • Introduction of Eventing (with Kafka) • Making development teams analyse logs & metrics on their own • Strong usage of ELK • Strong usage of Prometheus • External traffic (web, mobile app, partners) always has to be routed through a gateway Continued … What helped us most?
  49. 49. What we Learned
  50. 50. • Communication is a key factor • Automation pays off • Eventing with Kafka is cool • Temporary solutions last very long • The knowledge / distribution of RACI helps • UBIURI (you built it, you run it) is not only an option What did we learn?
  51. 51. We did try to scoop out the Monolith. —> That was not a good idea. Perhaps better: put a gateway in front of your legacy application and switch resource by resource. Every service must have an owner! What did we learn? Continued …
  52. 52. The Future
  53. 53. Our Future
  54. 54. For Eventing-Foo: • Markus Rebbert @mrebbert • Sebastian Gauder @rattakresch • Ansgar Brauner @a_brauner For motivating me to give this talk • Matthias Jambor @grambulf Thanks go to:
  55. 55. Thank You! Paul Puschmann | @ppuschmann | @rewedigitaltech All background pictures are licensed under CC0. Source: From Monolith to Microservices OSDC 2018, Berlin