A talk presented at https://krk-rb.pl/ in May 2019 about how Zendesk uses its Change Data Capture fire hose with Kafka, MySQL & domain events to fan out changes from its services and build event based systems.
5. ● Worked at Zendesk > 5 years
● Number of services, developers and products skyrocketed
● Complexity, methodologies, languages
● One thing keeps getting harder: Communication
Background
6. ● Move towards event based architectures
● More services, more products, but one message bus: Kafka
● Events!
● Wait, what events?..
Eventful discussions
23. Domain events
If we are to use domain events for meaningful and
critical business logic, we need to be able to trust
that they convey the truth about the state of the
world and the changes therein.
25. Kafka + Transactions
Transactions across MySQL + Kafka: That’s Not A Thing.
Example: new record
1) Save record to MySQL
2) Write “PaymentCompleted” event to
Kafka
3) Kafka publish failed
4) ... ?
- Rollback MySQL tx?
- Retry --harder?
26. Kafka + transactions
● ACID?
● Transactions? Maybe...
● Distributed transactions are hard
● How do we use Kafka and guarantee transactionality?
33. Binlog relay
● Maxwell: Service that sits on the DB cluster level
● Watches binlog for changes
● Pushes transaction rows as messages to Kafka
● Keeps offset state in persistent storage
39. Building an outbox
Building your outbox architecture
● Instrument events in your application
● Make use of an Outbox::Log to guarantee
transactionality
● Create and start writing to your outbox table
● Use a binlog relay service like Maxwell to fetch
data from storage and into Kafka
● Consume said events using Kafka consumers
42. Outbox + Domain events recap
Outbox pattern: What does it allow us to do?
● Keeps the transactionality we need for our event based
architecture
● Disconnects the application logic itself from the message
bus
43. Outbox + Domain events recap
Domain events + outbox pattern
● Read your own writes semantics
● Capture impact and change
● Transactional guarantees
● Trusted to be used as an ISC tool for our critical systems
50. Lessons learned
● Instrumenting the events themselves takes time.
● Metrics, alerts and monitoring are your friends.
● Schemas. How do you handle schema change across
multiple consumers using multiple versions?
● End to end testing. Asynchronous system testing is hard!
51. Things to look out for
● Business data and the outbox must exist in the same
database.
● Deduplication - using unique event IDs, can be handled at
the consumer layer.
● Transactional guarantees will not save you from the wrong
message being sent out.
● Designing your events will make you truly understand your
product domains.
Hi everyone, thank you very much for attending! Today I would like to talk to you about Domain events & Kafka in Ruby applications.
This is me, my name is Spyros Livathinos, and you can find me under @livathinos on Github. I am currently working at Zendesk as a Group Tech Lead.
A little bit about me:
I am based in Copenhagen, Denmark, where I had the fortune to join a really talented group of people and work on the Guide product for Zendesk.
Guide is the Self Service product that Zendesk offers, allowing customers to build their knowledge base and community with content that will allow their users to solve their problems without involving support agents.
… and I was born and studied in Patras, Greece, a coastal city about a three-hour drive from Athens. If you ever visit, I would recommend staying for the sunsets, they are really beautiful :)
I have worked at Zendesk for more than 5 years now. When I first started, Zendesk was a Support SaaS company, but through the years more products were released, our engineering force multiplied, and more offices were added all around the world. Through this growth, we started seeing many different languages, methodologies and architectures being used across our application ecosystem.
And one pattern we started recognizing is how much more challenging communication becomes when the company scales. Not only as human beings working in geographically distributed teams and product structures, but also on an architectural level and in the way we build our services.
So, through the years we started thinking about and moving towards event based architectures.
We took the decision, as an organization, to use one message bus, and that would be Kafka.
Kafka comes with messages, or events. But what kind of events are we even talking about? Who writes those? What kind of guarantees do we have for them? We had a lot of questions to answer before this kind of approach would be ready to be used throughout the company.
And we’ve had a few projects that touched on the concept of events and I want to talk to you about how we went through this transformation.
But first, I wanted to very briefly introduce Kafka, since it’s playing a significant role in our discussion here.
TO CROWD: How many people here know of Kafka, use Kafka in their professional environments? Or just like to play with Kafka in their free time?
So Kafka is a message bus, or event log, which keeps track of messages in an append-only log fashion.
If we are to take a very high level overview of how Kafka works, we have three main components.
On the one side we have producers, which are processes that are producing data, or events. We have the Kafka cluster, which is taking care of the storage, partitioning and routing of your data to topics.
And on the other side we have consumers, consuming those events.
So if I were a producer pushing data to a topic, a topic being a namespace which has a name identifier, these would end up in my Kafka cluster. My consumers on the other side, would be able to subscribe to this topic and consume the events that were appended to it.
In the Ruby world, the way this would look like is, we would have a producing service and our Kafka cluster as the first two components of our pipeline.
In the producing service, I would start by writing what is called a Kafka client. Here, I’m making use of the ruby-kafka gem, creating a client with a bunch of configuration: the brokers of my cluster, and a client_id for logging.
(Next slide for deliver)
Next, I want to do something with my client, and more specifically, I want to deliver a message to the users topic.
So I invoke the deliver_message function, which takes two arguments: one being the payload of the message that I want to send off to the cluster, the other being the topic I want to broadcast the event to. You can see here, my payload contains a user_id and a banned flag - I want to broadcast the fact that the user with ID 1 should be banned from my service. The use case here being that there is a user spamming my community, and I want to make sure this user is banned.
In the end, I have implemented a producer inside my Ruby application, from which I can send off an event to my cluster, in the topic of my choosing.
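The producer described above can be sketched roughly like this. The payload-building part is plain Ruby; the ruby-kafka calls themselves are shown in comments since they need a live cluster, and the broker addresses and client_id here are illustrative, not the real ones from the talk:

```ruby
require "json"

# Build the message payload separately so it can be inspected without
# a running Kafka cluster.
def banned_user_payload(user_id)
  JSON.generate(user_id: user_id, banned: true)
end

# Producer sketch (assumes the ruby-kafka gem and reachable brokers;
# broker addresses and client_id are illustrative):
#
#   require "kafka"
#
#   kafka = Kafka.new(["kafka1:9092", "kafka2:9092"], client_id: "my-app")
#   kafka.deliver_message(banned_user_payload(1), topic: "users")
```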
Now, on the other side of the spectrum we have the consumer which is supposed to consume events from the topic that we are subscribed to.
Let’s assume here that I am also the owner of a Consuming Service, and this service will be responsible for acting upon the events that were sent off to the users topic.
(Next slide for explanation of code)
So I’m implementing my Consumer class that subscribes to the users topic and processes a message. The process method is something that comes with every consumer made with the Racecar gem, and allows for an attribute to be passed in, that attribute being the message that was consumed from the Kafka topic.
I’m processing my message, and, knowing it is in JSON format, I parse it - you can imagine that there can be multiple formats for serializing messages, like protobuf, Avro or whatever else you want to use - but in this case it’s just a JSON object.
So I parse the message’s value, find the User in question and ban them. In the end I have a user that was banned from my system, asynchronously, across two different systems, using the event stream. In this case I have an asynchronous Ruby consumer, which took care of this work for me.
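The consumer side can be sketched like this. The message-handling logic is pulled out as a pure function so it can be exercised without Kafka; the Racecar consumer itself is shown in comments, and the class name and `User.find(...).ban!` call are illustrative stand-ins for the real model code:

```ruby
require "json"

# Pure message-handling logic, separated from the consumer so it can
# run without a Kafka cluster. Field names follow the talk's payload.
def handle_user_message(raw_value)
  data = JSON.parse(raw_value)
  data["banned"] ? [:ban, data["user_id"]] : [:noop, data["user_id"]]
end

# Racecar consumer sketch (assumes the racecar gem; class and model
# names are illustrative):
#
#   class UserConsumer < Racecar::Consumer
#     subscribes_to "users"
#
#     def process(message)
#       action, user_id = handle_user_message(message.value)
#       User.find(user_id).ban! if action == :ban
#     end
#   end
```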
This is a very simple approach, if you ever wanted to send out a message to Kafka using Ruby, this is all it takes. What’s not included here, is that you of course need to have the operations side of things covered, meaning you have a working and healthy Kafka cluster and the interconnectivity between your apps and the cluster readily available.
I talked to you before, about the communication issues of scaling in a company that grows a lot.
All of this work brought us to a more philosophical topic of the difference between intent of a user versus the impact an action has on the system in the services that we own. Communication can go wrong in spectacular ways. How should we handle this in our systems?
Leaving the world of mythology and going back to the world of services and scaling, we in the team, talked a lot about events and how we want them to be meaningful and trustworthy. And through these conversations we realised there are different categories of events.
When a user intends to action on something, we do want to keep track of that notion, for analytics purposes. When a change is made to our system’s underlying record of truth, we want to keep track of the impact in the system.
So not all events are written and used the same way. In fact, there seemed to be two distinct categories of events that we were interested in for the sake of our projects.
To illustrate the difference in their meaning and impact, here’s a real-life example of an action that could result in event emission and handling: A user views our webpage, and wants to buy our product. They fill up their shopping cart, fill out their payment details and click on the “Pay Now” button on our payment form. We instrument this click and want to keep track of the user_id, the account this took place in, and the product the user bought.
Why do we need this information? Because we would like to understand user behaviour in our application throughout the day. We gather these click metrics and aggregate them. Note that we don’t actually care if the payment was registered in our system or not. The click itself, even if the underlying system failed, is important to us to register.
So we would like to emit an event through our Kafka producer, called payment_clicked, that will carry information about the usage of our payment form.
Now another category of events we started using to instrument our systems were what are called Domain Events. And Domain events have been around for a long time.
A domain event is an event that relates to a specific application domain. For example, for a payment domain, we can imagine PaymentInitiated, PaymentCompleted, etc.
Domain events should correspond to domain concepts rather than data store concepts or usage concepts etc.
Now, let’s take the same example and assume that the user goes through the same flow.
And let’s assume that the underlying payment service works perfectly fine and the payment registers. What does this mean for us? It might mean that a payment record was added to our payment service’s database. So in this case, we save the payment record to our storage engine.
We also want to instrument our domain event, to describe the change that just happened. We do this, by emitting an event again, to Kafka, but this time with a different name and most likely, payload.
So already here we can see that the meaning behind the activity and the domain event with regards to our payments platform are very different.
In the case of the activity event, we tracked what went on with the user interaction. In the case of the domain event, we care about the underlying system, and are rather interested in what happened on the insides of our domain model rather than on the surface of it.
So when we talk about domain events, we talk about change.
Domain events are an inter-service communication (ISC) mechanism. They are modeled and designed with the intent of allowing internal and external consumers to react to record mutations.
So one of the questions we had was: is it important that these events are reliable?
This becomes a bit clearer when we start thinking about what downstream consumers might do with those events.
Let’s take an example of the payment service making a change, and propagating the payment_completed event to the message bus. Another downstream application, the Accounts Service, picks up this event and modifies its record of truth to allow the paid-for product to show up in a customer’s Administration panel.
So it seems that since we have an important service relying on these events, reliability is important to us.
And if we are to use domain events for meaningful business logic, we need to be able to trust them.
Now, when we’re usually thinking about reliability models like ACID, the things we’ve come to enjoy with classic relational databases for example, we’re usually thinking besides other things, of transactions.
Transactions, allow us to bunch together actions that will have an effect on our data layer, and make sure that either they all succeed, or none of them succeeds.
In the payment example, how could we handle the case of the event not being published?
Well, we could try to keep the transaction open for as long as the publish takes to complete, but that would kill performance.
We could keep retrying, but where do we keep the state of those retries? Do we keep them in a job queue somewhere? Maintaining the state of that job queue would add quite some overhead.
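The failure mode described above can be simulated in a few lines of plain Ruby. FakeDB and FlakyKafka are stand-ins I made up for illustration, not real clients; the point is only the window between the committed DB write and the failed publish:

```ruby
# Minimal simulation of the dual-write problem: the DB commit succeeds,
# the Kafka publish fails, and the two systems now disagree.
class FakeDB
  attr_reader :rows

  def initialize
    @rows = []
  end

  def transaction
    yield
  end
end

class FlakyKafka
  def deliver_message(_payload, topic:)
    raise IOError, "broker unreachable" # simulate a failed publish
  end
end

def naive_dual_write(db, kafka)
  db.transaction { db.rows << { id: 1, status: "completed" } } # 1) MySQL commit
  kafka.deliver_message('{"event":"PaymentCompleted"}', topic: "payments") # 2) publish
  :consistent
rescue IOError
  :inconsistent # 3) the row is committed, but the event never went out
end
```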
So where does Kafka come in related to these things? Is Kafka ACID compliant? Do we even have Kafka transactions? The answer is kind of, maybe and depending on the version of Kafka that you’re using :) It is notoriously hard to ensure distributed ACID compliance and Kafka is no exception - it is after all, the very essence of a distributed message bus.
Then the question becomes: can we use Kafka and still allow for transactionality guarantees during our instrumentation without dealing with distributed transactions?
The answer is yes: there are solutions to the problem.
I want to talk here about one of them, which goes by the name of the “Outbox pattern”, or Transactional Outbox.
In order to understand the outbox pattern, I feel it helps to think about the 90s.
Dial-up - households didn’t have internet 24/7 and our browsers didn’t support SPAs.
Desktop email clients like Thunderbird were popular.
So how would this work with our previous example of payment systems? Well, we would again have our payments service, storing data to its database.
And these actions, we want to keep in the same transaction.
But one thing that happens when you save data to a database like MySQL, if you so choose in your configuration options, is that you get a file called the binlog. The binlog records every transaction that was successful in our DB cluster. Each line in the log contains a transaction ID and the whole payload of the SQL statement that was executed, including the data that was inserted, deleted or updated.
Now, imagine you have another service that reads the binlog line by line, and transforms these lines into Kafka messages that get published to your message bus. This is essentially the outbox pattern and how it can be implemented in our systems.
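The all-or-nothing property the outbox gives us can be sketched with a small in-memory simulation (MiniStore and its table names are illustrative; in a real app both writes would go through the same ActiveRecord transaction): the payment row and the outbox row commit together or roll back together, so the binlog - and therefore Kafka, via a relay like Maxwell - only ever reflects committed changes.

```ruby
require "json"

# In-memory sketch of the transactional outbox.
class MiniStore
  attr_reader :payments, :outbox

  def initialize
    @payments = []
    @outbox = []
  end

  # All-or-nothing: restore both tables if the block raises.
  def transaction
    payments_snapshot = @payments.dup
    outbox_snapshot = @outbox.dup
    yield self
  rescue StandardError
    @payments = payments_snapshot
    @outbox = outbox_snapshot
    raise
  end
end

def record_payment(store, amount)
  store.transaction do |s|
    s.payments << { id: s.payments.size + 1, amount: amount }
    s.outbox << {
      topic: "payments",
      message: JSON.generate(event: "PaymentCompleted", amount: amount)
    }
  end
end
```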
But we started moving away from Maxwell and towards a more specialized service that takes care of pushing domain events to Kafka, rather than raw changes to our database.
So how do we make all of this work in Ruby? After all, this is krk-rb :)
This is an implementation of the Domain Event Log. The class is pretty simple and only cares about the storage.
The record method takes a block and wraps it in a transaction. It has some convenience built in, in that it also instruments (yes, again) to Datadog, for metrics and monitoring purposes. It also allows for batching of events, in the case of, let’s say, a transaction containing multiple domain events.
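A hedged sketch of such a Domain Event Log, along the lines described above: the class and method names here are illustrative, and the real implementation also wraps the writes in a DB transaction and emits Datadog metrics, which are omitted here.

```ruby
# Sketch of a Domain Event Log: `record` takes one event or a batch,
# yields so the caller's own writes share the same logical unit of
# work, and then appends the events to the sink (a stand-in for the
# outbox table).
class DomainEventLog
  def initialize(sink)
    @sink = sink
  end

  def record(events)
    events = events.is_a?(Array) ? events : [events]
    result = block_given? ? yield : nil
    events.each { |event| @sink << event }
    result
  end
end
```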
Explain points.
This is the first-ever picture of a black hole. It was released, about a month ago, by an international network of radio telescopes. You might be wondering what it has to do with domain events, transactionality and the ever-growing domain event table in our storage.
Well, there is a performance improvement to be made in our implementation. Specifically, if you use MySQL, you can use the BLACKHOLE engine for the table your events are written to. Think about it: you don’t actually need to store those events in the DB in the first place. All you really use the DB for is the transactionality guarantees and writing to the binlog, both of which are available to you with the BLACKHOLE engine as well.
This is how a BLACKHOLE table could look. We save the Kafka topic, the partition_key and the message to the table, which allows our binlog relay service to pick up the record by watching the binlog and publish it to the Kafka topic of our choice.
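As a sketch, such a table could be defined with an ActiveRecord migration like the one below (assumes Rails with ActiveRecord on MySQL; the table and column names are illustrative, not the ones from the talk):

```ruby
# Outbox table backed by MySQL's BLACKHOLE engine: rows written here
# hit the binlog for the relay service to pick up, but are never
# actually stored.
class CreateDomainEventsOutbox < ActiveRecord::Migration[5.2]
  def change
    create_table :domain_events_outbox, options: "ENGINE=BLACKHOLE" do |t|
      t.string :topic,         null: false # Kafka topic to publish to
      t.string :partition_key, null: false # Kafka partitioning key
      t.text   :message,       null: false # serialized event payload
      t.timestamps
    end
  end
end
```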
So what does this architecture allow us to do?
The outbox pattern allows us to guarantee the transactionality of our event emission along with the changes to our domain model without using a distributed transaction model.
It also allows us to disconnect our applications from the message bus. Think about the case where you have hundreds of microservices in your architecture. It might very well be the case that you don’t want all of them to talk to the same message bus. It might also be the case that you don’t want to introduce a message bus dependency at all to most of them.
The outbox allows you to use the same storage you would use to keep state, to also keep a log of changes to your data.
And to recap on our domain events when used in conjunction with the outbox pattern: Domain events, with the usage of the outbox pattern, have “read your own writes” semantics. They capture impact on the service or application in question. We are attributing transactional guarantees to them and if implemented with all the above characteristics, are very useful for communicating between critical systems in our service ecosystem.
I wanted to briefly speak about a real life example that we started working on some time ago in Copenhagen. This is about the concept of mirroring data between services.
We have been working towards better reliability and stability of our services, and during those efforts, one of the issues that kept appearing was “What if another team’s core service is down, when our team’s service depends on it?”.
The answer, most of the time, was “Well, I guess our service is also considered down, or at least partially functioning”, depending on the architecture. We can always design our way out of completely broken services, but the point is, how robust can we make our applications to handle their own when things go wrong?
The answer almost always will depend on how much of the data your application needs is readily available to it. If everything is an API call, you have a problem because at some point, core domain entities of your application will rely on these calls to be successful in order to render content that is useful to your customer.
Let’s talk about the real-life example in detail.
We have a content editing application that allows users to author articles. The article is a domain object that is pretty complicated here, it contains a title, an author, a body and many more attributes. This application though, also has domain events instrumented.
Instrumenting with intent in order to capture impact.
Business data and the outbox must exist in the same database. Unless you want to deal with distributed transactions!
…
Finally, designing and discussing your events will make you truly understand your domain. Anton talked about event storming before; I would recommend it to anyone who wants to model their domain in a way that will uncover convoluted domain models, complexity and hard-to-follow user paths.
But before we go into this I want to talk to you about one of my favourite ancient Greek myths. You know, being Greek, I almost feel the obligation to talk to you, even briefly, about Greek mythology :) This myth is about Aegeus and Theseus.
Aegeus was the King of Athens and the father of Theseus, who grew up to be a formidable warrior. Aegeus, having lost a war to King Minos of Crete, had to pay tribute to that King every year - human tribute, in fact - which means he had to send off seven women and seven men on a vessel to Crete every year. But Theseus, growing up, really detested this thing that his father was forced to do. So he seeks to change it and tells his father that he will sail along with the seven women and men on a tribute vessel to Crete, with the goal of defeating the mighty Minotaur - half man, half bull - for whom the tribute is paid. Every year, the vessel is sent off with black sails on its masts, with Athenians crying in the streets for losing so many of their own, sure that they will never return. Aegeus, dreading the outcome of this journey and being a worrying father, asks of his son that, should he be successful in the dangerous journey and return victorious, he put white sails on the mast - so that he can see them from afar and know that his son is returning to him alive and well. Otherwise, the ship will return whence it came with its black sails on display.
Theseus and his crew were successful at defeating the Minotaur. They were successful in putting an end to the terrible human tribute, and were going to be celebrated when they came back to Athens. But amidst this joy, drunk on their victory, they crucially forgot to change the ship’s sails to white on their return journey to Athens.
Aegeus, standing at the edge of a cliff back home, overlooking the sea, sees the black sails and jumps into the sea in dismay, believing his son to have perished. And this, according to myth, is how the Aegean Sea was named. Now you might be wondering what this has to do with communication and events. And I think there’s one thing that’s important to understand in this story: Theseus’ success on his mission and the message sent to his father were disconnected and, ultimately, did not describe the truth. This, of course, had tragic and dramatic consequences, as it normally goes in Greek mythology :)