A common pattern in application development is to build systems where the data is directly linked to the current state of the application; one row in the database equates to one entity’s current state. Only ever knowing the current state of the data is adequate for many systems, but imagine the possibilities if one had access to the state of the data at any point in time. Enter Event Sourcing: instead of persisting the current state of our Domain Objects or Entities, we record historical events about our data. This pattern changes how we persist and process our data, but is surprisingly lightweight. In this talk I will present the basic concepts of Event Sourcing and the positive effects it can have on analytics and performance. We’ll discuss how storing historical events provides extremely powerful views into our data at any point in time. We’ll see how naturally it couples with the Event-oriented world of modern Reactive systems, and how easily it can be implemented in Groovy. We’ll examine some practical use cases and example implementations in Ratpack. Event Sourcing will change how you think about your data.
“Some objects are not defined primarily by their attributes. They represent a thread of identity that runs through time and often across distinct representations. Sometimes such an object must be matched with another object even though attributes differ. An object must be distinguished from other objects even though they might have the same attributes.”
–Eric Evans
ES & CQRS can exist separately, but complement each other
Before I begin, I just want to run through a quick scenario with you all
Picture your bank in your mind. For many of us, this may not exactly be a happy thought.
There are likely several brick-and-mortar branches for your particular bank in the area.
However, being tech folks, I don’t suppose it’s a stretch to surmise that you all primarily interact with your bank via their website, yes?
Imagine you went to your bank’s wonderful website, entered your information, and logged in successfully.
I hope no one here works for Farmers. I typed “bank website” into Google, and this was the first result.
And, upon logging in, you click the link to check your balance. In doing so, you’re presented with a screen that just shows you “Balance, $100”, but with no context around that number.
This may be fine if you expected there to be $100…
What if that’s all your bank balance was, just a simple number?
What if that was all your bank could tell you? What if your balance was simply a column in a row in a database somewhere?
… and What if you didn’t agree?
How angry would you be?
Can you imagine the arguments you’d have with the teller or an agent over the phone, trying to figure out whether your latest paycheck was deposited?
Luckily, that’s not how things are done. Banks store your account’s entire history.
Every transaction you make with your bank. Every Credit or Debit made is logged, along with an audit trail of who (e.g. which teller) made the change.
To get your balance, your bank simply adds up each of these transactions.
Banks may also periodically record what the balance was at specific points in time, to avoid recalculating everything from the beginning of time.
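This “add up the transactions” idea is the whole trick. As a minimal sketch (the names and amounts here are hypothetical, not any real bank’s schema), the balance is not a stored column, it is a fold over the transaction events:

```java
import java.util.List;

// Each credit or debit is an event; the balance is derived by replaying them.
record Transaction(String type, long cents) {}

class BankAccount {
    static long balance(List<Transaction> history) {
        long total = 0;
        for (Transaction t : history) {
            // Credits add to the balance, debits subtract from it
            total += t.type().equals("CREDIT") ? t.cents() : -t.cents();
        }
        return total;
    }
}
```

Deposit $150.00, spend $50.00, and the derived balance is $100.00 with the full story of how it got there.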
There’s a certain advantage to this idea, that we can record all modifications to our data - in this case, the credits and the debits - as EVENTS that occur within our system.
For example, Your bank is able to tell you exactly how they arrived at your account balance.
-What about your company’s software?
-Can you tell your users or internal business analysts how you are arriving at the data your application presents to them? If they have a disagreement with a value in a
…
-Today I’m going to present a method called Event Sourcing that does just that… and how it can fuel the competitiveness of your company.
Bold, right?
With that, this is ‘Richer Data History with Event Sourcing’. My name is Steve, and I work for a startup called ThirdChannel, which is located in Boston
Today I’d like to go over the following topics:
-Event Sourcing at a high level,
-Challenges
- Querying: “How do I query this mess?”
Let’s begin
Or rather, it’s an alternative to your standard ORM storage mapping,
Where an object in memory maps directly to a row in a database, even if that row may be split via joins
* update is made to a model, updates a column in your database
* in this method, the only thing you know about your data is what it looks like right now.
Event Sourcing says “that’s fine, but we’re going to do something a bit different”.
Instead of storing the current state of our models, we’re going to store facts about our models
Every successful user interaction generates a series of facts or ‘events’ within our system
This stream of events is persisted in our database in the order it occurred, as a journal of what has transpired.
These events can then be played back against our Domain Object, building it up to the state it would be in at any given point in time, although this is most likely the current state.
A stream of events represents a particular object in aggregate.
Which means I should talk about the two main concepts behind Event Sourcing
An Event represents something that has occurred within your system. The past tense is important when describing them. It represents an intentional user action, or the result of a user action, that almost always results in the manipulation or state change of an Object.
Things like “BankAccountOpened”, “CurrencyDeposited” are decent names for events
The objects which are affected by Events are referred to as an Aggregate. They generally serve as a root for a stream of events; they represent the state of an event stream ‘in aggregate’.
If it helps, you can think of it like a domain model. It doesn’t have to be, though. It can be, say, relationships between objects. For example, at ThirdChannel we model the assignment relationship between our users and what we call programs as an Aggregate. Along with many other things.
Note this is a very simple explanation, and I’ll touch more on this later
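To make the two concepts concrete, here is an illustrative sketch using the event names from above. This is pseudostructure of my own, not any particular framework’s API:

```java
import java.util.ArrayList;
import java.util.List;

// An Event is a past-tense fact; applying it mutates the aggregate's state.
interface AccountEvent { void applyTo(BankAccountAggregate aggregate); }

record BankAccountOpenedEvent(String owner) implements AccountEvent {
    public void applyTo(BankAccountAggregate a) { a.owner = owner(); }
}
record CurrencyDepositedEvent(long cents) implements AccountEvent {
    public void applyTo(BankAccountAggregate a) { a.balanceCents += cents(); }
}

// The Aggregate is the root of the stream: the state of the events 'in aggregate'.
class BankAccountAggregate {
    String owner;
    long balanceCents = 0;
    final List<AccountEvent> journal = new ArrayList<>();

    void apply(AccountEvent event) {
        event.applyTo(this);   // mutate the transient state...
        journal.add(event);    // ...and append the fact to the journal
    }
}
```

Notice the aggregate never exposes a setter for its balance; the only way state changes is by applying a named, past-tense event.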
Event Sourcing is a purely additive model…
there are no deletes or updating of events. Events are immutable once written to our journal
This is a powerful notion, if you consider the implications: using Event Sourcing, no data is ever lost or ignored.
<pause>
Now, When I need to retrieve information about my Aggregates, I simply play back all of the events that have occurred in the past in order to build up the data to a specific point in time, generally the current date, thus getting the current state of our data.
One of the Key points: by maintaining all events, we’re able to access the current state of our aggregates (again, or objects), certainly, but we can also access the state of our data or aggregates at any point in time.
Which is huge.
Now, I’m sure some of you are thinking “waiiit, if I never get rid of anything, certainly that has tradeoffs, too?” Specifically, performance. What happens if I have thousands, or even millions of events I have to apply?
You’re right, and that’s a great observation.
Luckily, Event Sourcing recommends an early optimization known as ‘Snapshots’
A Snapshot is just what you’d think it would be: a recording of the details of your Aggregate at that moment in time. Persisted forever
As we consume and create events, periodically we persist a snapshot, containing the state of the aggregate at that point in time.
-When replaying events, we load from the most recent snapshot, then apply only the events between when that snapshot was taken, and the targeted end date. So, in this case,…
-I’ll get into some more specifics around snapshots in a bit.
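The replay-from-snapshot logic is simple enough to sketch here. Assume (hypothetically) that snapshots and events both carry the aggregate’s revision number; we start from the latest snapshot at or before the target revision and apply only the events that came after it:

```java
import java.util.List;

// Illustrative only: a snapshot records the state at a revision, and replay
// applies just the events between that revision and the target.
record BalanceSnapshot(long revision, long balanceCents) {}
record DepositedEvent(long revision, long cents) {}

class SnapshotReplay {
    static long balanceAt(BalanceSnapshot snapshot, List<DepositedEvent> events, long targetRevision) {
        long balance = snapshot.balanceCents();
        for (DepositedEvent e : events) {
            // Skip events already folded into the snapshot; stop at the target
            if (e.revision() > snapshot.revision() && e.revision() <= targetRevision) {
                balance += e.cents();
            }
        }
        return balance;
    }
}
```

The tradeoff is one extra query (fetch the snapshot) in exchange for skipping however many events the snapshot already covers.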
One of my favorite examples
Suppose we were building an ecommerce app and we are building the ‘shopping’ cart feature.
Naive, ORM, relational -> join table with quantity
Event Sourcing -> <identify page components> are not saved as a join table or a single row.
Instead, w/ ES, system stores all commands you’ve issued / replays for current state
<list events> Quickly remove base from the cart, before placing the order and generating an OrderPlacedEvent
View doesn’t display raw events
Data backing the view is built up from events to form an Object intended for View
For those of you familiar with ‘CQRS’, ES is commonly featured as part of it; I’m attempting to describe the ‘Read’ component.
Object is Transient -> object will be garbage collected and no direct representation exists on disk
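A sketch of that transient view object, under my own hypothetical event names (assuming a recent JDK for records and sealed types): the event stream is what gets persisted, while the cart contents below are rebuilt on demand and never stored as rows.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The persisted facts: what the shopper did, in order.
sealed interface CartEvent permits ItemAddedEvent, ItemRemovedEvent {}
record ItemAddedEvent(String sku) implements CartEvent {}
record ItemRemovedEvent(String sku) implements CartEvent {}

class CartView {
    // The transient read model: derived from the stream, garbage collected after use.
    static Map<String, Integer> contents(List<CartEvent> stream) {
        Map<String, Integer> quantities = new LinkedHashMap<>();
        for (CartEvent e : stream) {
            if (e instanceof ItemAddedEvent added) {
                quantities.merge(added.sku(), 1, Integer::sum);
            } else if (e instanceof ItemRemovedEvent removed) {
                quantities.merge(removed.sku(), -1, Integer::sum);
                quantities.remove(removed.sku(), 0); // drop items that hit zero
            }
        }
        return quantities;
    }
}
```

Add a base, add a lamp, remove the base: the view shows only the lamp, but the journal still remembers the base was ever there.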
That brings up the next step, Working with Objects
And this is where Event Sourcing will start to hurt your brain.
In order to fully grasp what Event Sourcing is, it’s important to realize that…:
All objects that are ‘displayed’ to the user in your View layer are simply transient derivatives of your event stream.
They are ephemeral and must be built up from the events to be used in a traditional manner within your application
Finally, I argue that structuring our data in this way is akin to the way our brains work; it’s natural.
Internally, your mind is able to tell you the current state of your knowledge about things. This current state is formed by a series of observations / facts / events in your past.
You’re able to replay these events in your mind, and also remember your knowledge at that point in time.
Our minds aren’t perfect, though, and sometimes we violate the ES rules by deleting Events. Ooops.
Let’s take me as an Example. Even if you’ve never seen or met me before today, your mind has already recorded a series of facts which is driving your mental model of me. For example:
FeatureObservedEvent
ActionObservedEvent
-Now, if I were to suddenly make a rude gesture at the audience…
that would apply a new event to your mental model. Your current state opinion of me would likely be negative
although you could remember a time before you thought negatively of me.
“Man, Steve seemed like an alright guy until he flipped off the audience. What a jerk”
Finally, I would be remiss if I didn’t mention Domain Driven Design or CQRS while standing up here talking about Event Sourcing
Has anyone read? Domain Driven Design by Eric Evans?
- Eric wrote this book after years of trying to create enterprise software
- some great concepts in there…
-Ubiquitous Language -> the domain model the engineers build should be handed down from domain experts. Both the business side of things and the engineers should be able to talk about the system using the same terms and languages. In other words, your engineers should never say ‘AbstractFactory’ to your designers.
-Bounded Context -> keep related models isolated and segregated from the others
-Domain events - something that occurs that domain experts care about
-Aggregates
-Repositories -> each domain object or bounded context should have a dedicated repository for accessing its storage
Earlier I mentioned that you could think of an Aggregate as equivalent to a domain object or an entity… and while you can, the concept of an Aggregate is expressed a bit different in DDD.
An Aggregate can be comprised of several ‘internal’ objects which are owned by a ‘parent’ object. This parent object is known as the Aggregate Root.
External code touches the Aggregate Root, but cannot link to the Root’s supporting objects
One of the most interesting parts of DDD, one that really stuck with me, is this quote: <read quote>
- that’s interesting, yeah?
If I change my name, am I no longer me? Of course not, I’m defined by more than my name. If I change my email address, or my address, or my social security number for some reason, am I no longer me? Obviously not… Can your database understand identity changes like this and still be able to find the original object?
I think this is the heart of Event Sourcing.
Not going to spend too much time on this, as CQRS could be its own session. I believe there’s a full session dedicated to CQRS later in the conference, so if this seems intriguing, please go.
Commands vs Queries
- Write and Read requests are handled by different objects and different routines within your system: actual POGO or POJO objects encapsulating your incoming commands and your outgoing reads
- Ideally, different data repositories
- Event Sourcing tends to go hand in hand with CQRS
-as we’ll see later, some of the terms overlap
- in this talk, I explicitly focus on ES, as I think it has a greater
That, I think, is the basis for what Event Sourcing is.
It may be a bit early, but are there any questions so far?
Then Let’s move on to the next section, Challenges or Difficulties with ES.
Or as I like to call it…
Right now, you may be suspicious.
You may be thinking:
“I mean, what you’re describing sounds like a ton of extra work to implement.”
-Not to mention a ton of overhead in processing these events, even if we do make good use of these snapshots you mention
How can you operate in a world without Models?!
And yes, that’s true.
*pause*
Furthermore, here’s some more bad things:
Storing every event that occurs within your system will almost certainly require more storage space
your database is going to start looking a bit cluttered, and you will repeatedly ask yourself why the heck you keep all this stuff around… who’s going to go back and look at all those old books anyway?
Now this is a truly difficult one.
- Will have reduced Database Level Constraints, like Foreign Keys, null checks, unique checks, etc.
- Instead, we have to rely on our software for transactions and these database constraints
-This is usually where I lose the more seasoned developers.
-Because our properties tend to be transients, serialized within the event, we lose things like foreign key constraints at the db level
-Instead, in Event Sourcing, these checks tend to move within Transactional blocks within your code. If you’re using Grails or Spring, just slap on an @Transactional annotation
Finally… ES can also be difficult for Junior Engineers
I’ve noticed that people really cling to the Model View Controller way of life.
- This is a very different way of building our applications, particularly for the web
Recommending a different structure for the Model can make people wary.
- Telling people that their Views become a “Transient object derived from the event stream” scares them
As crazy as this all might sound, I argue that Event Sourcing actually has huge Benefits
Next Up, “don’t worry, ES is worth it”
First, Going back to the concept of a transaction or an Audit Log…
Why is it that I’d want something fancy like ES, when I can audit a log file or look at my transaction log?
There’s a subtle difference between an Audit Log and an Event Stream.
Audit logs tell the history of what happened in your system or what was persisted to the database.
Events tell the story of what your users and systems actually did, expressed in your domain’s own terms.
Furthermore, Having the Events as a first-order member of your platform can give you enhanced information around what your users or systems are doing beyond what might normally get written to the database.
We can make events that don’t necessarily deal with the properties changed by a user, but with additional actions that may have occurred
And it’s easier to work with and analyze the data if the events are integrated within your platform already.
Incidentally, an event object typically should have attached to it information about the user which generated the event, which also makes ES a perfect Audit Log
Data storage is crazy cheap. Last I looked, AWS basic SSD storage was $0.013 per gigabyte-hour, which equates to…
If you’re at the point where those pennies matter, you probably have bigger problems.
next.
What I find interesting is that Event Sourcing, or a non-digital analogue of it, is used by every ‘mature’, or long running business.
Just Like I went over in the beginning, banks and accounting methods operate with Event Sourcing
Banks even record snapshots of your balance in an additional column, beyond the credits and debits
Lawyers!
If a contract needs to be adjusted, is the contract thrown out and re-written? No. Rather, ‘addendums’ are placed on the contract.
To figure out what the contract actually says, one has to read the initial contract and then each successive addendum, in order.
In addition, all business problems can be successfully modeled with - and benefit greatly from - event sourcing
How many of you all have delete statements in your code?
Even if you don’t, every time you update a row in a database and overwrite some column, you’ve just lost information
Remember: there is no delete, ES is the only structural model which does not lose information.
Event Sourcing simplifies Testing and Debugging. A bold claim, I know.
Testing is easier / simplified
with ES, you unit test the events, then later you can simply assert that specific events are applied during integration testing
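A sketch of what those two levels of testing can look like, with hypothetical names of my own. First the event’s apply logic is unit-tested in isolation; then an integration-style check simply asserts that the command emitted the expected event object:

```java
import java.util.ArrayList;
import java.util.List;

// The event's behavior is a pure function: trivial to unit test.
record PriceIncreasedEvent(long byCents) {
    long applyTo(long priceCents) { return priceCents + byCents; }
}

class Product {
    long priceCents = 0;
    final List<Object> emitted = new ArrayList<>();

    // The command both applies the event and records that it was emitted,
    // so integration tests can assert on the emitted events themselves.
    void increasePrice(long byCents) {
        PriceIncreasedEvent event = new PriceIncreasedEvent(byCents);
        priceCents = event.applyTo(priceCents);
        emitted.add(event);
    }
}
```

Because records compare by value, asserting “this exact event was emitted” is a one-liner.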
In addition, debugging is easy, because we have the entire history of our data.
We can look back through our Aggregates’ timelines <next> and examine them at any point in history.
Thus I can see what the historical state of the aggregate… or all my aggregates… was at a particular point in time, along with how it reached that state and who caused those changes.
ahem… I’m sure you’re all keenly aware, but 2015 is the year they visited in this movie. “Where’s my Hoverboard?!”
Anyway.
If we at some point note that there’s an error or discrepancy in our data…
debugging or tracing the error is a snap
-We can find the faulty or conflicting event, know who executed it, when they executed it, and what lead up to the bad state.
And then we can emit a new event to ‘patch’ the issue
If we want to get even crazier, we could go to a specific point in our data’s timeline, then fire fake events in order to simulate alternate timelines.
pause
This has interesting applications for, say, A/B testing, stress testing, and disaster testing.
If any of these past few slides reminds you of git… well, how astute. Git is like recursive event sourcing. Ever look at the reflog?
Event sourcing is the ideal storage mechanism for business analysis
-because Event Sourced systems do not lose data, they’re future proofed against any crazy reports that your business analysts may need in the future
suppose one such analyst came to your Ecommerce / shopping cart team asking for… all shoppers who add items to their cart and then remove them within 5 minutes. They want to know who, and which products
With non-ES storage, done the naive way I mentioned earlier, you might have to build some sort of tracking table, or mark additional rows with a timestamp… I dunno.
Regardless, then you deploy… and then wait for the data to gather, as users add and remove products
With Event Sourcing, you write a query for that report, you deploy… and then what do you have? If you’re thinking “all of the data, obviously”… well, you’re wrong.
You have MORE than everything. We can generate the report for how it would look at every point in our history.
Which makes the company and your business analysts extremely happy. There’s nothing they like better than a good report.
Querying over the events - presenting different Views on them - is often called a projection
Perhaps the biggest advantage of ES for me.
look at specific events across one stream
look at specific event types across all streams
I don’t have to query on properties of our domain objects… we can look for patterns in our event stream.
Grabbing the Current State of an aggregate is a projection, and probably the easiest: take all events for an Aggregate, in order.
There’s a good deal else to find. In our shopping cart example:
find items in cart for any given date or time range
find items that were removed for any given date or time range
find average rate of items removed vs items purchased for any given…
find average duration between items being added and then being removed for any…
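That earlier analyst request - “all shoppers who add items and then remove them within 5 minutes” - is just another fold over the stream. A hypothetical sketch (my own names, not a library API); note that no schema change is needed, only a new projection:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A timestamped cart event, as it might come out of the journal.
record TimedCartEvent(String type, String sku, Instant at) {}

class QuickRemovalProjection {
    // Which SKUs were removed within the given window of being added?
    static List<String> find(List<TimedCartEvent> stream, Duration window) {
        Map<String, Instant> addedAt = new HashMap<>();
        List<String> quickRemovals = new ArrayList<>();
        for (TimedCartEvent e : stream) {
            if (e.type().equals("ItemAdded")) {
                addedAt.put(e.sku(), e.at());
            } else if (e.type().equals("ItemRemoved")) {
                Instant added = addedAt.get(e.sku());
                if (added != null && Duration.between(added, e.at()).compareTo(window) <= 0) {
                    quickRemovals.add(e.sku());
                }
            }
        }
        return quickRemovals;
    }
}
```

And because the journal is complete, this report works retroactively, over every cart that ever existed.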
Model the relation between our users and the role and state within our platform
Can certainly tell you All of the current FMRs or Agents
which one of us made those transitions
a timeline of each agent’s transition within the program
applications with a long gap between entering the system and being wait listed or interviewed, to see how long a candidate waits until we contact them
how long on average, an agent lasts before being fired and/or average time for agents that have quit
Turnover for a date range
And I can tell you that information at ANY POINT IN TIME. e.g. the average quit rate might be different now than 6 months ago, for example
that’s all I could think of off the top of my head
Which is amazing, right? Just from that one relation.
What would happen.. if I started to correlate other event streams?
<Pause>
Even after this presentation, if you’re still skeptical of the benefits… and you think this is the silliest thing you’ve ever heard of
..be aware that the decision can be out of your hands.
Event Sourcing is often chosen or driven by Management out of business needs, and ‘hacky’ analogues are shoehorned into an existing system AFTER the fact.
I know I have
And now we get to a particular hairy topic…
How am I going to handle all of these events and find what I’m looking for?
All queries within ES are often referred to as Projections over the Event Stream. This includes the concept of the current state.
Returning the current state of your aggregates is easy. In other words, a lookup by Id is simple.
Load the aggregate or aggregates by id, load all of the events and replay the events to get back to current state
Fairly easy and straightforward
Typically when working with relational data, we’ll either say something like “get me the object Foo with id x”, which maps well to the current state of an object.
But not so much with Events. I’ll get to this in a minute, but an event is difficult to query.
Queries like “Find all Foos with active=true and date between 2 values” is easy in standard ‘normal’ form databases.
One alternative is a read model
maintain a synchronized copy of your aggregates in another table; always sync current state into it, and then query that table
Analogous to a database ‘View’
I generally like to keep any data synchronization to a minimum, but this approach can be an attractive convenience for current state searches
In fact, that’s exactly what the Event Store database does.
One has to write Projections using Streams within a web interface, which then become queryable by clients.
However, as you can probably see, this is fairly difficult. The development team will need to spend time writing the projections.
If you have analysts on your team that are used to writing sql, well, it’s going to be much more difficult for them.
Consider feeding events into additional services or tools, particularly those that are stream friendly, like Splunk or Apache Storm
first let’s discuss the theoretical approaches.
Pure Event Sourcing is fairly simple in terms of implementation
There’s really only 3 base objects that you have to worry about.
-First up, the Aggregate. You have the id, which should be a UUID, the current revision number, and the type (or ‘clazz’ if you’re working with java)
-The current revision number is used for optimistic locking and to see how advanced our aggregate is.
-The type is used by our system when we want to load the aggregate into a more meaningful class in the system, say, a SubClass of Aggregate. In our example, the ShoppingCart class would SubClass from Aggregate
Next up, Event
-id, revision, aggregate_id, the date with timestamp, the type, the user id, and then ‘data’
-data is a serialized representation of the event type’s properties. Generally, JSON or XML are recommended as the storage format for the data column.
- this could also be a more efficient mechanism, too, like Google’s Protocol Buffers or Apache Avro
All Events should be named in the past tense, as they should reflect something successful that happened in the PAST
Lastly, we have Snapshot.
Again, we want to serialize the properties of the aggregate at that moment in time.
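Those three shapes, sketched as plain records. The column names are illustrative, and ‘data’ holds the serialized payload (JSON, XML, etc.):

```java
import java.time.Instant;
import java.util.UUID;

// Aggregate: identity, optimistic-locking revision, and concrete type.
record AggregateRow(UUID id, long revision, String type) {}

// Event: ordered by revision within its aggregate, with an audit trail
// (who, when) and the serialized event properties in 'data'.
record EventRow(UUID id, long revision, UUID aggregateId, Instant occurredAt,
                String type, UUID userId, String data) {}

// Snapshot: the aggregate's serialized state as of a given revision.
record SnapshotRow(UUID aggregateId, long revision, Instant takenAt, String data) {}
```

Three tables, three shapes; that really is the whole persistence model.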
Next,
I should be clear about what - exactly - is being serialized into those data fields.
Aggregates and Events, or at least classes that implement Aggregate and Event, contain, themselves, transient properties which are generally not persisted to the database.
Plain old object with explicit transient properties
each has corresponding event or events
The Event itself also has transient properties; their values are what get serialized and persisted to the database.
Also, if anyone notices that I’m using JPA annotations and I earlier mentioned that this is an alternative to ORMs… appreciate the irony. This is from a small demo app.
Events Modify Transient values on the Aggregate
It’s almost a Visitor pattern. As Events are generated, they are applied to an Aggregate.
Aggregates are built up - or, in the case of loading an aggregate, built back up - event by event.
my actual aggregate class may have several properties; however, they are all transient, in the sense that they are not persisted local to the aggregate… e.g. not in the same table.
When the aggregate is first created, all of these transients are at their default value, and the playback of the events will restore them to whichever point in time I want.
shopping cart -> Order placement should only charge the credit card the first time the event is created, not again on every replay
In addition, you’ll also need a service layer to store and load the events and aggregates
it must remember to load events in order of their revision number for the correct aggregate, and initiate the event serialization and de-serialization processes
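The heart of that service layer is the replay loop. A hypothetical sketch, stripped down to a toy counter aggregate so the ordering concern is visible:

```java
import java.util.Comparator;
import java.util.List;

// A toy aggregate: its state is rebuilt purely from events.
class Counter { int value = 0; }

interface CounterEvent {
    long revision();
    void applyTo(Counter counter);
}

record IncrementedEvent(long revision, int by) implements CounterEvent {
    public void applyTo(Counter counter) { counter.value += by(); }
}

class EventReplayService {
    // Load path: sort the aggregate's events by revision, apply each in turn.
    static Counter load(List<? extends CounterEvent> events) {
        Counter counter = new Counter();
        events.stream()
              .sorted(Comparator.comparingLong(CounterEvent::revision)) // order matters
              .forEach(e -> e.applyTo(counter));
        return counter;
    }
}
```

Even if the journal hands events back out of order, sorting by revision before applying guarantees a deterministic rebuild.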
And those are the basics. That’s not too bad, eh?
Unfortunately… there are a few more practical considerations to go over that are a reality for any real ES system.
While snapshots seem awesome, take them only rarely. Greg Young, one of the largest voices in the ES community, claims that he doesn’t snapshot an aggregate until it reaches 1000 events.
You have to juggle the time cost of the additional query for the snapshot versus the processing of the small event objects.
I intentionally made this slide hard to read to emphasize the point: querying for specific properties within events is tough. See that nice blob of JSON in the data column?
As an aside: We use Postgres, which I love, which has two dedicated types for working with JSON. However, I also love jOOQ and JPA, both of which cannot work with those types. If anyone knows of a way to get those working with Postgres JSON, please let me know.
First, the ‘Synced’ pattern. This involves keeping a ‘standard’ synchronized, current state representation of your aggregates.
All writes go into the Event log, and then the Domain object is updated.
All queries done against the ‘standard’ Domain objects.
This is a read model
Downside is that you have to do an additional write in sync with your event stream, which can be a pain.
Second, and my preferred pattern, is what I call the ‘Hybrid’ approach
Use multi table inheritance to give each of your aggregates their own table. Add each of the properties you’d like to index or add database level checks, like foreign keys. It doesn’t have to be all the properties. Your aggregate maintains the current state of your data at the db level
Now, the advantage of this approach is that I get the benefits of both ES and standard relational. I can make use of the standard relational db querying and indexing, plus each of my aggregates is backed by the event journal
I had originally built this out to enumerate many more items, but then realized that they could be grouped into distinct sections.
So, next up…
With the naive use case, here’s my entire schema… or at least, without Snapshot.
The snapshot is very similar to aggregate, just with a data text field
These are the ones I’m aware of that are in active development.
NEventStore, Prooph, and Akka. Akka is interesting, in that every object you work with is persisted as an event stream, but it doesn’t explicitly call itself Event Sourcing.
Of these, I’d probably recommend Akka, provided you’re on the JVM. The persistence component is available as a standalone jar
When looking for a storage mechanism, there are many options available to you, and, for the most part, you’ll be fine no matter which database option you choose.
Now, here are some of the better options
First, EventStore, the database. Highly specialized for working with events and generating projections.
Written By Greg Young, who is perhaps the most well known person in the Event Sourcing community.
A time series database like InfluxDB is a great choice as an event store
Chiefly intended as a message broker, it also maintains a journal of all your events.
I haven’t used it directly, but anytime I start talking about Event Sourcing, someone mentions Kafka. I should really check it out.
And of course, you cannot go wrong with good, old-fashioned relational databases.
I would suggest sticking with a standard relational database if you’re already using one.
Switching to something like Event Sourcing is already enough change.
A quick aside.
First, I’d like to point out that we at ThirdChannel have open sourced a small library that we’re using internally for doing Event Sourcing on the JVM.
written in Groovy and RxJava
Event Sourcing is an additive only, lossless data storage pattern which has insanely high potential for data analysis. It is, however, tricky to work efficiently with.
I wouldn’t recommend it for certain applications; say small static content information websites (e.g. a restaurant or a business’ marketing site). Nor does it make sense to apply to every domain object in a system. However, key data in your application that you use to drive your business can benefit greatly from this approach.