Compared to a traditional analysis-centric data hub, today's data platforms need to fulfill many different use cases. The need for real-time, transport-agnostic data protocols has become a crucial requirement shared across many of them.
During this talk, we will discuss our approach to bootstrapping and bounded context subscription, leveraging a mix of open source technologies and home-grown services aimed at providing a full end-to-end solution.
We will demonstrate and discuss our use of Kafka, Spark Streaming and Akka to orchestrate a unified data transfer protocol that frees developers from having to listen to and process events within their bounded contexts. More specifically:
- Leveraging Kafka as the source of truth
- Topic serialization formats, Avro, and retention rules
- Using Kafka's distributed commit logs to produce durable datasets
- Log compaction and its role in service bootstrapping
- Ingestion at scale: consuming data in different formats across different teams at different latencies
- Using Spark Streaming and Akka to implement near real-time replication (a minimal consumer sketch follows this list)
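As a hedged illustration of the last bullet, here is a minimal Scala sketch of a Spark Streaming job consuming a Kafka topic via the spark-streaming-kafka-0-10 integration. The broker address, topic name, group id, and String serialization are assumptions for the example, not the talk's actual configuration (the real pipeline uses Avro and more orchestration around it).

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object ReplicationStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("replication-stream").setMaster("local[*]")
        // 500 ms micro-batches: the "near real-time" latency the talk is comfortable with
        val ssc = new StreamingContext(conf, Milliseconds(500))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",          // assumed broker address
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "bounded-context-replicator",
          "auto.offset.reset"  -> "earliest"                  // bootstrap from the beginning of the log
        )

        // Subscribe to the topic acting as the source of truth (topic name is illustrative)
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("wishlist-events"), kafkaParams)
        )

        // Hand each fact to the service's own bounded context (placeholder: just print it)
        stream.foreachRDD { rdd =>
          rdd.foreach(record => println(s"${record.key} -> ${record.value}"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }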
"Indexes are awesome and we need them; they make lookups fast! Why would you want to scan all the data if you already know what you want? That's dumb."
- Dustin Vannoy
MUTATION vs. FACTS

State mutation:
UPDATE wishlist SET qty=3 WHERE user_id=121 AND product_id=123

Fact:
At 2:39pm, user 121 updated his wish list, changing the quantity of product 123 from 1 to 3.
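The same change can be captured as an immutable fact instead of an in-place mutation. Below is a minimal Scala sketch of such a fact; the class name, field names, and exact timestamp are illustrative, not the event schema used in the talk.

    import java.time.Instant

    // An immutable fact: a record of what happened, which can be stored,
    // replayed, and consumed by any bounded context.
    final case class WishlistQuantityChanged(
      userId: Long,
      productId: Long,
      oldQty: Int,
      newQty: Int,
      occurredAt: Instant
    )

    // The change performed by the UPDATE above, expressed as a fact.
    // The date is illustrative; the slide only says "at 2:39pm".
    val fact = WishlistQuantityChanged(
      userId = 121, productId = 123, oldQty = 1, newQty = 3,
      occurredAt = Instant.parse("2016-05-12T14:39:00Z")
    )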
Generate a lot of data and leverage it to make the product better
Single process, single codebase and development.
Split the monolith into many processes, with different codebases and different deployment pipelines.
Independently Built
The build process for creating a service should be completely separate from building another service.
Independently Testable
Our microservice should be testable independently of the test lifecycle of other services and components.
Independently Deployable
Our microservice must be independently deployable; this is a fundamental aspect of enabling rapid change.
Independent Teams
Small independent teams owning the full lifecycle of a service, from inception through to its final death.
Independent Data
One of the hardest aspects for the microservice purist to achieve is data independence.
When it comes to being independent, data is usually the sticking point.
Services still need to share data somehow, which raises questions around deployment, contract schemas, deprecation, interconnectivity, etc.
Very rarely will you find a service whose bounded context is so tight that data sharing is secondary. Maybe AuthN services, but even then.
Most services fall into the area where they slice and dice the same core business facts and data; they just slice them differently.
These applications/services must work together.
Services force us to think about what we need to expose and share with the outside world.
Mostly, this is an afterthought.
Future services will become even more interconnected and intertwined.
Because of this, you end up with multiple copies of data across different services that will get out of sync.
The more mutable copies of the data there are, the more divergent that data will become.
What do you do? Keep changing the contracts of services to add more attributes?
Turn your services into DAOs?
A transaction is a sequence of one or more SQL operations that are treated as a unit.
Specifically, each transaction appears to run in isolation, and furthermore, if the system fails, each transaction is either executed in its entirety or not at all.
The concept of transactions is actually motivated by two completely independent concerns. One has to do with concurrent access to the database by multiple clients and the other has to do with having a system that is resilient to system failures.
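To make the "executed in its entirety or not at all" point concrete, here is a minimal JDBC sketch in Scala; the connection string and the second statement (an audit-log insert) are purely illustrative.

    import java.sql.DriverManager

    // A minimal sketch of "a sequence of SQL operations treated as a unit":
    // either both statements take effect together, or neither does.
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/shop", "user", "pass") // assumed DSN
    conn.setAutoCommit(false)
    try {
      val stmt = conn.createStatement()
      stmt.executeUpdate("UPDATE wishlist SET qty = 3 WHERE user_id = 121 AND product_id = 123")
      stmt.executeUpdate("INSERT INTO audit_log(user_id, action) VALUES (121, 'wishlist_update')") // illustrative second step
      conn.commit()        // the unit succeeds as a whole...
    } catch {
      case e: Exception =>
        conn.rollback()    // ...or not at all
        throw e
    } finally {
      conn.close()
    }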
ACID is overkill, or as some would say, old school.
Databases do this really well!
The idea is that you have a copy of the same data on multiple machines (nodes), so that you can serve reads in parallel, and so that the system keeps running if you lose a machine.
This distinction between an imperative modification and an immutable fact is something you may have seen in the context of event sourcing. That’s a method of database design that says you should structure all of your data as immutable facts, and it’s an interesting idea.
However, there’s something really compelling about this idea of materialized views. I see a materialized view almost as a kind of cache that magically keeps itself up-to-date. Instead of putting all of the complexity of cache invalidation in the application (risking race conditions and all the discussed problems), materialized views say that cache maintenance should be the responsibility of the data infrastructure.
A stream of immutable facts is used to segregate reads from writes.
SHARED STATE IS ONLY IN THE CACHE SO THAT DATA CANNOT DIVERGE
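A toy Scala sketch of the materialized-view idea: the read-side "cache" is nothing more than a fold over the stream of immutable facts, so it can always be rebuilt from the log instead of being invalidated by hand. The event type and key layout are illustrative.

    // A toy materialized view: the read side is just a fold over the facts.
    // Re-running the fold over the log rebuilds the view from scratch.
    final case class QtyChanged(userId: Long, productId: Long, newQty: Int)

    def materialize(facts: Seq[QtyChanged]): Map[(Long, Long), Int] =
      facts.foldLeft(Map.empty[(Long, Long), Int]) { (view, fact) =>
        view.updated((fact.userId, fact.productId), fact.newQty)
      }

    val view = materialize(Seq(
      QtyChanged(121, 123, 1),
      QtyChanged(121, 123, 3)
    ))
    // view((121L, 123L)) == 3 -- the latest fact wins, much like a compacted topic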
Let’s talk about Kafka as a commit log / source for a replication stream
Kafka messages have a key and value.
You'll see the benefit if you've ever used a regular message queue.
Data becomes an immutable stream of facts
Log compaction keeps the latest record per key.
History is truncated, but at least the latest version of every key will be present in the log.
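As a hedged example, a compacted topic can be created with Kafka's AdminClient (available in newer Kafka clients); the topic name, partition count, and replication factor below are assumptions.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
    import org.apache.kafka.common.config.TopicConfig

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
    val admin = AdminClient.create(props)

    // cleanup.policy=compact: old values for a key may be discarded,
    // but the latest value for every key is always retained in the log.
    val wishlistState = new NewTopic("wishlist-state", 12, 3.toShort)
      .configs(Collections.singletonMap(
        TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT))

    admin.createTopics(Collections.singletonList(wishlistState)).all().get()
    admin.close()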
What differentiates Kafka from a traditional messaging system
Medium latency
High volume
data flows, SQL
en masse processing
massive scaling - 10,000s of nodes
not for small volumes
rich options for SQL, etc.
Lower limit on latency: 0.5 seconds (we are ok with that)
Failover and lifecycle management come from the cluster itself - restartability (ADD TO WHY SPARK)
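The restartability point maps to Spark Streaming's checkpointing: on a clean start the context is built from scratch, and after a failure it is recovered from the checkpoint directory. A minimal Scala sketch follows; the checkpoint path is an assumption.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/replication-stream"   // assumed location

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("replication-stream")
      val ssc  = new StreamingContext(conf, Milliseconds(500))     // the 0.5 s micro-batch floor above
      ssc.checkpoint(checkpointDir)
      // ... define the Kafka input stream and processing here, as in the earlier sketch ...
      ssc
    }

    // First run: createContext() is called. After a crash: the context (offsets,
    // in-flight state) is rebuilt from the checkpoint, which is the restartability
    // this slide refers to.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()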
Why we chose Akka and Scala
Distributed systems
Functional paradigm and datasets
Akka is really the backbone of this platform.
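For completeness, a minimal classic Akka actor in Scala, showing the kind of lightweight, message-driven worker such a platform composes; the actor and message names are illustrative only.

    import akka.actor.{Actor, ActorSystem, Props}

    // One lightweight, message-driven worker of the kind the platform
    // composes into its replication pipeline.
    class BoundedContextWriter extends Actor {
      def receive: Receive = {
        case fact: String =>
          // apply the replicated fact to this service's own store (placeholder)
          println(s"applying fact: $fact")
      }
    }

    object Main extends App {
      val system = ActorSystem("data-platform")
      val writer = system.actorOf(Props[BoundedContextWriter], "bounded-context-writer")
      writer ! """{"userId":121,"productId":123,"qty":3}"""
    }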