Why Stream Data as Part
of Data Transformation
Glen Gomez Zuazo, Senior Solutions Architect
Presenter
Glen Gomez Zuazo, Senior Solutions Architect
● Data Science, Machine Learning, Distributed Systems, Full
Stack Development, Blockchain and Enterprise
Architecture
● Passionate involvement in Diversity and Inclusion
● STEM advocate for young people (Middle and High School)
● Teaching technology (CSSE, AWS and Microservices)
● Spending time with his family, including his dog (Bolillo),
running and camping
Event-Driven Data Architecture in 2019
■ Event-driven architectures are increasingly part of a complete data
transformation solution
■ This talk covers
● details of each
● advantages and disadvantages
● how to select the best for your company’s needs
Prevalent examples
■ Apache Kafka
■ Cloud Native Computing Foundation’s NATS
■ Amazon SQS
■ Lightbend Akka
AWS SQS
Amazon Simple Queue Service
■ Fully managed message queuing service
■ Enables decoupling and scaling of microservices, distributed systems, and
serverless applications by moving from synchronous to asynchronous communication
■ Eliminates the complexity and overhead of managing and operating
message-oriented middleware
SQS: two types of message queues
■ Standard queues: maximum throughput, best-effort ordering,
and at-least-once delivery
■ SQS FIFO queues: guarantee that messages are processed exactly
once, in the exact order in which they are sent
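A minimal sketch of sending to a FIFO queue with boto3 (the queue URL, group ID, and
deduplication ID are placeholders); the message group ID scopes the ordering guarantee,
and the deduplication ID drives exactly-once processing within the dedup window:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
# FIFO queue names must end in ".fifo"; this URL is a placeholder.
fifo_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue.fifo"

sqs.send_message(
    QueueUrl=fifo_url,
    MessageBody='{"order": 42}',
    MessageGroupId="customer-7",           # ordering is preserved per message group
    MessageDeduplicationId="order-42-v1",  # duplicates within the dedup window are dropped
)
```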
SQS Functionality
■ Unlimited queues and messages
■ Payload
● Up to 256KB of text in any format
● Each 64KB ‘chunk’ of payload is billed as 1 request
● (E.g. 256KB payload is billed as four requests)
● Use Amazon SQS Extended Client Library for Java to send messages >256KB
● Extended Client Library uses Amazon S3 to store the message payload
■ Batches
● Send, receive, or delete messages in batches of up to 10 messages or 256KB
● Batches cost the same amount as single messages
● More cost-effective for customers (see the batch-send sketch below)
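Below is a minimal batch-send sketch with boto3; the queue URL and message bodies are
placeholders:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

# Up to 10 messages per SendMessageBatch call; each Id must be unique within the batch.
entries = [{"Id": str(i), "MessageBody": f"event-{i}"} for i in range(10)]

response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)

# Partial failures are reported per entry rather than raised as an exception.
for failed in response.get("Failed", []):
    print("retry needed:", failed["Id"], failed["Message"])
```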
SQS Functionality (cont’d)
■ Long polling
● Reduce extraneous polling to minimize cost while receiving new messages as
quickly as possible.
● When your queue is empty, long-poll requests wait up to 20 seconds for the next
message to arrive
● Long-poll requests cost the same amount as regular requests (see the receive sketch after this slide)
■ Retain messages in queues for up to 14 days.
■ Send and read messages simultaneously
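A minimal long-polling receive sketch with boto3 (same placeholder queue URL);
WaitTimeSeconds=20 holds the request open until a message arrives or the wait elapses:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

# Long poll: the call blocks for up to 20 seconds waiting for messages,
# which avoids a tight loop of empty receives.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)

for msg in resp.get("Messages", []):
    print("processing:", msg["Body"])
    # Delete only after successful processing; otherwise the message
    # reappears once its visibility timeout expires.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```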
SQS Functionality (cont’d)
■ Message locking
● A received message is locked while a consumer processes it; if processing fails,
the lock expires and the message becomes available again
■ Queue sharing
● Anonymously or with specific AWS accounts
■ Server-side encryption (SSE)
● Keys managed in AWS Key Management Service (AWS KMS)
■ Dead Letter Queues (DLQ)
● Must be the same queue type as the source queue (standard or FIFO); see the
queue-creation sketch below
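A hedged sketch of creating a queue with SSE (using the AWS-managed KMS key) and a
redrive policy pointing at a dead-letter queue; the DLQ ARN, maxReceiveCount, and
visibility timeout are illustrative:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Assumed: a dead-letter queue already exists; its ARN is a placeholder here.
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:example-queue-dlq"

sqs.create_queue(
    QueueName="example-queue",
    Attributes={
        # Server-side encryption with the AWS-managed SQS key in KMS.
        "KmsMasterKeyId": "alias/aws/sqs",
        # After 5 failed receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
        # How long a received ("locked") message stays hidden from other
        # consumers while it is being processed.
        "VisibilityTimeout": "30",
    },
)
```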
Publish-Subscribe for Application Integration
● Exchange Data Asynchronously
● Be Independent and fault-tolerant
● Allow Systems to be in different environments (OS, Language)
Messaging Patterns
Message queuing
Publish-subscribe (pub-sub)
NATS
NATS
■ High-performance, cloud-native messaging system
■ Provides a foundational messaging layer on which you can build both
synchronous and asynchronous, reliable, highly available systems
■ The 2.0 release adds major features for high availability and security
● Not to be confused with NGS, the Synadia commercial offering
Let’s cover the details of how we plan to deploy and configure NATS, with
a special focus on HA and security.
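Before diving into HA and security, here is a minimal core pub/sub sketch with the
nats-py client; the server URL and subject name are placeholders:

```python
import asyncio
import nats

async def main():
    # Assumed: a NATS server reachable at the default local address.
    nc = await nats.connect("nats://127.0.0.1:4222")

    # Subscribe: the callback fires for every message on the subject.
    async def handler(msg):
        print(f"received on {msg.subject}: {msg.data.decode()}")

    await nc.subscribe("orders.created", cb=handler)

    # Publish is fire-and-forget; core NATS delivery is at-most-once.
    await nc.publish("orders.created", b'{"id": 42}')

    await nc.flush()
    await nc.drain()

asyncio.run(main())
```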
High Availability
■ Deploy a NATS cluster as a global entity, with NATS gateways used to
connect multiple regions. Both NATS and the system proper will be
deployed active/active.
■ It is assumed that there is a geographically pinned single point of
entry into each cluster in all of these scenarios as per standard AWS
practices.
■ In "classic" active-active scenarios, you have two or more completely
isolated mirrors.
Sharing Streams and Services
■ The NATS account model also comes with an explicit, secure-by-default
means of allowing communication between accounts.
● Account owners can export either a stream (write-only from the account,
read-only to subscribers)
● or a service (read/write)
■ A service or stream can be exported as either
● a public export (any authorized account can import that subject), or
● a private export (requires an explicit, out-of-band delivery of an activation token)
Security and Multi-Tenancy
■ Main considerations / concerns in a multi-tenant system that sits on top of a
central messaging system
● Security of clients and the message traffic
● Configuration maintenance.
● The complexity of multiple multi-tenant systems running in the same
cluster (e.g. K8s tenants co-existing with ECS tenants)
■ In a decentralized model, clients authenticate to NATS with signed user JWTs.
There is a hierarchy that goes from Operator to Account to User.
■ In NATS, an account is a unit of isolation and a user is a unit of client
authentication and authorization.
RabbitMQ
RabbitMQ
■ Messages are published to queues (through exchanges).
■ Multiple consumers can connect to a queue.
■ The message broker distributes messages across all available consumers.
■ Messages can be re-delivered if a consumer fails.
■ Delivery order is guaranteed for queues with a single consumer (this is
not possible when the queue has multiple consumers).
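A minimal sketch with the pika client showing a producer and a competing consumer on
one queue; the host, queue name, and message body are placeholders:

```python
import pika

# Assumed: a RabbitMQ broker running locally with default credentials.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queue; the default ("") exchange routes by queue name.
channel.queue_declare(queue="tasks", durable=True)

# Producer side: publish a persistent message.
channel.basic_publish(
    exchange="",
    routing_key="tasks",
    body=b"do-some-work",
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)

# Consumer side: the broker distributes messages across all consumers on the
# queue; messages that are not acknowledged are redelivered on failure.
def on_message(ch, method, properties, body):
    print("processing:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="tasks", on_message_callback=on_message)
channel.start_consuming()
```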
Architecture Considerations
■ Performance:
● RabbitMQ handles around 20,000 messages/second
■ Processing:
● Consumers are FIFO-based, reading from the head of the queue and processing messages one by one
■ HA
● Provides high-availability support
■ Open source
● RabbitMQ is open source under the Mozilla Public License
Architecture Diagram
Apache Kafka
Kafka
■ We use Apache Kafka to enable communication between producers and
consumers using message-based topics. Apache Kafka is a fast, scalable,
fault-tolerant, publish-subscribe messaging system.
■ It provides a platform for a new generation of large-scale distributed
applications, and it supports a large number of permanent or ad-hoc
consumers.
Architecture
■ Kafka Producer API
● Permits an application to publish a stream of records to one or more Kafka topics.
■ Kafka Consumer API
● To subscribe to one or more topics and process the stream of records produced to
them in an application
■ Kafka Streams API
● Allows an application to act as a stream processor
● Consumes an input stream from one or more topics
● Produces an output stream to one or more output topics
● Effectively transforms input streams into output streams
■ Kafka Connector API
● Allows building and running reusable producers or consumers that connect Kafka
topics to existing applications or data systems
● Example: a connector to a relational database might capture every change to a table
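A minimal producer/consumer sketch with the kafka-python client (broker address,
topic, and group ID are placeholders); the Streams and Connect APIs are JVM-side and
not shown here:

```python
from kafka import KafkaConsumer, KafkaProducer

# Assumed: a broker at localhost:9092 and a pre-created topic "events".
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()  # block until buffered records are delivered

# Consumers sharing a group_id split the topic's partitions (consumer group semantics).
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```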
Architecture Diagram
Data Transformation - Architecture
Scylla + Kafka Users — just at Scylla Summit!
Scylla Summit 2018 Presenters
■ Discord
■ Faraday Future
■ GE
■ Grab
■ Natura
■ Nauto
■ Numberly
Scylla Summit 2019 Presenters
■ Lookout
■ Nauto
■ Numberly
■ OlaCabs
■ SmartDeployAI
■ Zeotap
Takeaway
Architectural Message Review Example
We follow a process to decide which technologies and patterns will be
applied, based on the specific requirements of the system.
We perform the following steps:
■ System Requirements
■ ASR (Architecturally Significant Requirements)
■ ADR (Architecture Decision Record)
■ System Context and Data Flow
■ PoC
■ MVPx
Architecturally Significant Requirements
Architecturally Significant Requirements (ASR) have a measurable effect on a system's
architecture, which includes application and infrastructure.
ASR Criteria
Requirements that have wide effects, are strict, or are difficult to achieve are often ASRs. Per the Wikipedia article
on ASRs, some common indicators that a requirement is an ASR are:
■ The requirement is associated with high business value and/or technical risk.
■ The requirement is a concern of a particularly important (influential, that is) stakeholder.
■ The requirement has a first-of-a-kind character, e.g. none of the responsibilities of already existing
components in the architecture addresses it.
■ The requirement has QoS/SLA characteristics that deviate from those already satisfied by the
evolving architecture.
■ The requirement has caused budget overruns or client dissatisfaction in a previous project with a similar
context.
Architecturally Significant Requirements
Categories
We have split our ASRs up into categories to make them easier to read and to allow us to
provide more detail for each requirement. These categories are:
■ Availability
■ Maintainability
■ Observability
■ Performance
■ Resiliency
■ Testability
■ Usability
Architecture Decision Record
■ NATS is an open source, powerful, lightweight, secure-by-default
messaging system.
■ Gives the same kind of delivery control as consumer groups in Kafka,
but without the maintenance overhead and operations cost
■ NATS is essentially self-managing: it doesn’t need anyone to create
new partitions to scale up or down
■ Clusters form themselves and self-heal, and clients are immediately
notified of cluster topology changes.
■ NATS supports traditional request/reply, pub/sub, fanout, and many
more messaging patterns.
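A minimal request/reply sketch with nats-py (server URL and subject are placeholders),
showing a synchronous-style call over the asynchronous substrate:

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://127.0.0.1:4222")

    # Responder: reply to each request received on the subject.
    async def responder(msg):
        await msg.respond(b"pong")

    await nc.subscribe("svc.ping", cb=responder)

    # Requester: blocks until a reply arrives or the timeout elapses.
    reply = await nc.request("svc.ping", b"ping", timeout=1)
    print(reply.data)

    await nc.drain()

asyncio.run(main())
```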
Why did we need a message broker?
Our ASRs lean heavily toward:
■ Resiliency,
■ Stability, and
■ Performance
When doing traditional point-to-point communications you have to do a number of things
that introduce points of failure, possible performance degradation, and loss of stability:
■ Service discovery (what's the address for a service?)
■ Retries and Failure Responses
■ Coping with slow connections and intermittent failure
■ Exponential back-off to avoid cascading failures
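As one example of that per-client burden, here is a generic exponential back-off
helper (not tied to any particular service; the parameters are illustrative):

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.1):
    """Retry a flaky point-to-point call with exponential back-off and jitter."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep 0.1s, 0.2s, 0.4s, ... plus jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```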
Why not Kafka?
Once we decided that we wanted to take advantage of a message
broker and utilize all of the asynchronous power that comes with it,
we needed to pick which broker.
■ We require a low operations burden
■ The ability to scale without delicate reconfiguration
■ Fast request-response performance
Why not RabbitMQ?
Rabbit has a reputation for reliability and speed, and some of the team
members had used it before. One of the main reasons we decided against
Rabbit was the explicit nature of its fanout exchanges.
■ Requires explicit definition of queues, subscriptions, and exchanges
■ Not well suited to multi-tenant systems
■ We needed the ability to add instances/subscribers without reconfiguring the broker
NATS Security
Neither Rabbit nor Kafka gave us the kind of security support we
needed. We needed the ability to explicitly control which clients can
publish to which topics and which clients can subscribe to those
topics.
■ Ability to inject security information without taking the broker down
■ Flexibility to work with nkeys, an asymmetric encryption key system
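A hedged sketch of a client authenticating with a credentials file issued under an
operator/account (e.g. generated with the nsc tool); the server URL and .creds path
are placeholders:

```python
import asyncio
import nats

async def main():
    # Assumed: credentials issued under an operator/account hierarchy;
    # the .creds file bundles the signed user JWT and the nkey seed.
    nc = await nats.connect(
        "tls://nats.example.com:4222",
        user_credentials="/etc/nats/creds/tenant-a.creds",
    )
    # The account's permissions decide which subjects this user may publish
    # to or subscribe on; violations are rejected by the server.
    await nc.publish("tenant-a.orders", b"hello")
    await nc.drain()

asyncio.run(main())
```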
Comparison Matrix
The following is a summary of the satisfaction of requirements for each
of the options.
References
■ Apache Kafka Website: https://kafka.apache.org
■ NATS Website: https://nats.io
■ AWS SQS: https://aws.amazon.com/sqs/
■ RabbitMQ Website: https://www.rabbitmq.com
■ Benchmarking Message Queue Latency:
https://bravenewgeek.com/benchmarking-message-queue-latency/
Thank you. Stay in touch!
Any questions?
Glen Gomez Zuazo
g_gomez_zuazo@hotmail.com
@ZuazoGlen

Editor's Notes

  • #3 Event-driven architectures are increasingly part of a complete data transformation solution. Learn how to employ Apache Kafka, Cloud Native Computing Foundation’s NATS, Amazon SQS, or other message queueing technologies. This talks covers the details of each, their advantages and disadvantages and how to select the best for your company’s needs.
  • #5 Notes: Lightbend Akka is beyond the scope of my analysis for this presentation on Capital One applications, but I know that at least one other presenter is going to be speaking about Akka/Scala: Alexandros Bantis from Tubi.tv. Even though it may have been beyond Capital One's consideration, you may wish to mention it in a roundup of popular solutions.
  • #7 Extra Notes: Send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available. Get started with the AWS console, the Command Line Interface, or an SDK of your choice, and three simple commands.
  • #11 Message locking: When a message is received, it becomes “locked” while being processed. This keeps other computers from processing the message simultaneously. If the message processing fails, the lock will expire and the message will be available again. Queue sharing: Securely share Amazon SQS queues anonymously or with specific AWS accounts. Queue sharing can also be restricted by IP address and time-of-day. Server-side encryption (SSE): Protect the contents of messages in Amazon SQS queues using keys managed in the AWS Key Management Service (AWS KMS). SSE encrypts messages as soon as Amazon SQS receives them. The messages are stored in encrypted form and Amazon SQS decrypts messages only when they are sent to an authorized consumer. Dead Letter Queues (DLQ): Handle messages that have not been successfully processed by a consumer with Dead Letter Queues. When the maximum receive count is exceeded for a message it will be moved to the DLQ associated with the original queue. Set up separate consumer processes for DLQs which can help analyze and understand why messages are getting stuck. DLQs must be of the same type as the source queue (standard or FIFO).
  • #16 In a solution where every service requires NATS to be available in order to function, we clearly need to ensure that NATS meets or exceeds our Top Resiliency Tier level SLAs. To do this, we'll deploy a NATS cluster as a global entity with NATS gateways used to connect east and west. Both NATS and System proper will be deployed active/active. It is assumed that there is a geographically pinned single point of entry into each cluster in all of these scenarios as per standard AWS practices. In "classic" active-active scenarios, you have two or more completely isolated mirrors. These two geolocated clusters are completely unaware of each other. Independent component failure is isolated within a region, and in the case of an entire region failure, routes are updated to direct all traffic to the other remaining regions.
  • #17 The NATS account model also comes with an explicit and secure by default means of allowing communication between accounts. As an account owner, you can export either a stream (write-only from the account, read-only to subscribers) or a service (read/write). When you export your service or stream, you can choose to do so as a public or a private export. A public export allows any authorized account to import that subject. A private export requires an explicit, out of band delivery of an activation token to the account wishing to import. Without this token, an account cannot import a private export. What this boils down to is that, with some facilitation by a service to generate keys and tokens, tenants can manage their own topic namespaces, their own users (connected clients), and their own imports/exports with no manual operations overhead. We get security by default, decentralized configuration, self-service secure message exchange, and a "service marketplace" where account (tenant) owners can browse exported subjects and add requests like a shopping cart.
  • #20 In a multi-tenant system that sits on top of a central messaging system, one of our main concerns was not just the security of clients and the message traffic, but in maintaining configuration. If we had to re-write a configuration file and send an update signal to a server every time we added or removed a tenant, this would become a maintenance nightmare. This would be compounded even more with two multi-tenant systems running in the same cluster (e.g. K8s tenants co-existing with ECS tenants). In a decentralized model, clients authenticate to NATS with signed user JWTs. There is a hierarchy that goes from Operator to Account to User. In NATS, an account is a unit of isolation and a user is a unit of client authentication and authorization. This decentralized security model actually solves a number of other problems we would have inevitably run into.
  • #23 RabbitMQ is an open-source message-broker software (sometimes called message-oriented middleware) that originally implemented the Advanced Message Queuing Protocol (AMQP) and has since been extended with a plug-in architecture to support Streaming Text Oriented Messaging Protocol (STOMP), Message Queuing Telemetry Transport (MQTT), and other protocols. The architecture diagram slide shows the resulting RabbitMQ design.
  • #27 One of the best features of Kafka is that it is highly available, resilient to node failures, and supports automatic recovery. This makes Apache Kafka ideal for communication and integration between components of large-scale, real-world data systems.
  • #37 Point to point operations are generally synchronous, though you can accomplish some decent asynchronous operations with gRPC streaming. Finally, point-to-point means that no interested parties can become aware of communications unless the sender goes out of its way to make multiple P2P connections or emit secondary events. Our thought is if you're going to emit secondary events, why not build the entire substrate out of asynchronous messaging, skipping point to point altogether? Service discovery, especially explicit discovery requiring a discovery broker like Netflix Eureka, introduces a new single point of failure to the entire system and, even when working perfectly, introduces the latency cost of at least one more network hop (if you're caching, then you have to deal with the consequences of outdated discovery data).
  • #38 Because of the history and precedent of using Kafka within Capital One, including its role as the backbone behind the Streaming Data Platform (SDP), we considered using Kafka for our broker. Once we decided that we wanted to take advantage of a message broker and utilize all of the asynchronous power that comes with it, we needed to pick which broker. There are a number of critical reasons why we chose against Kafka. First and foremost, we wanted a low operations burden and Kafka is anything but that. Further, we need the ability to scale our services and to dynamically add new topics and new subscribers live, at runtime, in production, without having to perform delicate reconfiguration. Because of the way Kafka works, we would have to reconfigure partitions and topics manually or through some form of potentially brittle automation. You can't simply scale up and down subscribers and publishers without altering Kafka configuration accordingly. We also needed incredibly fast request-response performance. We wanted the flexibility of an asynchronous substrate without sacrificing synchronous point-to-point performance. We could not get that with Kafka and NATS outperformed Kafka for non-durable messages in every benchmark.
  • #39 Because of the history and precedent of using Kafka within Capital One, including its role as the backbone behind the Streaming Data Platform (SDP), we considered using Kafka for our broker. With Rabbit, clients must explicitly define the queues and subscriptions and exchanges in use when they connect. This can be problematic and create problems in multi-tenant systems. We needed a system where we could dynamically scale the number of instances of a queue subscriber AND add more subscribers to the same queue without negatively impacting existing service or requiring a reconfiguration (manual or automatic) of the message broker.
  • #40 Because the client list is external to the message broker (a 1:1 correlation with tenant services), this security information needs to be injectable into the broker cluster, no matter how many instances of the broker are running, without ever taking the broker down in production. NATS security not only gives us this, but lets us work with nkeys, an incredibly powerful asymmetric encryption key system that is less vulnerable to attack than traditional SSH keys and can allow security information to easily flow from a Kubernetes secret to tenant services and the broker configuration.