The talk focuses on a log-based architecture ("The Polylog") we've developed to handle data change capture, so that we can easily build new services and databases on top of other services' full datasets. Some of the tools we'll cover include Debezium for database change capture, Kafka for storing the logs, and the Denormalizer, an in-house tool we built to do left joins on streams.
4. JW Player
1. Established - 2008
2. Headquarters - NYC (2 Park Ave)
3. Employees - 200+
4. Business Model - SaaS
5. JW Player Footprint: 5%+ of all video on the web
5. Data @ JW Player
1. 1Bn video hours consumed per month
2. 1Bn unique viewers per month
3. 5MM analytics events per minute
4. 3TB of logs per day
6. Data teams at JW Player
1. Pipelines - ingestion, pipelines, infrastructure
2. Discovery - recs & search in production
3. Insights - customer dashboards
4. Media Intelligence - media metadata extraction
5. Data Science - R&D, instrumentation, predictive modeling
8. JW Player is breaking up its monolith
1. JW Player is moving to a Service Oriented Architecture (SOA)
2. SOA promotes loose coupling between services
3. Part of the roadmap is to break up our monolithic database into separate datastores for faster iteration
9. Some services don’t work under SOA
1. Our data services depend on syncing Elasticsearch with numerous tables from the monolith
2. Traditional API-style architecture doesn’t work for indexing data across many sources or for monitoring data changes:
a. Hard to know when, how, and what changed
b. Hard to maintain consistency
c. Hard to scan the entire dataset
10. Our Mission
We need the ability to perform both iterative updates and full rebuilds of recommendations, simply and efficiently.
12. The Monolog
1. The New York Times solved this problem with a log-based architecture
2. CMSs write to Kafka first, from which other services read and build
3. “Mono” because everything is written to a single Kafka topic and partition
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
13. The simplicity of logs
The simplest possible storage abstraction: an append-only, totally-ordered sequence of records, ordered by time.
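To make the abstraction concrete, here is a toy in-memory log in Python (purely illustrative; the Log class and record shapes are ours, not Kafka's):

# A minimal append-only log: each record gets the next offset,
# and reads replay the sequence in order from any offset.
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, from_offset=0):
        return enumerate(self._records[from_offset:], start=from_offset)

log = Log()
log.append({"media_id": 1, "title": "intro"})
log.append({"media_id": 1, "title": "intro (updated)"})
for offset, record in log.read():
    print(offset, record)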
14. Apache Kafka: distributed logs
1. Distributed and fault tolerant
2. Stores full history
3. Can replay from beginning
4. Supports log compaction
5. Clients in many languages: JVM, Python, Go
# hello world in Kafka
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "my_consumer",
})
consumer.subscribe(["my_topic"])
while True:
    message = consumer.poll(1.0)  # returns None if nothing arrives before the timeout
    if message is None or message.error():
        continue
    process_message(message)
16. The Polylog
1. Fewer assumptions than Monolog
2. Can be multiple topics, partitions or clusters
3. Easier to scale
4. Ability to create a consistent view of denormalized data
17. Polylog components
1. Producers - populating The Polylog
a. Debezium
b. Custom
2. Storage - Kafka
3. Intermediate processors
a. Denormalizer
b. Custom
4. Consumers - consuming from The Polylog
19. Debezium: read logs from the database
1. Reads operation logs from various databases (MySQL, Postgres, Mongo, etc.) and writes them to Kafka
2. Minimal setup - registering a connector is one REST call (sketch below)
3. Every table becomes a topic
4. Handles schema changes
5. Configuration options (e.g. table whitelist, column blacklist)
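A minimal sketch of registering a Debezium MySQL connector with the Kafka Connect REST API. The hostnames, credentials, and table names are made up, and the whitelist/blacklist keys shown match older Debezium releases (newer ones rename them to table.include.list / column.exclude.list):

# Register a Debezium MySQL connector via the Kafka Connect REST API.
import requests

connector = {
    "name": "monolith-connector",                  # hypothetical name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "monolith-db",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "monolith",        # topic prefix: monolith.<db>.<table>
        "table.whitelist": "jw.media,jw.media_tags",
        "column.blacklist": "jw.media.internal_notes",
        "database.history.kafka.bootstrap.servers": "my-kafka:9092",
        "database.history.kafka.topic": "schema-changes.monolith",
    },
}

resp = requests.post("http://my-connect:8083/connectors", json=connector)
resp.raise_for_status()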
20. Custom Producers
1. Debezium is not appropriate for all use cases
2. We have custom producers writing to the Polylog (sketch below):
a. Derived data (e.g. algorithm results)
b. Producers requiring business logic
c. Kafka as source of truth
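A custom producer can be only a few lines. Here is a sketch of one writing derived data (algorithm results) to a hypothetical recommendations topic, keyed so log compaction keeps the latest result per media item:

# Write derived records to the Polylog with confluent_kafka.
import json
import confluent_kafka

producer = confluent_kafka.Producer({"bootstrap.servers": "my-kafka:9092"})

def on_delivery(err, msg):
    # Called once the broker acknowledges (or rejects) the write.
    if err is not None:
        raise RuntimeError(f"delivery failed: {err}")

record = {"media_id": 42, "recommended": [7, 13, 99]}
producer.produce(
    "recommendations",
    key=str(record["media_id"]),     # compaction keeps the latest value per key
    value=json.dumps(record),
    on_delivery=on_delivery,
)
producer.flush()                     # block until the write is acknowledged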
21. Denormalizer: left joins on streams
1. Joins records across multiple topics (sketch below)
2. Creates full denormalized records (e.g. media with tags)
3. Generic schema
4. RocksDB state store with AWS S3 backup
5. Looking to open source it
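Since the Denormalizer isn't open source yet, here is a rough sketch of the idea, with a plain dict standing in for the RocksDB state store and made-up topic names and record shapes:

# Left join across two topics: buffer both sides by key locally,
# and emit a denormalized record whenever the left side is present.
import json
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "denormalizer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["media", "media_tags"])
producer = confluent_kafka.Producer({"bootstrap.servers": "my-kafka:9092"})

media, tags = {}, {}                     # dicts stand in for RocksDB

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    key = record["media_id"]
    if msg.topic() == "media":
        media[key] = record
    else:
        tags.setdefault(key, []).append(record["tag"])
    if key in media:                     # left join: media drives the output
        joined = dict(media[key], tags=tags.get(key, []))
        producer.produce("media_denormalized", key=str(key),
                         value=json.dumps(joined))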
23. Consumers: stream to other datastores
1. Read denormalized records
2. Transform them into the expected format
3. Write transformed records into another datastore (e.g. Elasticsearch) - see the sketch below
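A sketch of such a consumer targeting Elasticsearch (the index name, document transform, and 7.x-style client calls are assumptions):

# Stream denormalized records from Kafka into Elasticsearch.
import json
import confluent_kafka
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://my-es:9200"])
consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "es-indexer",
    "auto.offset.reset": "earliest",     # full rebuilds replay from the beginning
})
consumer.subscribe(["media_denormalized"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    doc = {"title": record["title"], "tags": record["tags"]}   # transform step
    es.index(index="media", id=record["media_id"], body=doc)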
26. 2. Kafka as primary source of truth
a. Write to Kafka first
b. Can have multiple consumers
c. At-least-once delivery guarantee (sketch below)
d. Guarantees consistency - avoids the dual-write problem
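The at-least-once guarantee follows from committing offsets only after processing. A minimal sketch (process_message is a placeholder, as in the earlier hello-world example):

# Disable auto-commit and commit each offset only after the record
# has been fully processed; a crash replays the record instead of
# losing it (at-least-once).
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "my_consumer",
    "enable.auto.commit": False,
})
consumer.subscribe(["my_topic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process_message(msg)                              # side effects first...
    consumer.commit(message=msg, asynchronous=False)  # ...then commit the offset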
27. 3. Database migrations
a. Avoid dual write issues!
b. Stand up new service while old service still active
c. Seamless switch - no hard cutover
29. 5. Disaster recovery and fault tolerance
a. Kafka retention means we have an audit trail we can replay (sketch below)
b. Examples:
➢ Accidentally overwriting data in an upstream database
➢ Debugging how data changed over time
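Replaying history for recovery needs no special tooling: a consumer with a fresh group and auto.offset.reset set to earliest rereads everything Kafka has retained. Group and topic names here are illustrative:

# Rebuild state by replaying a topic from the oldest retained record.
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "rebuild-after-incident",   # new group => no committed offsets
    "auto.offset.reset": "earliest",        # so consumption starts at the beginning
})
consumer.subscribe(["media_denormalized"])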
30. 6. New services based on other services’ datasets
a. “Don’t be a salmon!” - don’t talk directly to upstream services
b. The Polylog is a single data source that multiple consumers can work off of
c. Helps when you need a service that can’t rely on basic API calls
32. Use log-based architectures!
1. Build data models from disparate data sources
2. Kafka as primary source of truth
3. Database migrations for SOA
4. Data change monitoring
5. Disaster recovery and fault tolerance
6. Building new services based on other services’ full datasets