The talk focuses on a log-based architecture ("The Polylog") we've developed to handle data change capture, so that we can easily build new services and databases on top of other services' full datasets. Some of the tools we'll cover include Debezium for database change capture, Kafka for storing the logs, and the Denormalizer, an in-house tool we built to do left joins on streams.
4. JW Player
1. Established - 2008
2. Headquarters - NYC (2 Park Ave)
3. Employees - 200+
4. Business Model - SaaS
5. JW Player Footprint: 5%+ of all video on the web
5. Data @ JW Player
1. 1Bn video hours consumed per month
2. 1Bn unique viewers per month
3. 5MM analytics events per minute
4. 3TB of logs per day
6. Data teams at JW Player
1. Pipelines - ingestion, pipelines, infrastructure
2. Discovery - recs & search in production
3. Insights - customer dashboards
4. Media Intelligence - media metadata extraction
5. Data Science - R&D, instrumentation, predictive modeling
8. JW Player is breaking up its monolith
1. JW Player is moving to a Service Oriented Architecture (SOA)
2. SOA promotes loose coupling between services
3. Part of the roadmap is to break up our monolithic database into separate datastores for faster iteration
9. Some services don’t work under SOA
1. Our data services depend on syncing Elasticsearch with numerous tables from the monolith
2. Traditional API-style architecture doesn’t work for indexing data across many sources or for monitoring data changes:
a. Hard to know when, how, and what changed
b. Hard to maintain consistency
c. Hard to scan the entire dataset
10. Our Mission
We need the ability to perform both iterative updates and full rebuilds of recommendations, simply and efficiently.
12. The Monolog
1. The New York Times solved this problem with a log-based architecture
2. CMSs write to Kafka first, from which other services read and build
3. “Mono” because everything is written to a single Kafka topic and partition
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
13. The simplicity of logs
The simplest possible storage abstraction: an append-only, totally-ordered sequence of records, ordered by time.
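To make the abstraction concrete, here is a toy in-memory log in Python (purely illustrative; the Log class and record shapes are ours, not Kafka's):

# A minimal append-only log: each record gets the next offset,
# and reads replay the sequence in order from any offset.
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, from_offset=0):
        return enumerate(self._records[from_offset:], start=from_offset)

log = Log()
log.append({"media_id": 1, "title": "intro"})
log.append({"media_id": 1, "title": "intro (updated)"})
for offset, record in log.read():
    print(offset, record)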
14. Apache Kafka: distributed logs
1. Distributed and fault tolerant
2. Stores full history
3. Can replay from beginning
4. Supports log compaction
5. Clients in many languages: JVM, Python, Go
# hello world in Kafka
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "my_consumer",
})
consumer.subscribe(["my_topic"])
while True:
    message = consumer.poll(1.0)  # returns None if nothing arrives before the timeout
    if message is None or message.error():
        continue
    process_message(message)
16. The Polylog
1. Fewer assumptions than Monolog
2. Can be multiple topics, partitions or clusters
3. Easier to scale
4. Ability to create a consistent view of denormalized data
17. Polylog components
1. Producers - populating The Polylog
a. Debezium
b. Custom
2. Storage - Kafka
3. Intermediate processors
a. Denormalizer
b. Custom
4. Consumers - consuming from The Polylog
19. Debezium: read logs from the database
1. Reads operation logs from various databases (MySQL, Postgres, Mongo, etc.) and writes them to Kafka
2. Minimal setup - registering a connector is one REST call (sketch below)
3. Every table becomes a topic
4. Handles schema changes
5. Configuration options (e.g. table whitelist, column blacklist)
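A minimal sketch of registering a Debezium MySQL connector with the Kafka Connect REST API. The hostnames, credentials, and table names are made up, and the whitelist/blacklist keys shown match older Debezium releases (newer ones rename them to table.include.list / column.exclude.list):

# Register a Debezium MySQL connector via the Kafka Connect REST API.
import requests

connector = {
    "name": "monolith-connector",                  # hypothetical name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "monolith-db",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "monolith",        # topic prefix: monolith.<db>.<table>
        "table.whitelist": "jw.media,jw.media_tags",
        "column.blacklist": "jw.media.internal_notes",
        "database.history.kafka.bootstrap.servers": "my-kafka:9092",
        "database.history.kafka.topic": "schema-changes.monolith",
    },
}

resp = requests.post("http://my-connect:8083/connectors", json=connector)
resp.raise_for_status()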
20. Custom Producers
1. Debezium is not appropriate for all use cases
2. We have custom producers writing to the Polylog (sketch below):
a. Derived data (e.g. algorithm results)
b. Producers requiring business logic
c. Kafka as source of truth
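A custom producer can be only a few lines. Here is a sketch of one writing derived data (algorithm results) to a hypothetical recommendations topic, keyed so log compaction keeps the latest result per media item:

# Write derived records to the Polylog with confluent_kafka.
import json
import confluent_kafka

producer = confluent_kafka.Producer({"bootstrap.servers": "my-kafka:9092"})

def on_delivery(err, msg):
    # Called once the broker acknowledges (or rejects) the write.
    if err is not None:
        raise RuntimeError(f"delivery failed: {err}")

record = {"media_id": 42, "recommended": [7, 13, 99]}
producer.produce(
    "recommendations",
    key=str(record["media_id"]),     # compaction keeps the latest value per key
    value=json.dumps(record),
    on_delivery=on_delivery,
)
producer.flush()                     # block until the write is acknowledged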
21. Denormalizer: left joins on streams
1. Joins records across multiple topics (sketch below)
2. Creates full denormalized records (e.g. media with tags)
3. Generic schema
4. RocksDB state store with AWS S3 backup
5. Looking to open source it
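Since the Denormalizer isn't open source yet, here is a rough sketch of the idea, with a plain dict standing in for the RocksDB state store and made-up topic names and record shapes:

# Left join across two topics: buffer both sides by key locally,
# and emit a denormalized record whenever the left side is present.
import json
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "denormalizer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["media", "media_tags"])
producer = confluent_kafka.Producer({"bootstrap.servers": "my-kafka:9092"})

media, tags = {}, {}                     # dicts stand in for RocksDB

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    key = record["media_id"]
    if msg.topic() == "media":
        media[key] = record
    else:
        tags.setdefault(key, []).append(record["tag"])
    if key in media:                     # left join: media drives the output
        joined = dict(media[key], tags=tags.get(key, []))
        producer.produce("media_denormalized", key=str(key),
                         value=json.dumps(joined))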
23. Consumers: stream to other datastores
1. Read denormalized records
2. Transform them into the expected format
3. Write transformed records into another datastore (e.g. Elasticsearch) - see the sketch below
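A sketch of such a consumer targeting Elasticsearch (the index name, document transform, and 7.x-style client calls are assumptions):

# Stream denormalized records from Kafka into Elasticsearch.
import json
import confluent_kafka
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://my-es:9200"])
consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "es-indexer",
    "auto.offset.reset": "earliest",     # full rebuilds replay from the beginning
})
consumer.subscribe(["media_denormalized"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    doc = {"title": record["title"], "tags": record["tags"]}   # transform step
    es.index(index="media", id=record["media_id"], body=doc)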
26. 2. Kafka as primary source of truth
a. Write to Kafka first
b. Can have multiple consumers
c. At-least-once delivery guarantee (sketch below)
d. Guarantees consistency - avoids the dual-write problem
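The at-least-once guarantee follows from committing offsets only after processing. A minimal sketch (process_message is a placeholder, as in the earlier hello-world example):

# Disable auto-commit and commit each offset only after the record
# has been fully processed; a crash replays the record instead of
# losing it (at-least-once).
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "my_consumer",
    "enable.auto.commit": False,
})
consumer.subscribe(["my_topic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process_message(msg)                              # side effects first...
    consumer.commit(message=msg, asynchronous=False)  # ...then commit the offset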
27. 3. Database migrations
a. Avoid dual write issues!
b. Stand up new service while old service still active
c. Seamless switch - no hard cutover
29. 5. Disaster recovery and fault tolerance
a. Kafka retention means we have an audit trail we can replay (sketch below)
b. Examples:
➢ Accidentally overwriting data in an upstream database
➢ Debugging how data changed over time
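Replaying history for recovery needs no special tooling: a consumer with a fresh group and auto.offset.reset set to earliest rereads everything Kafka has retained. Group and topic names here are illustrative:

# Rebuild state by replaying a topic from the oldest retained record.
import confluent_kafka

consumer = confluent_kafka.Consumer({
    "bootstrap.servers": "my-kafka:9092",
    "group.id": "rebuild-after-incident",   # new group => no committed offsets
    "auto.offset.reset": "earliest",        # so consumption starts at the beginning
})
consumer.subscribe(["media_denormalized"])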
30. 6. New services based on other services’ datasets
a. “Don’t be a salmon!” - don’t talk directly to upstream services
b. The Polylog is a single data source that multiple consumers can work off of
c. Helps when you need a service that can’t rely on basic API calls
32. Use log-based architectures!
1. Build data models from disparate data sources
2. Kafka as primary source of truth
3. Database migrations for SOA
4. Data change monitoring
5. Disaster recovery and fault tolerance
6. Building new services based on other services’ full datasets