Apache Kafka® Delivers a Single Source of Truth for The New York Times

The Source of Truth
2018-05-09
The New York Times
Why The New York Times
Stores Every Piece of Content
Ever Published
in Kafka

Boerge Svingen was a founder of Fast Search &
Transfer (alltheweb.com, FAST ESP). He was later a
founder and CTO of Open AdExchange, doing
contextual advertising for online news. He is now
working on search and backend platforms at The
New York Times.
Boerge Svingen
Director of Engineering, The New York Times

Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.

Boerge Svingen
Director of Engineering
at the New York Times,
working on backend systems.

Topic:
How is published content made
available to applications and
services?

[Diagram: producers of content (CMS, CMS, Archives, etc.) and consumers of content (Search, Personalization, Collections, etc.)]

[Diagram: producers of content (CMS, CMS, Archives, etc.); Kafka; Gateway; consumers of content (Search, Personalization, Collections, GraphQL API, etc.); Web, iOS, Android, etc.]

Agenda
1. A little history
2. How things used to work
3. Log-based architectures
4. The schema
5. The Monolog
6. The Skinny Log
7. Some challenges

Source: http://www.nytimes.com/1865/04/15/news/president-lincoln-shot-assassin-deed-done-ford-s-theatre-last-night-act.html

The New York Times Company Archives

[Diagram: producers of content (CMS, CMS, Archives, etc.) and consumers of content (Search, Personalization, Collections, etc.)]

A rather typical API-based
architecture.

Disadvantages of this approach …
The consumers have to know about all the
producers of content.

Producer APIs have to live forever.

Every API tends to be different.

Every API tends to return data with a
different (implicit) schema.

We have no efficient way of reading old
content in bulk, so it’s hard to replace service
stores.

Most services have to manage permanent
state.

It is difficult to change the (non-existent)
schema, leading to inconsistencies and
duplication.

We get monoliths that try to be everything for
everyone.

It’s hard to develop new products and change
current ones.

We wanted to decouple
producers and consumers of
content

Make the log the
source of truth

First covered by Martin Kleppmann in
"Turning the database inside-out with Apache Samza"
and Designing Data-Intensive Applications.

The database becomes derived
The database is now secondary, and can be
recreated by replaying the log at any time.

Always be replaying
A consumer can start consuming at any point
in the log, and just keep going.

Always be replaying
This means that there’s no distinction
between reading archive events and live
events.

Materialized views
Every consumer can now have its own,
custom, materialized view of the log.

Materialized views
The database schema can be specific to the
needs of each consumer.
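
A minimal sketch of these ideas in Python, using an in-memory list as a stand-in for the Kafka log. The event shape, the `replay` helper, the short example URIs, and both view builders are assumptions made for illustration; they are not taken from the talk.

```python
from collections import defaultdict

# Stand-in for the log: an ordered list of (uri, asset) events.
# The URIs and field names below are hypothetical.
LOG = [
    ("nyt://article/1", {"headline": "First", "section": "world"}),
    ("nyt://article/2", {"headline": "Second", "section": "arts"}),
    ("nyt://article/1", {"headline": "First, updated", "section": "world"}),
]


def replay(log, start_offset=0):
    """Yield (offset, uri, asset) from start_offset to the end of the log."""
    for offset in range(start_offset, len(log)):
        uri, asset = log[offset]
        yield offset, uri, asset


def build_latest_view(log):
    """View 1: the latest version of every asset, keyed by URI."""
    view = {}
    for _, uri, asset in replay(log):
        view[uri] = asset
    return view


def build_section_view(log):
    """View 2: the set of asset URIs seen in each section."""
    view = defaultdict(set)
    for _, uri, asset in replay(log):
        view[asset["section"]].add(uri)
    return dict(view)


latest = build_latest_view(LOG)
by_section = build_section_view(LOG)
```

Both views are derived from the same log and can be thrown away and rebuilt at any time. Archive events and live events go through the same `replay` loop.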

Why not PubSub or AWS SNS/SQS/Kinesis?
For most use cases, Kafka is used as a
message broker, and the log is an
implementation detail.

Why not PubSub or AWS SNS/SQS/Kinesis?
For this use case, the log is the point.

Why not PubSub or AWS SNS/SQS/Kinesis?
Two requirements:
1. We must retain the log forever.
2. Since messages have causal relationships,
consumption must be ordered.

Why not PubSub or AWS SNS/SQS/Kinesis?
Only Kafka gives us an
ordered log with infinite
retention.

Protobufs
All assets are published to Kafka as
protobuf binaries.

The schema.
All published assets are identified by a URI:
nyt://article/186faf12-24a0-4dda-b737-018cee0b81cd
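
A small sketch of how a consumer might split such a URI into an asset type and an identifier. The function name is hypothetical, and treating the identifier as a UUID is an assumption based only on the example above.

```python
import uuid
from urllib.parse import urlparse


def parse_asset_uri(uri):
    """Split an asset URI into (asset_type, asset_id)."""
    parts = urlparse(uri)
    if parts.scheme != "nyt":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    asset_type = parts.netloc          # e.g. "article"
    asset_id = parts.path.lstrip("/")  # the part after the type
    # Assumption: the identifier is a UUID, as in the example on the slide.
    uuid.UUID(asset_id)                # raises ValueError if malformed
    return asset_type, asset_id


asset_type, asset_id = parse_asset_uri(
    "nyt://article/186faf12-24a0-4dda-b737-018cee0b81cd"
)
```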

Protobufs
The schema has lots of different types
of messages.

Article Video Image
Slideshow Interactive Playlist
Person Title Organization
Subject Location
Program Promo
Package List

Protobufs
A single Event type is used for all messages,
wrapping the actual asset.

Protobufs
Assets are published in normalized form.

[Diagram: a graph of normalized assets: Article 1, Article 2, Byline 1, Tag 1, Tag 2, Section 1, Section 2, Image 1, Image 2, Image 3]

Shared properties
Asset types have shared properties, but
protobufs don’t have interfaces.

Shared properties
We really want interfaces.

Shared properties
We use composition instead.
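
A rough illustration of composition, with Python dataclasses standing in for protobuf messages. The field names are a subset of those on the slides; the classes and the `headline_of` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CreativeWork:
    """Shared editorial properties (a subset of those on the slides)."""
    headline: str
    summary: str


@dataclass
class Article:
    creative_work: CreativeWork  # shared properties, embedded by composition
    body: str
    word_count: int


@dataclass
class Video:
    creative_work: CreativeWork  # the same shared message, reused
    transcript: str


def headline_of(asset):
    """Works for any asset type that embeds a CreativeWork."""
    return asset.creative_work.headline


article = Article(CreativeWork("Headline A", "Summary A"), "Body text", 2)
video = Video(CreativeWork("Headline V", "Summary V"), "Transcript text")
```

Because every asset type carries the shared message under the same field name, code such as `headline_of` can treat them uniformly without an interface.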

[Diagram: asset types composed from CreativeWork]
CreativeWork: summary, headline, tone, byline, desks, section, subsection, advertisingProperties
Article: creativeWork, publicationProperties, promotionalProperties, body, wordCount, translations, dateline
Video: creativeWork, renditions, playlists, transcript, distributionRights
Slideshow: creativeWork, slides
Interactive: creativeWork, source, credit, html, css, js

[Diagram: tag types composed from TimesTag]
TimesTag: displayName, description, vernacular, taggingRules
Subject: timesTag, retiredName, contactDetails
Location: timesTag, geoDetails, contactPoint, previousNames
Person: timesTag, tickerSymbols, contactDetails, previousNames
Title: timesTag, subType, displayName
Organization: timesTag, tickerSymbols, contactDetails, previousNames

[Diagram: PublicationProperties alongside the tag and asset types]
PublicationProperties: uri, url, firstPublished, lastModified, source, sourceId
Subject: timesTag, retiredName, contactDetails
Location: timesTag, geoDetails, contactPoint, previousNames
Person: timesTag, tickerSymbols, contactDetails, previousNames
Title: timesTag, subType, displayName
Organization: timesTag, tickerSymbols, contactDetails, previousNames
Article: creativeWork, body, wordCount, translations, dateline
Video: creativeWork, renditions, playlists, transcript, distributionRights
Slideshow: creativeWork, slides
Interactive: creativeWork, source, credit, html, css, js

message Article {
  PublicationProperties publicationProperties = 1; // Publication metadata
  CreativeWork creativeWork = 2;                   // Editorial metadata
  PromotionalProperties promotionalProperties = 3; // Promotional metadata
  PrintInformation printInformation = 4;           // Print metadata
  Dateline dateline = 5;                           // Where reporting for the article occurred
  int32 wordCount = 6;                             // The number of words of body text
  DocumentBlock body = 7;                          // The body of the article
}

Validation
All assets are validated before they are
published to Kafka.

Schema governance
A custom linter checks for forward and
backward compatibility.

Schema governance
Breaking changes are not allowed, like
deleting a field or changing a type.

Schema governance
The linter will warn against risky changes,
like changing a custom type.
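
The talk does not show the linter itself. As a sketch of the kind of check described, the function below compares two versions of a message and reports deleted fields and changed types. The schema representation (field number mapped to a name and type) and the `subtitle` field are assumptions for illustration.

```python
def lint_schema_change(old_fields, new_fields):
    """Compare two versions of a message.

    Each argument maps a field number to a (name, type) pair.
    Returns a list of human-readable errors for breaking changes.
    """
    errors = []
    for number, (old_name, old_type) in old_fields.items():
        if number not in new_fields:
            errors.append(f"field {number} ({old_name}) was deleted")
            continue
        _, new_type = new_fields[number]
        if new_type != old_type:
            errors.append(
                f"field {number} ({old_name}) changed type "
                f"from {old_type} to {new_type}"
            )
    return errors


old = {6: ("wordCount", "int32"), 7: ("body", "DocumentBlock")}
# Adding a new (hypothetical) field is allowed.
ok = {6: ("wordCount", "int32"), 7: ("body", "DocumentBlock"),
      8: ("subtitle", "string")}
# Changing a type and deleting a field are both breaking.
bad = {6: ("wordCount", "string")}

ok_errors = lint_schema_change(old, ok)
bad_errors = lint_schema_change(old, bad)
```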

“Virtual” schema team
We have a cross-functional team of people
across the organization who manage the
schema evolution.

The Monolog
Single partition, totally ordered, infinite
retention.
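
One way to express "single partition, infinite retention" is through standard Kafka topic settings. The helper below only builds a `kafka-topics.sh` command line; the topic name, broker address, and replication factor are hypothetical, and the actual configuration used at the Times is not given in the slides.

```python
def monolog_topic_command(topic, bootstrap_server, replication_factor):
    """Build a kafka-topics.sh invocation for a single-partition topic
    that never deletes data by age or by size."""
    return [
        "kafka-topics.sh", "--create",
        "--bootstrap-server", bootstrap_server,
        "--topic", topic,
        "--partitions", "1",                 # one partition: total order
        "--replication-factor", str(replication_factor),
        "--config", "retention.ms=-1",       # no time-based deletion
        "--config", "retention.bytes=-1",    # no size-based deletion
    ]


# Hypothetical topic name and broker address.
command = monolog_topic_command("monolog", "localhost:9092", 3)
```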

The Monolog
The Source of Truth for published content.

The Monolog
Contains everything published since 1851.

[Diagram: a graph of normalized assets: Article 1, Article 2, Dateline 1, Credit 1, Credit 2, Section 1, Section 2, Image 1, Image 2, Image 3]

Topological sort
Section 1
Byline 1
Tag 1
Tag 2
Image 1
Image 2
Image 3
Section 2
Article 2
Article 1
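
A sketch of such an ordering using Python's standard `graphlib` module. The slides do not list which asset references which, so the edges below are hypothetical; the point is only that every referenced asset comes before the asset that references it.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each asset maps to the assets it references (hypothetical edges).
references = {
    "Article 1": {"Byline 1", "Tag 1", "Section 1",
                  "Image 1", "Image 2", "Article 2"},
    "Article 2": {"Tag 2", "Section 2", "Image 3"},
}

# static_order() yields every node after all of the nodes it depends on.
publish_order = list(TopologicalSorter(references).static_order())
```

A consumer reading the log in this order never sees a reference to an asset it has not already received.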

Section 1
Dateline 1
Credit 1
Credit 2
Image 1
Image 2
Image 3
Section 2
Article 2
Article 1
Image 2, version 2
Credit 2, version 2

The Skinny Log
A Kafka topic containing notifications
signifying some system acting on a Monolog
event.

The Skinny Log
A Skinny Log message says, “I processed
event X at time Y”.

The Skinny Log
Notifications are acted on by downstream
systems.

The Skinny Log
Example: A list service consumes a publish
for an asset, updates the list, and posts a
notification saying that the list has been updated.

Active Cache Invalidation
The Skinny Log is used for active cache
invalidation.

Active Cache Invalidation
A notification saying that an asset has been
updated is used by a downstream system to
invalidate its cache.
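
A minimal sketch of an invalidator reacting to Skinny Log notifications. The notification fields, the system names, and the dictionary used as a cache are all assumptions; the slides only say that a notification means "I processed event X at time Y".

```python
from dataclasses import dataclass


@dataclass
class Notification:
    """'I processed event X at time Y' (field names are hypothetical)."""
    system: str        # which system processed the event
    asset_uri: str     # which asset the event was about
    processed_at: float


class Invalidator:
    """Drops cached entries when an upstream system reports an update."""

    def __init__(self, cache, upstream_system):
        self.cache = cache
        self.upstream_system = upstream_system

    def handle(self, note):
        # Ignore notifications from systems this cache does not depend on.
        if note.system != self.upstream_system:
            return False
        # pop() with a default is a no-op when the asset is not cached.
        return self.cache.pop(note.asset_uri, None) is not None


cache = {"nyt://article/1": "<rendered page>"}
invalidator = Invalidator(cache, upstream_system="asset-store")
hit = invalidator.handle(Notification("asset-store", "nyt://article/1", 1.0))
miss = invalidator.handle(Notification("search", "nyt://article/1", 2.0))
```

Waiting for the upstream system's notification, rather than for the publish event itself, means the cache is only cleared once fresh data is available to refill it.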

[Diagram, built up over several slides: CMS, Monolog, Gateway, Asset store, GraphQL API, Web, Fastly, Skinny log, Skinny gateway, Invalidator, Invalidator]

Metrics/SLOs
The Skinny Log has a complete record of
when every update is processed by every
system.

Metrics/SLOs
We use the Skinny Log to calculate metrics
and check our SLOs.

Metrics/SLOs
For example: how long does it take from when
something is published until it's available on
the site, or in a search result?
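
As an illustration of how such a metric could be derived, the sketch below computes per-event latency between two systems from a list of notifications. The record layout, the system names (including a `publish` entry marking the moment of publication), and the thresholds are hypothetical.

```python
# Each record: (system, event_id, processed_at_seconds).
# System names and the record layout are hypothetical.
skinny_log = [
    ("publish", "event-1", 100.0),
    ("asset-store", "event-1", 100.5),
    ("search", "event-1", 102.5),
    ("publish", "event-2", 200.0),
    ("search", "event-2", 201.0),
]


def latencies(records, source, target):
    """Seconds from `source` processing an event to `target` processing it."""
    started = {e: t for s, e, t in records if s == source}
    return {
        e: t - started[e]
        for s, e, t in records
        if s == target and e in started
    }


def slo_met(values, threshold_seconds):
    """True when every observed latency is within the threshold."""
    return all(v <= threshold_seconds for v in values)


search_latency = latencies(skinny_log, "publish", "search")
```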

Timestamps before 1970
Do not work in Kafka.

Timestamps before 1970
KIP-228
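
To see why an archive going back to the 1800s runs into this, the sketch below converts a date to milliseconds since the Unix epoch, the unit Kafka uses for record timestamps. Any date before 1970 produces a negative number. The helper name is hypothetical.

```python
from datetime import datetime, timezone


def epoch_millis(moment):
    """Milliseconds since 1970-01-01T00:00:00Z for an aware datetime."""
    return int(moment.timestamp() * 1000)


# Date taken from the URL of the 1865 article linked earlier in the slides.
lincoln = epoch_millis(datetime(1865, 4, 15, tzinfo=timezone.utc))
epoch = epoch_millis(datetime(1970, 1, 1, tzinfo=timezone.utc))
```

Here `lincoln` is negative, which is the case the slide says does not work.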

Failover
We want to fail over between Kafka clusters,
transparently, with consumers keeping their
offsets.

Auth on GCP
On GCP, all brokers are on the public
internet, requiring stronger auth.
