From Postgres
to Cassandra
In four easy steps
Axel Eirola <axel.eirola@f-secure.com>
Jarrod Creado <jarrod.creado@f-secure.com>
Agenda
1. Postgres
2. Cassandra
3. ???
4. Profit
0. Context
Categorizing the internet
Hundreds of millions of URLs
Data size in the terabytes
Reputation metadata:
Categories: adult, gambling, …
Safety: malicious, safe, …
Automatic processing
(re)Processing hundreds of thousands of URLs per day
Computation divided among multiple services, each with
multiple instances
Downtime not an option
Manual research
Data mining capabilities
Researching (aimlessly poking around)
Reporting
1. Postgres
BCNF up in this
Planned for storage, not queries
Highly normalized
Stiff schema, hard to add more fields
Sharding like a boss
Segmenting the URL keyspace
One (or more) box for each segment
Difficult to add more capacity
We got eight single points of failure
Upgrading means downtime
Index all the things
Building queries is hard due to the structure of the schema
Managing indices for those queries is hard
Abstracting the mess away from the user is also hard
2. Cassandra
Easy management
Easy scaling up as more data is stored
Out of the box:
Replication
Pagination
Load balancing
Less downtime during upgrades
TTL
Mapping data
Structure of our data is suitable for NoSQL
Mostly based around single URLs
Given a URL, fetch metadata
Got queries?
Cassandra schema designed for fixed pattern access
performed by automation
Human free-form searches offloaded to Elasticsearch
Load on one doesn't affect the other
Denormalize
Provide fixed pattern access for automation
Relations become ranges in the column namespace
This is pre-CQL, so we are doing collections the old-school way
Minimize the amount of read-then-write scenarios
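A rough sketch of the idea, with a plain Python dict standing in for a Cassandra wide row (names like `(c)_` are illustrative, not our exact column prefix; in the real pre-CQL schema these are plain Thrift column inserts, no read of the existing row needed):

```python
# Sketch: relations become ranges in the column namespace.
# A dict stands in for one wide row in the Url column family.

def add_category(url_rows, url_key, url, category, timestamp):
    """Blind write: adding a category never requires read-then-write."""
    row = url_rows.setdefault(url_key, {"url": url})
    # Each related category becomes a column whose name carries a
    # shared prefix, so all categories for a URL form one contiguous
    # column range within the row.
    row["(c)_" + category] = timestamp

url_rows = {}
add_category(url_rows, "k1", "example.com", "gambling", 1001)
add_category(url_rows, "k1", "example.com", "adult", 1002)

# Fetching all categories for a URL is a prefix range read on one row.
cats = {name[4:]: ts for name, ts in url_rows["k1"].items()
        if name.startswith("(c)_")}
```

The point of the prefix is that a single row read answers "all categories for this URL" without touching any other table.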
Postgres

Url_Category        Url        Category
------------        ---        --------
url_key             key        key
category_key        url        name
timestamp

Cassandra

Url
  row_key: <url_key>
  columns: url = <url>
           (c)_<category_name> = <timestamp>

Category
  row_key: <category_name>
  columns: <url_key> = <empty>
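The mapping from the three Postgres tables to the two column families could be sketched like this (hypothetical sample data, plain dicts in place of column families):

```python
# Sketch: denormalizing the normalized Postgres tables into the
# Url and Category column families.

urls = {"k1": "example.com"}          # Url table: key -> url
categories = {"c1": "gambling"}       # Category table: key -> name
url_category = [("k1", "c1", 1001)]   # Url_Category join table

url_cf, category_cf = {}, {}
for url_key, cat_key, ts in url_category:
    name = categories[cat_key]
    # Url CF: one row per URL, categories as prefixed columns.
    row = url_cf.setdefault(url_key, {"url": urls[url_key]})
    row["(c)_" + name] = ts
    # Category CF: the reverse index, one row per category name,
    # URL keys as column names with empty values.
    category_cf.setdefault(name, {})[url_key] = ""
```

The join table disappears: its rows become columns in both directions, so each fixed-pattern query is a single row read.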
3. ???
Going into production before going into production
DAL (data access layer) abstracts away the split databases
Implement new features in Cassandra only
Get a feel of Cassandra before taking it into full use
A tale of two databases
Run both databases in parallel
Writes:
New data, and updates, into both databases
Blind writes makes it easy to do partial updates

Reads:
Reads from both databases, cross-validate responses
Easy to move responsibilities from one database to another
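The dual-database DAL idea could be sketched like this (the store interface and `MemStore` are hypothetical stand-ins for the real database clients):

```python
import logging

class MemStore:
    """Stand-in for a real database client (hypothetical interface)."""
    def __init__(self):
        self.rows = {}

    def put(self, key, data):
        self.rows[key] = data

    def get(self, key):
        return self.rows.get(key)

class DualStoreDAL:
    """Writes go to both stores; reads cross-validate and serve one."""
    def __init__(self, postgres, cassandra, authoritative="postgres"):
        self.stores = {"postgres": postgres, "cassandra": cassandra}
        self.authoritative = authoritative

    def put(self, key, data):
        # Blind write into both databases.
        for store in self.stores.values():
            store.put(key, data)

    def get(self, key):
        # Read from both, cross-validate, serve the authoritative answer.
        results = {name: s.get(key) for name, s in self.stores.items()}
        if results["postgres"] != results["cassandra"]:
            logging.warning("cross-validation mismatch for %r", key)
        return results[self.authoritative]

dal = DualStoreDAL(MemStore(), MemStore())
dal.put("k1", {"safety": "safe"})
```

Flipping `authoritative` is what makes it easy to move read responsibility from one database to the other.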
Migration boiled down to this
1. Dump URL keys from Postgres into batches
2. Custom migration script to chew through a batch; for each URL in
the batch:
2.1. Read data from Postgres
2.2. Delete the existing Cassandra row for the URL
2.3. Write fresh data from Postgres into Cassandra

3. Log failing URLs
4. Cross-validate on reads for a while to ensure successful
migration
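The per-batch loop (steps 2.1–2.3 and 3) could be sketched like this; `read_pg`, `delete_row` and `write_row` are hypothetical callables wrapping the actual database clients:

```python
import logging

def migrate_batch(url_keys, read_pg, delete_row, write_row):
    """Migrate one batch of URL keys; return the keys that failed."""
    failed = []
    for url_key in url_keys:
        try:
            data = read_pg(url_key)   # 2.1 read data from Postgres
            delete_row(url_key)       # 2.2 drop any stale Cassandra row
            write_row(url_key, data)  # 2.3 write fresh data into Cassandra
        except Exception:
            logging.exception("migration failed for %s", url_key)
            failed.append(url_key)    # 3. log failing URLs for a re-run
    return failed
```

Deleting before writing makes the per-URL step idempotent, which is what makes the migration safe to repeat.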
4. Profit
Bro-tips
Decide what you don't want to migrate
Dry run while testing, keep an eye on the performance
Start in small batches, and verify the results before proceeding
Parallelize the batches, if you need to speed it up
Keep an eye on performance, throttle if necessary
Not everything goes as planned, so make it easy to repeat the
migration
Make sure the cluster is prepared for the migration, reserve
time to tweak if not
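A minimal sketch of a throttled, repeatable batch runner along those lines (batch size and delay are illustrative knobs; `migrate_batch` is any per-batch migration function that returns the failed keys):

```python
import time

def run_batches(url_keys, migrate_batch, batch_size=100, delay=0.0):
    """Run the migration in small batches; return failed keys for a retry."""
    failed = []
    for i in range(0, len(url_keys), batch_size):
        failed += migrate_batch(url_keys[i:i + batch_size])
        # Throttle between batches if the cluster starts to struggle.
        time.sleep(delay)
    return failed
```

Returning the failed keys is what makes a later run over just those keys cheap, instead of redoing everything.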
Kiitos (Thank you!)

Helsinki Cassandra Meetup #2: From Postgres to Cassandra