Helsinki Cassandra Meetup #2: From Postgres to Cassandra


- From PostgreSQL to Cassandra, In Four Easy Steps (Axel Eirola and Jarrod Creado, LabDev, F-Secure)
In this presentation, Axel and Jarrod tell the tale of our Network Reputation System's live migration from PostgreSQL to Cassandra.

The F-Secure Network Reputation System is a core element of the protection we provide to our customers.
It consists of URLs and other network-related metadata, used to make fast assessments of their reputation.
Currently the Network Reputation System database contains hundreds of millions of URLs.


More info about Cassandra @ F-Secure?
http://www.planetcassandra.com/blog/post/apache-cassandra-at-f-secure


Transcript

  1. From Postgres to Cassandra, in four easy steps. Axel Eirola <axel.eirola@f-secure.com>, Jarrod Creado <jarrod.creado@f-secure.com>
  2. Agenda: 1. Postgres, 2. Cassandra, 3. ???, 4. Profit
  3. 0. Context
  4. Categorizing the internet
     - Hundreds of millions of URLs
     - Data size in the terabytes
     - Reputation metadata: categories (adult, gambling, …) and safety (malicious, safe, …)
  5. Automatic processing
     - (re)Processing hundreds of thousands of URLs per day
     - Computation divided among multiple services, each with multiple instances
     - Downtime is not an option
  6. Manual research
     - Data mining capabilities
     - Researching (aimlessly poking around)
     - Reporting
  7. 1. Postgres
  8. BCNF up in this
     - Planned for storage, not queries
     - Highly normalized
     - Stiff schema; hard to add more fields
  9. Sharding like a boss
     - Segmenting the URL keyspace
     - One (or more) box for each segment
     - Difficult to add more capacity
     - We got eight single points of failure
     - Upgrading means downtime
  10. Index all the things (a query sketch follows the slide list)
     - Building queries is hard due to the structure of the schema
     - Managing indices for those queries is hard
     - The mess needs to be abstracted away from the user; this is also hard
  11. 2. Cassandra
  12. Easy management
     - Easy scaling up as more data is stored
     - Out of the box: replication, pagination, load balancing, TTL
     - Less downtime during upgrades
  13. Mapping data
     - Structure of our data is suitable for NoSQL
     - Mostly based around single URLs
     - Given a URL, fetch metadata
  14. Got queries?
     - Cassandra schema designed for fixed-pattern access performed by automation
     - Human free-form searches offloaded to Elasticsearch
     - Load on one doesn't affect the other
  15. Denormalize
     - Provide fixed-pattern access for automation
     - Relations become ranges in the column namespace
     - This is pre-CQL, so we are doing collections the old-school way
     - Minimize the amount of read-then-write scenarios
  16. Schema mapping (a pycassa sketch follows the slide list):
     Postgres (normalized tables):
       Url_Category: url_key, category_key
       Url: key, url, timestamp
       Category: key, name
     Cassandra (denormalized column families):
       Url: row_key = <url_key>; columns: url = <url>, (c)_<category_name> = <timestamp>
       Category: row_key = <category_name>; columns: <url_key> = <empty>
  17. 3. ???
  18. Going into production before going into production
     - A DAL (data access layer) abstracts away the split databases
     - Implement new features in Cassandra only
     - Get a feel for Cassandra before taking it into full use
  19. A tale of two databases
     - Run both databases in parallel
     - Writes: new data and updates go into both databases; blind writes make it easy to do partial updates
     - Reads: read from both databases and cross-validate the responses
     - Easy to move responsibilities from one database to the other (a dual-write sketch follows the slide list)
  20. Migration boiled down to this (a sketch follows the slide list):
     1. Dump URL keys from Postgres into batches
     2. Custom migration script to chew through a batch; for each URL in the batch:
        2.1. Read data from Postgres
        2.2. Delete the Cassandra row key for the URL
        2.3. Write fresh data from Postgres into Cassandra
     3. Log failing URLs
     4. Cross-validate on reads for a while to ensure a successful migration
  21. 4. Profit
  22. Bro-tips
     - Decide what you don't want to migrate
     - Dry run while testing, keeping an eye on performance
     - Start with small batches, and verify the results before proceeding
     - Parallelize the batches if you need to speed things up
     - Keep an eye on performance; throttle if necessary
     - Everything doesn't always go as planned, so make it easy to repeat the migration
     - Make sure the cluster is prepared for the migration, and reserve time to tweak it if not
  23. Kiitos ("thank you")
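
Why slide 10 hurt: even a basic "categories for a URL" lookup against the normalized schema of slide 16 needs two joins through the link table, and every such query needs its own indices. A minimal sketch; the psycopg2 client, the connection details, and the exact SQL are assumptions, only the table and column names come from the deck.

    import psycopg2

    # Connection details are placeholders; the real system sharded the URL
    # keyspace across several such servers (slide 9).
    conn = psycopg2.connect(host="pg-shard-1", dbname="reputation", user="nrs")
    cur = conn.cursor()

    # Two joins through the Url_Category link table, just to list the
    # categories (and the timestamp) for one URL.
    cur.execute("""
        SELECT c.name, u.timestamp
          FROM url u
          JOIN url_category uc ON uc.url_key = u.key
          JOIN category c      ON c.key = uc.category_key
         WHERE u.url = %s
    """, ("http://example.com/page",))
    print(cur.fetchall())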
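Slides 15 and 16 in code: the deck predates CQL, so a Thrift-era client such as pycassa matches the column layout. A sketch under assumptions — the keyspace name, server list, and string encodings are made up; the column families and the (c)_<category_name> naming follow slide 16.

    import time
    import pycassa

    pool = pycassa.ConnectionPool("reputation", server_list=["cass-1:9160"])
    url_cf = pycassa.ColumnFamily(pool, "Url")
    category_cf = pycassa.ColumnFamily(pool, "Category")

    url_key = "example.com/page"

    # Blind write, no read-then-write (slide 15): the relation to a category
    # becomes a '(c)_<category_name>' column on the URL row itself.
    url_cf.insert(url_key, {
        "url": "http://example.com/page",
        "(c)_gambling": str(int(time.time())),  # relation value = timestamp
    })
    # Reverse direction: the category row just collects URL keys.
    category_cf.insert("gambling", {url_key: ""})

    # The fixed-pattern access of slides 13-14: given a URL, fetch metadata.
    print(url_cf.get(url_key))

Because every category column shares the (c)_ prefix, the categories of a URL form one contiguous slice of its row, which is what slide 15 means by relations becoming ranges in the column namespace.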
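Slide 19's parallel run, sketched as a hypothetical dual-write DAL in plain Python. The pg and cass backends and their read/write interface are illustrative stand-ins, not the real DAL from slide 18.

    import logging

    log = logging.getLogger("dal")

    class DualDAL(object):
        """Front both databases during the parallel run (slide 19)."""

        def __init__(self, pg, cass, authoritative="pg"):
            self.pg = pg
            self.cass = cass
            self.authoritative = authoritative  # whose answer callers get

        def write(self, url_key, columns):
            # Blind writes into both databases; partial updates stay easy
            # because nothing is read back first.
            self.pg.write(url_key, columns)
            self.cass.write(url_key, columns)

        def read(self, url_key):
            pg_row = self.pg.read(url_key)
            cass_row = self.cass.read(url_key)
            # Cross-validate the responses; mismatches are logged, not fatal.
            if pg_row != cass_row:
                log.warning("mismatch for %s: pg=%r cass=%r",
                            url_key, pg_row, cass_row)
            return pg_row if self.authoritative == "pg" else cass_row

Moving responsibility from one database to the other is then a one-line flip of authoritative, with the cross-validation log showing whether the two databases actually agree.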
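Finally, slide 20's migration loop for one batch. read_from_postgres is a hypothetical flattening of the slide-16 relations into one column dict; dumping keys into batches, parallelism, and throttling (slide 22) are left out.

    import logging

    log = logging.getLogger("migration")

    def read_from_postgres(pg_conn, url_key):
        # Hypothetical mapping of the slide-16 tables to Cassandra columns;
        # the value encodings are assumptions.
        cur = pg_conn.cursor()
        cur.execute("SELECT url, timestamp FROM url WHERE key = %s", (url_key,))
        url, ts = cur.fetchone()
        columns = {"url": url}
        cur.execute("""
            SELECT c.name
              FROM category c
              JOIN url_category uc ON uc.category_key = c.key
             WHERE uc.url_key = %s
        """, (url_key,))
        for (name,) in cur.fetchall():
            columns["(c)_%s" % name] = str(ts)
        return columns

    def migrate_batch(url_keys, pg_conn, url_cf):
        # Steps 2.1-2.3 and 3 from slide 20, for one batch of URL keys.
        for url_key in url_keys:
            try:
                columns = read_from_postgres(pg_conn, url_key)  # 2.1 read
                url_cf.remove(url_key)                          # 2.2 drop stale row
                url_cf.insert(url_key, columns)                 # 2.3 write fresh data
            except Exception:
                log.exception("failed to migrate %s", url_key)  # 3. log failures

With pg_conn a psycopg2 connection and url_cf the pycassa ColumnFamily from the earlier sketch, repeating a failed batch is just calling migrate_batch again: the delete-then-write per URL keeps the operation safe to rerun, which is what slide 22's "make it easy to repeat migration" asks for.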
