Roman Komkov presented on Cassandra at Glogster. Glogster EDU is an online platform for presentation and interactive learning used by 19 million users, generating roughly 40,000 new glogs daily. Glogster has used Cassandra as a primary database since 2011, starting with version 0.6 or 0.8 and upgrading over time; it currently runs a 5-node cluster with an average node size of about 600 GB. Upgrades brought challenges around migration, repairs, and a data-loss incident, caused by decommissioning an old datacenter without proper hint handling, that took 10 days to recover from. Lessons included increasing the hint window, enabling parallel repairs, and scheduling regular backups and repairs.
1. Cassandra at Glogster
Roman Komkov – roman@glogster.com
System Engineer at Glogster
Prague Cassandra Meet up
03.09.2015
2. About me
2 years at Glogster EDU as System Engineer
5+ years of Linux administration
5+ years of Python development
Cluster, HA, Orchestration
CI, CD…
Twitter - @alkoengineering
GitHub, Freenode - decayofmind
3. About Glogster EDU
Started in 2009
Platform for presentation and interactive learning mainly
used by educators and students
19 million users
Over 45 million glogs
40,000 new glogs daily
Web service, mobile applications
http://edu.glogster.com
4. Cassandra at Glogster
From 2011 as primary DB for initial Glogster.com
From 2012 as backend (storage) DB for Glogster EDU
Started from 0.6… or 0.8, I guess…
10 nodes
RF=5, QUORUM
SATA disks
OrderPreservingPartitioner ¯\_(ツ)_/¯
6. Cassandra now
5 nodes cluster
~600 GB average node size
RF=5, QUORUM
SSD disks
VNodes
OrderPreservingPartitioner…
pycassa + datastax-driver
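The shrug after OrderPreservingPartitioner is earned: its tokens follow raw key order, so an application that writes sequential keys piles them all onto one node, while a hash-based partitioner spreads them evenly. A minimal plain-Python sketch of the difference (the 5-node ring, its boundaries, and the glogNNNNN keys are made up for illustration; MD5 stands in for RandomPartitioner's token function):

```python
import hashlib
from collections import Counter

NODES = 5

def random_partition(key):
    # RandomPartitioner-style: MD5 the key, then bucket the resulting
    # token onto one of NODES nodes
    token = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return token % NODES

def ordered_partition(key, boundaries):
    # OrderPreservingPartitioner-style: the token IS the key, and each
    # node owns a contiguous slice of the sorted key space
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries) - 1

# hypothetical ring: the key space split evenly by leading character
boundaries = ["e", "j", "o", "t", "~"]

# sequential IDs, the way an application often generates row keys
keys = ["glog%05d" % i for i in range(1000)]

print(Counter(random_partition(k) for k in keys))               # spread over all nodes
print(Counter(ordered_partition(k, boundaries) for k in keys))  # every key on one node
```

With ordered partitioning, every `glog…` key falls into the same range, so one node takes the whole write load; the hashed version lands on all five.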
8. 0.8 problems
Migration with downtime by transferring a copy of data
HintedHandoff hell
No repairs, no cleanups
Enormous HeapSize (20GB)
Clock drift between servers
SOLUTION!
Upgrade to 1.0
9. 1.1 problems
Cassandra guy left Glogster
Don’t touch it while it works
BUT…
Load averages like 14.0-16.0
2 disks failed
Everything is slow
Repairs? Never heard!
10. 1.1 solutions
Replace disks, rebuild nodes.
Don’t run repair on a new node as a substitute for replace_token
Move old Glogster.com keyspace to another cluster
Load gone
https://glogster.github.io/posts/2015/03/23/cassandra-migration.html
Nodes are fast again
Regular repairs and cleanups? Never did!
OpsCenter installed
Cluster upgraded to 1.2
11. 1.2 and migration
Cluster migrated to the new servers without downtime
http://www.planetcassandra.org/blog/cassandra-migration-to-ec2/
Vnodes
…
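Vnodes themselves are a one-line cassandra.yaml change, sketched here (256 became the default num_tokens when vnodes shipped in 1.2; confirm against your version):

```yaml
# cassandra.yaml: each node owns many small token ranges instead of one
num_tokens: 256
```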
13. The old datacenter, still connected to production, was cut off from the new datacenter
Forgot about Hints TTL (max_hint_window_in_ms ~ 3 hours)
Forgot to run repair on the cluster afterwards
Old DC was decommissioned
Application switched to the new one
…
DATA GONE
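The arithmetic behind the loss: a coordinator stops storing hints once a replica has been unreachable for longer than max_hint_window_in_ms, so only the first window of downtime is ever replayed. A sketch (the two-day outage figure is hypothetical, chosen only to illustrate the gap):

```python
# Hints cover at most the first max_hint_window_in_ms of a replica's
# downtime; everything written after that is lost unless repaired.
MS_PER_HOUR = 3_600_000

max_hint_window_in_ms = 3 * MS_PER_HOUR   # the ~3-hour default mentioned above
downtime_ms = 48 * MS_PER_HOUR            # hypothetical: DCs apart for two days

covered_ms = min(downtime_ms, max_hint_window_in_ms)
uncovered_hours = (downtime_ms - covered_ms) / MS_PER_HOUR
print("hours of writes hints will never replay:", int(uncovered_hours))  # 45
```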
14. Here the hell begins
~1200 glogs remained only on the old, decommissioned datacenter
Thank God we had RF = <N of nodes>
Transfer data from one old node to the new server
Run Cassandra on it, add node to the cluster
Run repair on entire cluster
Increase repair chance with read_repair_chance
Peacefully wait until done…
Run your complicated repairs through OpsCenter, because it can
resume if a repair fails.
18. Conclusions and Improvements
Increase the max_hint_window_in_ms value to something like 3 days
Make use of parallel repairs and other parallel operations
CQL3 and datastax-driver
Upgrade to Cassandra 2.2
faster repairs and other operations
New OpsCenter
Schedule regular backups and repairs
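The hint-window item is a one-line cassandra.yaml change (259200000 ms = 3 days; worth confirming against your version's defaults):

```yaml
# cassandra.yaml: keep storing hints for 3 days instead of ~3 hours
max_hint_window_in_ms: 259200000   # 3 * 24 * 3600 * 1000
```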
We still love Cassandra!