Christopher Keller works as a solutions architect at NASA Ames researching the use of Cassandra to store and analyze security event data. He set up a 3 node Cassandra cluster on virtual machines and found it provided fast writes, no single point of failure, and flexibility as requirements evolved. Through trial and error, he optimized the data schema to match the questions they plan to ask. While there were some bugs encountered, Cassandra overall proved capable of handling their workload.
3. WHO AM I?
• a CSC solutions architect working at the advanced
supercomputing facility (NAS) at NASA Ames in silicon valley
• consulted at various federal agencies during the tech boom of
the 90’s
• classified and unclassified
• http://about.me/christopherkeller
4. WHO I’M NOT
• a cassandra expert
• someone pushing a corporate agenda
5. ENVIRONMENT
• unix based enterprise (desktops, servers, supercomputers)
• heavy writes around the clock from incoming data, but far
fewer analytical reads
• we retain the data in a raw format, but it does not need to be
in a database (however we can easily load old data)
• we need flexibility as technology and our requirements evolve
over time
6. THE PROBLEM
• TL;DR- how to use all of our available data to make
supercomputing more secure for our customers
• replace a COTS security event management system
• poor query performance
• difficult to extend and integrate with our custom software
• pre-defined analytics were a big plus, but more overall
minuses for our environment
8. WHY CASSANDRA
• snapshotting for backups was lightning fast
• no single point of failure
• reads are fast, writes are faster
• i did research other solutions (couchbase, hbase, mongo, riak,
etc), but didn’t find anything compelling enough to trial
9. WHY CASSANDRA
• simple clustering = win
• availability + scalability + replication
• built in data expiration was key
• enabling technology that allowed us to ask new questions
10. IN THE BEGINNING...
• set up a virtualized three node cluster on a spare server
• wrote the cassandra equivalent of “hello world” to check
• replication / availability
• data expiration
• rough performance estimates
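A check like the one described above can be sketched without a live cluster. This is a minimal sketch under stated assumptions: `InMemoryColumnFamily` and `hello_world_check` are hypothetical names invented here, not part of any cassandra client, and the stand-in only imitates an insert with a TTL followed by a read. Against a real cluster the same write/read/expire sequence would go through a client library instead.

```python
import time


class InMemoryColumnFamily:
    """Hypothetical stand-in for a column family; NOT a real client class.

    It mimics only the two calls the check needs, insert(key, columns, ttl)
    and get(key), and hides columns whose TTL has elapsed.
    """

    def __init__(self, clock=time.time):
        self._clock = clock
        self._rows = {}  # row key -> {column name: (value, expires_at or None)}

    def insert(self, key, columns, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        row = self._rows.setdefault(key, {})
        for name, value in columns.items():
            row[name] = (value, expires_at)

    def get(self, key):
        now = self._clock()
        row = self._rows.get(key, {})
        live = {name: value for name, (value, expires_at) in row.items()
                if expires_at is None or expires_at > now}
        if not live:
            raise KeyError(key)  # no live columns left in this row
        return live


def hello_world_check(cf, advance_clock, ttl=60):
    """Write one expiring column, read it back, then confirm it expires."""
    cf.insert("hello", {"greeting": "world"}, ttl=ttl)
    readable_before = cf.get("hello") == {"greeting": "world"}
    advance_clock(ttl + 1)  # against a real cluster you would wait instead
    try:
        cf.get("hello")
        expired_after = False
    except KeyError:
        expired_after = True
    return readable_before, expired_after
```

The clock is injected so the expiration step can be exercised instantly instead of sleeping for the full TTL.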
11. ARE YOU KIDDING ME?
• selling cassandra to management was easier than i thought
• the NAS is very receptive to new technology even though we
prefer to be system integrators rather than developers
• my testing showed that cassandra works...shocking!!!
• open source resources are good, DataStax being able to
provide support after i leave is better
12. TAX DOLLARS AT WORK
• bought five servers for around 22.5k
• 3 of them for our production cluster
• 1 for our data parsing and loading
• 1 for our analytics
• those were our only purchases, the rest has been primarily my
labor hours
14. TAKEAWAY
• bare metal > virtualized w/ assigned disks > fully virtualized
• match your hardware to your environment, expertise, and
requirements
15. CURRENT CLUSTER
• gentoo running xen 4.1.2 & apache cassandra 1.1.3
• three virtual nodes per physical server
• 7 cpu’s, 15gig RAM, 1.2 TB disk
• eight disks per physical server
• 2 running the hypervisor + OS in a RAID 1
• 2 disks per virtual machine in a RAID 0
16. ELAPSED TIME
• empty rack to benchmarks took about five days over the
course of christmas/new years 2011
• very helpful to understand our hardware limits and how
cassandra scaled
• understanding how to model the data and effectively use
cassandra took a lot longer...i’m still learning
17. HELPFUL TIPS
• always start with the questions you plan to ask the data
• if you know these your job just got exponentially easier
• if you never deviate from this, you’re lucky
• once you realize how powerful cassandra is, you’ll figure out
new questions that may change things
• don’t use supercolumns
18. MAINTENANCE
• i haven’t done serious sys-admin in years...had to develop tools
from scratch
• cluster start up and shutdown scripts
• use good CM software (we use puppet)
• OS, Cassandra & JVM upgrades
• cassandra-env.sh & cassandra.yaml
19. TRIAL AND ERROR
• a lot of testing dealing with how to organize the data
• secondary indexes
• materialized views
• i’d get failures and errors in cassandra that were solved by
changing the schema to be more efficient (based on our
questions)
• try not to think relationally, it wasn’t helping me
20. THIS WORKED...POORLY
uid  name    age  gender
1    chris   39   male
2    jaeden  2    male

uid  job        hobby
1    architect  jiu-jitsu
2    toddler    gaming

uid  employer  phone       address
1    csc       5555555555  123 Main St
2    mom       4444444444  123 Main St
21. THIS WORKED WELL
row key    column: value
architect  1234: json blob {"age":"39","name":"chris","gender":"male"...}
toddler    1235: json blob {"age":"2","name":"jaeden","gender":"male"...}

row key    column: value
male       4567: json blob, 7364: json blob
chris      3453: json blob
jaeden     4554: json blob
22. WHY DID THAT WORK
• we only have to query a single table
• aren’t you glad you optimized the schemas for the questions
ahead of time?
• manual joins by reading successive column families resulted in
timeout errors even though the cluster was idle and
everything was on the same switch segment
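The idea behind the layout that worked can be sketched in plain python: write the whole record as a json blob under every row key you expect to ask about, so each question is a single row read with no join. This is an in-memory illustration only. `rows`, `store`, and `ask` are hypothetical names, and using one id per record as the column name is an assumption made here to keep the sketch short.

```python
import json

# Hypothetical in-memory model of one column family:
# row key -> {column name -> json blob}
rows = {}


def store(record_id, record, index_fields):
    """Write the full record as a json blob under one row per indexed value."""
    blob = json.dumps(record, sort_keys=True)
    for field in index_fields:
        row_key = record[field]  # e.g. "architect", "male", "chris"
        rows.setdefault(row_key, {})[record_id] = blob


def ask(row_key):
    """Answer a question with a single row read; no joins across tables."""
    row = rows.get(row_key, {})
    return [json.loads(blob) for _, blob in sorted(row.items())]


store(1234, {"name": "chris", "age": "39", "gender": "male",
             "job": "architect"}, ["job", "gender", "name"])
store(1235, {"name": "jaeden", "age": "2", "gender": "male",
             "job": "toddler"}, ["job", "gender", "name"])
```

Here `ask("male")` returns both records from one row, while `ask("architect")` returns only the first. The cost is that every change to a record has to be rewritten under each row key it was stored under.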
23. LESSONS LEARNED
• if your data changes frequently, de-normalization is annoying,
but can be solved with discipline
• give yourself a lot of experimentation time if you’re new to
cassandra
• if you are hitting problems...likely you’re doing it wrong
24. TECHNICAL TIPS
• use ‘-pr’ to repair each node at least every gc_grace_seconds
• script which staggers weekly repairs across each node
• once you assign a token ID, you can remove it from
cassandra.yaml and keep the same file across nodes
• you are free to use the Thrift bindings for the language of your
choice, but save yourself time and use a high level client (eg
Java, Python, Scala, PHP, Erlang, etc)
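The staggering idea can be sketched as a small schedule builder. This is a sketch under assumptions, not the script from the talk: `staggered_repair_plan` and the node names are hypothetical, the 864000 second value is cassandra's default gc_grace_seconds (ten days), and the commands are only built as strings, not executed.

```python
def staggered_repair_plan(nodes, gc_grace_seconds=864000, interval_days=7):
    """Spread one 'repair -pr' per node evenly across the repair interval.

    Returns a list of (day_offset, command) pairs. Raises ValueError if the
    interval would let a node go longer than gc_grace_seconds between repairs.
    """
    if not nodes:
        raise ValueError("need at least one node")
    if interval_days * 86400 > gc_grace_seconds:
        raise ValueError("repair interval exceeds gc_grace_seconds")
    step = float(interval_days) / len(nodes)  # days between consecutive nodes
    return [(round(i * step, 2), "nodetool -h {0} repair -pr".format(node))
            for i, node in enumerate(nodes)]


# Hypothetical host names, for illustration only.
for day, command in staggered_repair_plan(["node1", "node2", "node3"]):
    print("day {0}: {1}".format(day, command))
```

With `-pr` each node repairs only its primary range, so every node has to appear in the plan for the whole ring to be covered.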
25. HOW I SPENT MY TIME
• i’d spend a few hours writing code to load data into
cassandra, then another few hours writing code to retrieve it
• the data browsers aren’t great and are unhelpful with blobs
• then i’d profile the performance, tweak the code, tweak the
schema, reload the data and repeat until i was happy
26. ANALYTICS
• all server side analytics are developed in python using
subprocesses for parallel performance
• pycassa is our cassandra client library
• our web layer is currently ruby on rails, but we might end up
going with django to stay language consistent
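One way to fan analytics out over subprocesses is sketched below. It is an assumption about the general pattern, not the NAS code: `count_by_source`, the `src` field, and the worker script are hypothetical, and events are passed as a command-line argument only because the example is tiny. In the setup described above each worker would presumably read its own slice from cassandra.

```python
import collections
import json
import subprocess
import sys

# Code each child process runs: parse the JSON list passed as argv[1]
# and print a JSON object of event counts per source.
WORKER = """
import collections, json, sys
events = json.loads(sys.argv[1])
print(json.dumps(collections.Counter(e["src"] for e in events)))
"""


def count_by_source(events, workers=3):
    """Split events into chunks, count each chunk in its own subprocess."""
    chunks = [events[i::workers] for i in range(workers)]
    # Start every child first so they all run at the same time.
    procs = [subprocess.Popen([sys.executable, "-c", WORKER, json.dumps(chunk)],
                              stdout=subprocess.PIPE, universal_newlines=True)
             for chunk in chunks]
    total = collections.Counter()
    for proc in procs:
        out, _ = proc.communicate()  # wait for this child and read its result
        if proc.returncode != 0:
            raise RuntimeError("worker exited with %d" % proc.returncode)
        total.update(json.loads(out))  # add this chunk's counts to the total
    return dict(total)
```

Starting all children before collecting any output is what makes the work overlap; merging the per-chunk counters afterwards is cheap compared to the counting itself.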
27. SHOW STOPPERS
• dealing with an incredibly annoying JMX recurring crash but it
doesn’t seem to affect cassandra stability
• other cassandra sites haven’t seen this, so it may just be a
consequence of java6 on gentoo
• commitlog_total_space_in_mb was being ignored in 1.1.3 (FIXED)
28. RECENT SHOW STOPPERS
• 1.1.3 accidentally removed the ability to drop column families
• pick your poison - full disks or data that never goes away
• recent v6 JVM patches required setting per-thread stack sizes to 180k
• nodes were up individually, zero log errors, gossip is up, but
the nodes weren’t talking collectively
• cassandra solves a need, but bugs like this make my customers
wary
30. SHOUT OUT
• the folks at datastax have been very helpful
• Tyler Hobbs (cassandra developer)
• Darren Sack (accounts)
• Michael Shaler (biz dev)
• everyone in #cassandra on irc.freenode.org