Cassandra Silicon Valley

REAL WORLD CASSANDRA AT NASA

Christopher Keller
December 13th, 2012

THANKS!
I failed to copy this to iCloud after the DC presentation

WHO AM I?

• a CSC solutions architect working at the advanced
supercomputing facility (NAS) at NASA Ames in Silicon Valley

• consulted at various federal agencies during the tech boom of
the 90’s

• classified and unclassified

• http://about.me/christopherkeller

WHO I’M NOT

• a cassandra expert

• someone pushing a corporate agenda

ENVIRONMENT

• unix based enterprise (desktops, servers, supercomputers)

• heavy writes around the clock from incoming data, but far
fewer analytical reads

• we retain the data in a raw format, but it does not need to be
in a database (however we can easily load old data)

• we need flexibility as technology and our requirements evolve
over time

THE PROBLEM

• TL;DR- how to use all of our available data to make
supercomputing more secure for our customers

• replace a COTS security event management system

• poor query performance

• difficult to extend and integrate with our custom software

• pre-defined analytics were a big plus, but there were more
minuses overall for our environment

WHY CASSANDRA

• snapshotting for backups was lightning fast

• no single point of failure

• reads are fast, writes are faster

• i did research other solutions (couchbase, hbase, mongo, riak,
etc.), but didn’t find anything compelling enough to trial

WHY CASSANDRA

• simple clustering = win

• availability + scalability + replication

• built in data expiration was key

• enabling technology that allowed us to ask new questions

IN THE BEGINNING...

• set up a virtualized three node cluster on a spare server

• wrote the cassandra equivalent of “hello world” to check

• replication / availability

• data expiration

• rough performance estimates
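The data expiration check can be pictured with a toy model. This is a minimal in-memory sketch, not the Cassandra API: `ExpiringStore`, its methods, and the injectable clock are hypothetical names, used only to illustrate the per-column time-to-live behavior that a "hello world" test like this would exercise.

```python
import time


class ExpiringStore:
    """Toy in-memory stand-in for a column family with per-column TTLs.

    Hypothetical illustration only; a real cluster handles expiration itself.
    """

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so the example is deterministic
        self._rows = {}       # row key -> {column name: (value, expires_at)}

    def insert(self, key, columns, ttl=None):
        """Write columns; with a ttl (in seconds) they vanish after that long."""
        expires_at = None if ttl is None else self._clock() + ttl
        row = self._rows.setdefault(key, {})
        for name, value in columns.items():
            row[name] = (value, expires_at)

    def get(self, key):
        """Return only the columns that have not expired yet."""
        now = self._clock()
        row = self._rows.get(key, {})
        return {
            name: value
            for name, (value, expires_at) in row.items()
            if expires_at is None or expires_at > now
        }
```

Writing one column with `ttl=60` and another without a TTL, then advancing the clock past 60 seconds, leaves only the second column visible to `get`.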

ARE YOU KIDDING ME?

• selling cassandra to management was easier than i thought

• the NAS is very receptive to new technology even though we
prefer to be system integrators rather than developers

• my testing showed that cassandra works...shocking!!!

• open source resources are good, DataStax being able to
provide support after i leave is better

TAX DOLLARS AT WORK

• bought five servers for around 22.5k

• 3 of them for our production cluster

• 1 for our data parsing and loading

• 1 for our analytics

• those were our only purchases, the rest has been primarily my
labor hours

[Figure: write operations. Operations (5000 to 30000) vs. elapsed time for 6 nodes, 9 nodes (v), and 9 nodes (p)]

[Figure: latency. Milliseconds vs. elapsed time for 6 nodes, 9 nodes (v), and 9 nodes (p)]

http://christophernkeller.tumblr.com/post/15242366864/cassandra-benchmarks

TAKEAWAY

• bare metal > virtualized w/ assigned disks > fully virtualized

• match your hardware to your environment, expertise, and
requirements

CURRENT CLUSTER

• gentoo running xen 4.1.2 & apache cassandra 1.1.3

• three virtual nodes per physical server

• 7 CPUs, 15 GB RAM, 1.2 TB disk

• eight disks per physical server

• 2 running the hypervisor + OS in a RAID 1

• 2 disks per virtual machine in a RAID 0
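Each node in a cluster of this vintage needs its own initial token. A minimal sketch of spacing them evenly, assuming the RandomPartitioner (token range 0 to 2**127) and nine nodes (three virtual machines on each of three production servers); the slides state neither the partitioner nor the tokens actually used, and `initial_tokens` is a hypothetical helper name.

```python
def initial_tokens(node_count):
    """Evenly spaced initial_token values, assuming RandomPartitioner (0..2**127)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ring_size = 2 ** 127
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Assumed layout: nine nodes, three virtual machines per physical server.
    for index, token in enumerate(initial_tokens(9)):
        print("node %d: initial_token: %d" % (index, token))
```

Integer division keeps every token an exact integer, which matters because the values are far too large for floating point.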

ELAPSED TIME

• empty rack to benchmarks took about five days over the
course of christmas/new years 2011

• very helpful to understand our hardware limits and how
cassandra scaled

• understanding how to model the data and effectively use
cassandra took a lot longer...i’m still learning

HELPFUL TIPS

• always start with the questions you plan to ask the data

• if you know these your job just got exponentially easier

• if you never deviate from this, you’re lucky

• once you realize how powerful cassandra is, you’ll figure out
new questions that may change things

• don’t use supercolumns

MAINTENANCE

• i haven’t done serious sys-admin work in years...had to develop
tools from scratch

• cluster start up and shutdown scripts

• use good CM software (we use puppet)

• OS, Cassandra & JVM upgrades

• cassandra-env.sh & cassandra.yaml

TRIAL AND ERROR

• a lot of testing dealing with how to organize the data

• secondary indexes

• materialized views

• i’d get failures and errors in cassandra that were solved by
changing the schema to be more efficient (based on our
questions)

• try not to think relationally, it wasn’t helping me

THIS WORKED...POORLY

uid  name    age  gender
1    chris   39   male
2    jaeden  2    male

uid  job        hobby
1    architect  jiu-jitsu
2    toddler    gaming

uid  employer  phone       address
1    csc       5555555555  123 Main St
2    mom       4444444444  123 Main St

THIS WORKED WELL

columns: 1234, 1235
architect: json blob {"age":"39", "name":"chris", "gender":"male"...}
toddler: json blob {"age":"2", "name":"jaeden", "gender":"male"...}

columns: 4567, 7364, 3453, 4554
male: json blob, json blob
chris: json blob
jaeden: json blob

WHY DID THAT WORK

• we only have to query a single table

• aren’t you glad you optimized the schemas for the questions
ahead of time?

• manual joins by reading successive column families resulted in
timeout errors even though the cluster was idle and
everything was on the same switch segment
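The single-table idea can be sketched in plain Python. This is an in-memory model under stated assumptions, not the schema from the slides: a dict of dicts stands in for one column family, the row key is the value being asked about (job, gender, name), and each column holds a duplicated JSON blob. Using the record id as the column name, and the helper names `build_lookup_table` and `lookup`, are assumptions made for illustration.

```python
import json
from collections import defaultdict


def build_lookup_table(records, query_fields):
    """Return {row_key: {column_name: json_blob}}, one row per queried value."""
    table = defaultdict(dict)
    for record_id, record in records.items():
        blob = json.dumps(record, sort_keys=True)
        for field in query_fields:
            # The same blob is duplicated under every value we expect to query by.
            table[record[field]][record_id] = blob
    return dict(table)


def lookup(table, row_key):
    """Answer a question with a single row read: no joins across tables."""
    return [json.loads(blob) for blob in table.get(row_key, {}).values()]


if __name__ == "__main__":
    people = {
        "1234": {"name": "chris", "age": "39", "gender": "male", "job": "architect"},
        "1235": {"name": "jaeden", "age": "2", "gender": "male", "job": "toddler"},
    }
    table = build_lookup_table(people, ["job", "gender", "name"])
    print(lookup(table, "male"))
```

The cost is duplicated data on every write; the benefit is that each question maps to exactly one row read.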

LESSONS LEARNED

• if your data changes frequently, de-normalization is annoying,
but can be solved with discipline

• give yourself a lot of experimentation time if you’re new to
cassandra

• if you are hitting problems...likely you’re doing it wrong

TECHNICAL TIPS

• use ‘-pr’ to repair each node at least every gc_grace_seconds

• script which staggers weekly repairs across each node

• once you assign a token ID, you can remove it from
cassandra.yaml and keep the same file across nodes

• you are free to use the Thrift bindings for the language of your
choice, but save yourself time and use a high level client (e.g.
for Java, Python, Scala, PHP, Erlang, etc.)
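The staggered repair tip can be sketched as a small schedule generator. This is a minimal sketch, not the script from the talk: the host names are hypothetical, the 864000 second figure is Cassandra's default gc_grace_seconds rather than a value taken from the slides, and the function only builds the `nodetool` command strings instead of running them.

```python
GC_GRACE_SECONDS = 864000      # Cassandra's default gc_grace_seconds (10 days)
WEEK_SECONDS = 7 * 24 * 3600


def staggered_repairs(hosts, interval_seconds=WEEK_SECONDS):
    """Spread one 'repair -pr' per host evenly across the interval.

    Returns a list of (offset_seconds, command) tuples, one per host.
    """
    if not hosts:
        return []
    if interval_seconds >= GC_GRACE_SECONDS:
        raise ValueError("repair interval must be shorter than gc_grace_seconds")
    step = interval_seconds // len(hosts)
    return [
        (index * step, "nodetool -h %s repair -pr" % host)
        for index, host in enumerate(hosts)
    ]


if __name__ == "__main__":
    # Hypothetical host names; substitute the real node addresses.
    for offset, command in staggered_repairs(["cass1", "cass2", "cass3"]):
        print("+%6ds  %s" % (offset, command))
```

Because `-pr` repairs only each node's primary range, every node has to appear in the schedule for the whole ring to be covered within the interval.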

HOW I SPENT MY TIME

• i’d spend a few hours writing code to load data into
cassandra, then another few hours writing code to retrieve it

• the data browsers aren’t great and are unhelpful with blobs

• then i’d profile the performance, tweak the code, tweak the
schema, reload the data and repeat until i was happy

ANALYTICS

• all server side analytics are developed in python using
subprocesses for parallel performance

• pycassa is our cassandra client library

• our web layer is currently ruby on rails, but we might end up
going with django to stay language consistent
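A minimal sketch of the subprocess approach to analytics, assuming "subprocesses" means worker processes in the style of the standard `multiprocessing` module (the slides do not say which module was used). The event tuples and the function names `count_events` and `parallel_event_counts` are hypothetical; real input would come from cassandra through the client library.

```python
from collections import Counter
from multiprocessing import Pool


def count_events(chunk):
    """Worker: tally events per host for one chunk of (host, event) pairs."""
    return Counter(host for host, _event in chunk)


def parallel_event_counts(chunks, workers=2):
    """Fan chunks out to worker processes and merge the partial tallies."""
    if workers > 1:
        with Pool(processes=workers) as pool:
            partials = pool.map(count_events, chunks)
    else:
        partials = map(count_events, chunks)   # in-process fallback
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    sample = [
        [("host1", "login"), ("host2", "login")],
        [("host1", "sudo"), ("host1", "logout")],
    ]
    print(parallel_event_counts(sample))
```

Each worker returns a small partial result, so only the tallies cross process boundaries rather than the raw data.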

SHOW STOPPERS

• dealing with an incredibly annoying recurring JMX crash, but it
doesn’t seem to affect cassandra stability

• other cassandra sites haven’t seen this, so it may just be a
consequence of java6 on gentoo

• commitlog_total_space_in_mb was being ignored in 1.1.3 (FIXED)

RECENT SHOW STOPPERS

• 1.1.3 accidentally removed the ability to drop column families

• pick your poison - full disks or data that never goes away

• recent v6 JVM patches required setting the per-thread stack size to 180k

• nodes were up individually, zero log errors, gossip is up, but
the nodes weren’t talking collectively

• cassandra solves a need, but bugs like this make my customers
wary

ROAD AHEAD
cql
map/reduce
solr
ops center

SHOUT OUT

• the folks at datastax have been very helpful

• Tyler Hobbs (cassandra developer)

• Darren Sack (accounts)

• Michael Shaler (biz dev)

• everyone in #cassandra on irc.freenode.org

QUESTIONS?

• cnkeller@gmail.com

• @cnkeller

• http://www.linkedin.com/in/christopherkeller
