Cassandra Silicon Valley

REAL WORLD CASSANDRA AT NASA

Christopher Keller
December 13th, 2012

THANKS!
I failed to copy this to iCloud after the DC presentation

WHO AM I?

• a CSC solutions architect working at the advanced
supercomputing facility (NAS) at NASA Ames in silicon valley

• consulted at various federal agencies during the tech boom of
the 90’s

• classified and unclassified

• http://about.me/christopherkeller

WHO I’M NOT

• a cassandra expert

• someone pushing a corporate agenda

ENVIRONMENT

• unix based enterprise (desktops, servers, supercomputers)

• heavy writes around the clock from incoming data, but far
fewer analytical reads

• we retain the data in a raw format, but it does not need to be
in a database (however we can easily load old data)

• we need flexibility as technology and our requirements evolve
over time

THE PROBLEM

• TL;DR: how to use all of our available data to make
supercomputing more secure for our customers

• replace a COTS security event management system

• poor query performance

• difficult to extend and integrate with our custom software

• pre-defined analytics were a big plus, but more overall
minuses for our environment

WHY CASSANDRA

• snapshotting for backups was lightning fast

• no single point of failure

• reads are fast, writes are faster

• i did research other solutions (couchbase, hbase, mongo, riak,
etc), but didn’t find anything compelling enough to trial

WHY CASSANDRA

• simple clustering = win

• availability + scalability + replication

• built in data expiration was key

• enabling technology that allowed us to ask new questions

IN THE BEGINNING...

• set up a virtualized three node cluster on a spare server

• wrote the cassandra equivalent of “hello world” to check

• replication / availability

• data expiration

• rough performance estimates

ARE YOU KIDDING ME?

• selling cassandra to management was easier than i thought

• the NAS is very receptive to new technology even though we
prefer to be system integrators rather than developers

• my testing showed that cassandra works...shocking!!!

• open source resources are good; DataStax being able to
provide support after i leave is better

TAX DOLLARS AT WORK

• bought five servers for around 22.5k

• 3 of them for our production cluster

• 1 for our data parsing and loading

• 1 for our analytics

• those were our only purchases, the rest has been primarily my
labor hours

[chart: write operations. y-axis: Operations (5000 to 30000); x-axis: Elapsed Time; series: 6 nodes, 9 nodes (v), 9 nodes (p)]

[chart: latency. y-axis: Milliseconds; x-axis: Elapsed Time; series: 6 nodes, 9 nodes (v), 9 nodes (p)]

http://christophernkeller.tumblr.com/post/15242366864/cassandra-benchmarks

TAKEAWAY

• bare metal > virtualized w/ assigned disks > fully virtualized

• match your hardware to your environment, expertise, and
requirements

CURRENT CLUSTER

• gentoo running xen 4.1.2 & apache cassandra 1.1.3

• three virtual nodes per physical server

• 7 cpu’s, 15gig RAM, 1.2 TB disk

• eight disks per physical server

• 2 running the hypervisor + OS in a RAID 1

• 2 disks per virtual machine in a RAID 0

ELAPSED TIME

• empty rack to benchmarks took about five days over the
course of christmas/new years 2011

• very helpful to understand our hardware limits and how
cassandra scaled

• understanding how to model the data and effectively use
cassandra took a lot longer...i’m still learning

HELPFUL TIPS

• always start with the questions you plan to ask the data

• if you know these your job just got exponentially easier

• if you never deviate from this, you’re lucky

• once you realize how powerful cassandra is, you’ll figure out
new questions that may change things

• don’t use supercolumns

MAINTENANCE

• i haven’t done serious sys-admin in years...had to develop tools
from scratch

• cluster start up and shutdown scripts

• use good CM software (we use puppet)

• OS, Cassandra & JVM upgrades

• cassandra-env.sh & cassandra.yaml
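
A cluster start up and shutdown script can be sketched as a plan of commands that is printed first and only executed on request. This is a minimal illustration, not the tooling from the talk: the use of ssh, the init script path `/etc/init.d/cassandra`, and the function names are all assumptions. The one Cassandra-specific call is `nodetool drain`, which flushes a node and stops it accepting writes before the process is stopped.

```python
import subprocess

# Assumed service commands; adjust to however cassandra is installed on your nodes.
STOP_COMMAND = "/etc/init.d/cassandra stop"
START_COMMAND = "/etc/init.d/cassandra start"


def shutdown_plan(nodes, stop_command=STOP_COMMAND):
    """Drain each node with nodetool, then stop its service over ssh."""
    plan = []
    for node in nodes:
        plan.append(["nodetool", "-h", node, "drain"])
        plan.append(["ssh", node, stop_command])
    return plan


def startup_plan(nodes, start_command=START_COMMAND):
    """Start the service on each node in the order given."""
    return [["ssh", node, start_command] for node in nodes]


def run(plan, dry_run=True):
    """Print the plan; only execute it when dry_run is False."""
    for command in plan:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.check_call(command)


if __name__ == "__main__":
    # Hypothetical host names.
    run(shutdown_plan(["cass1", "cass2", "cass3"]))
```

Keeping the plan separate from its execution makes it easy to review what will happen to the cluster before anything is touched.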

TRIAL AND ERROR

• a lot of testing dealing with how to organize the data

• secondary indexes

• materialized views

• i’d get failures and errors in cassandra that were solved by
changing the schema to be more efficient (based on our
questions)

• try not to think relationally, it wasn’t helping me

THIS WORKED...POORLY

uid  name    age  gender
1    chris   39   male
2    jaeden  2    male

uid  job        hobby
1    architect  jiu-jitsu
2    toddler    gaming

uid  employer  phone       address
1    csc       5555555555  123 Main St
2    mom       4444444444  123 Main St

THIS WORKED WELL

column names:  1234  1235
architect      json blob  {"age":"39", "name":"chris", "gender":"male"...}
toddler        json blob  {"age":"2", "name":"jaeden", "gender":"male"...}

column names:  4567  7364  3453  4554
male           json blob  json blob
chris          json blob
jaeden         json blob

WHY DID THAT WORK

• we only have to query a single table

• aren’t you glad you optimized the schemas for the questions
ahead of time?

• manual joins by reading successive column families resulted in
timeout errors even though the cluster was idle and
everything was on the same switch segment
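
The difference between the two layouts can be sketched with plain Python dictionaries standing in for column families. This is an illustration only, not the schema from the talk: the names (`people`, `jobs`, `employers`, `by_job`) are assumptions, the records are taken from the example slides, and the dictionaries ignore everything a real cluster does (network round trips, timeouts, consistency levels).

```python
import json

# Normalized layout: three "column families" keyed by uid (the layout that worked poorly).
people = {1: {"name": "chris", "age": 39, "gender": "male"},
          2: {"name": "jaeden", "age": 2, "gender": "male"}}
jobs = {1: {"job": "architect", "hobby": "jiu-jitsu"},
        2: {"job": "toddler", "hobby": "gaming"}}
employers = {1: {"employer": "csc", "phone": "5555555555", "address": "123 Main St"},
             2: {"employer": "mom", "phone": "4444444444", "address": "123 Main St"}}


def manual_join(uid):
    """Answer "tell me everything about uid" with three successive reads."""
    record = {"uid": uid}
    reads = 0
    for column_family in (people, jobs, employers):
        record.update(column_family[uid])
        reads += 1
    return record, reads


def build_by_job():
    """Denormalize: one row per job, one column per uid, value = full JSON blob."""
    by_job = {}
    for uid in people:
        record, _ = manual_join(uid)
        by_job.setdefault(record["job"], {})[uid] = json.dumps(record, sort_keys=True)
    return by_job


by_job = build_by_job()


def who_has_job(job):
    """Answer "who has this job?" with a single row read."""
    return [json.loads(blob) for blob in by_job.get(job, {}).values()]


if __name__ == "__main__":
    print(manual_join(1))
    print(who_has_job("architect"))
```

The join cost is paid once at load time, so each question is answered by reading one row instead of walking several column families.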

LESSONS LEARNED

• if your data changes frequently, de-normalization is annoying,
but can be solved with discipline

• give yourself a lot of experimentation time if you’re new to
cassandra

• if you are hitting problems...likely you’re doing it wrong
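
One way to read "discipline" is that every change to a record goes through a single write path that refreshes all of its denormalized copies. The sketch below illustrates that idea with in-memory dictionaries. The view names (`by_name`, `by_job`, `by_gender`) and the `save_person` helper are hypothetical, and a real version would issue the equivalent inserts and removes through a cassandra client.

```python
import json

# Hypothetical denormalized "column families": row key -> {column name -> JSON blob}.
by_name = {}
by_job = {}
by_gender = {}

# Every view is listed once here, with the record field that supplies its row key.
VIEWS = [(by_name, "name"), (by_job, "job"), (by_gender, "gender")]


def save_person(uid, record, previous=None):
    """The only write path: refresh every denormalized copy of a record."""
    blob = json.dumps(record, sort_keys=True)
    for view, field in VIEWS:
        # Drop the stale copy if the value used as the row key has changed.
        if previous is not None and previous[field] != record[field]:
            view.get(previous[field], {}).pop(uid, None)
        view.setdefault(record[field], {})[uid] = blob


if __name__ == "__main__":
    chris = {"name": "chris", "job": "architect", "gender": "male"}
    save_person(1, chris)
    promoted = dict(chris, job="manager")
    save_person(1, promoted, previous=chris)
    print(by_job)
```

Adding a new question then means adding one entry to `VIEWS` and reloading, rather than hunting for every place that writes a record.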

TECHNICAL TIPS

• use ‘nodetool repair -pr’ to repair each node at least once every gc_grace_seconds

• script which staggers weekly repairs across each node

• once you assign a token ID, you can remove it from
cassandra.yaml and keep the same file across nodes

• you are free to use the Thrift bindings for the language of your
choice, but save yourself time and use a high level client (eg
Java, Python, Scala, PHP, Erlang, etc)
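
A script that staggers weekly repairs might look something like the sketch below. It only computes a schedule and builds the `nodetool ... repair -pr` command for each node. The host names, start date, and function names are hypothetical, and the 864000 second value is Cassandra's default gc_grace_seconds, which may not match your column families.

```python
from datetime import date, timedelta

GC_GRACE_SECONDS = 864000  # Cassandra's default (10 days); check your own schema
REPAIR_INTERVAL_DAYS = 7   # weekly, as in the tip above


def repair_schedule(nodes, start, interval_days=REPAIR_INTERVAL_DAYS):
    """Spread one 'repair -pr' per node across the interval."""
    if interval_days * 86400 >= GC_GRACE_SECONDS:
        raise ValueError("repair interval must be shorter than gc_grace_seconds")
    schedule = []
    for index, node in enumerate(nodes):
        # Integer division spreads the nodes over the days of the interval.
        offset = (index * interval_days) // len(nodes)
        schedule.append((start + timedelta(days=offset), node))
    return schedule


def repair_command(node):
    """Build the nodetool invocation that repairs only the node's primary range."""
    return ["nodetool", "-h", node, "repair", "-pr"]


if __name__ == "__main__":
    # Hypothetical nine node cluster.
    nodes = ["cass%d" % number for number in range(1, 10)]
    for day, node in repair_schedule(nodes, date(2012, 12, 10)):
        print(day.isoformat(), " ".join(repair_command(node)))
```

Because `-pr` repairs only each node's primary range, every node has to appear in the schedule for the whole ring to be covered.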

HOW I SPENT MY TIME

• i’d spend a few hours writing code to load data into
cassandra, then another few hours writing code to retrieve it

• the data browsers aren’t great and are unhelpful with blobs

• then i’d profile the performance, tweak the code, tweak the
schema, reload the data and repeat until i was happy

ANALYTICS

• all server side analytics are developed in python using
subprocesses for parallel performance

• pycassa is our cassandra client library

• our web layer is currently ruby on rails, but we might end up
going with django to stay language consistent
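
The subprocess approach can be sketched with the standard library alone. The example below is an assumption about the general shape, not the actual NAS analytics: it starts one child Python process per chunk of events, lets them run side by side, and merges their results. The `WORKER` script, the `host` field, and the `run_parallel` name are made up for illustration, and in practice each worker would read its slice of data from cassandra rather than from stdin.

```python
import json
import subprocess
import sys

# Stand-in for a real analytic: the child process counts events per host.
WORKER = r"""
import json, sys
events = json.load(sys.stdin)
counts = {}
for event in events:
    counts[event["host"]] = counts.get(event["host"], 0) + 1
json.dump(counts, sys.stdout)
"""


def run_parallel(chunks):
    """Start one child Python process per chunk, then merge their results."""
    children = []
    for chunk in chunks:
        child = subprocess.Popen([sys.executable, "-c", WORKER],
                                 stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        child.stdin.write(json.dumps(chunk).encode("utf-8"))
        child.stdin.close()  # end of input; the child can finish parsing and start counting
        children.append(child)

    totals = {}
    for child in children:
        output = child.stdout.read()
        child.stdout.close()
        if child.wait() != 0:
            raise RuntimeError("worker exited with an error")
        for host, count in json.loads(output.decode("utf-8")).items():
            totals[host] = totals.get(host, 0) + count
    return totals


if __name__ == "__main__":
    chunks = [[{"host": "a"}, {"host": "b"}], [{"host": "a"}]]
    print(run_parallel(chunks))
```

All children are started before any result is read, so the chunks are processed concurrently rather than one after another.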

SHOW STOPPERS

• dealing with an incredibly annoying JMX recurring crash but it
doesn’t seem to affect cassandra stability

• other cassandra sites haven’t seen this, so it may just be a
consequence of java6 on gentoo

• commitlog_total_space_in_mb was being ignored in 1.1.3 (FIXED)

RECENT SHOW STOPPERS

• 1.1.3 accidentally removed the ability to drop column families

• pick your poison - full disks or data that never goes away

• recent v6 JVM patches required setting the per-thread stack size to 180k

• nodes were up individually, zero log errors, gossip is up, but
the nodes weren’t talking collectively

• cassandra solves a need, but bugs like this make my customers
wary

ROAD AHEAD
cql
map/reduce
solr
ops center

SHOUT OUT

• the folks at datastax have been very helpful

• Tyler Hobbs (cassandra developer)

• Darren Sack (accounts)

• Michael Shaler (biz dev)

• everyone in #cassandra on irc.freenode.org

QUESTIONS?

• cnkeller@gmail.com

• @cnkeller

• http://www.linkedin.com/in/christopherkeller
