Crisis Informatics is an area of research that investigates how members of the public make use of social media during times of crisis. The amount of social media data generated by a single event is significant: millions of tweets and status updates accompanied by gigabytes of photos and video. Investigating the types of digital behaviors that occur around these events requires a significant investment in designing, developing, and deploying large-scale software infrastructure for both data collection and analysis. Project EPIC at the University of Colorado has been making use of Cassandra since Spring 2012 as a solid foundation for its data collection and analysis activities. Project EPIC has collected terabytes of social media data associated with hundreds of disaster events; these data must be stored, processed, analyzed, and visualized. This talk covers how Project EPIC makes use of Cassandra and discusses some of the architectural, modeling, and analysis challenges encountered while developing the Project EPIC software infrastructure.
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
1. Using Cassandra to Support Crisis
Informatics Research
Kenneth M. Anderson
Associate Professor
Department of Computer Science
Co-Director of The Center for Software and Society
Co-Director of Project EPIC
Director of CU’s Big Data Initiative
Happy Ada Lovelace Day!
2. Ken Anderson
Associate Professor; Department of Computer Science
‣ Research Interests
• Software Architecture and Software Design
• Data-Intensive Systems and Crisis Informatics
‣ Teaching Interests
• Software Engineering; OO A&D; Data Engineering
‣ Active in Broadening Participation in Computer Science
• Led the creation of the BA in CS degree at CU
- 450 new CS majors in two years; 900 CS majors on campus
3. Project EPIC
‣ Empowering the Public with Information in Crisis
• Largest NSF-Funded Project on Crisis Informatics
- ~$4M in funding since Fall 2009
‣ Results
• ~60 research publications, 2 PostDocs, 5 PhD graduates, 4
MS graduates, 13 current PhD students
• Tweak the Tweet; 100+ data sets (~1.5B tweets)
• Software: Data collection, analytics, NLP, GIS
4. Crisis Informatics
The study of how technology is changing the way
the world responds to mass emergency events
9. [Chart: "Tweets Per Minute" for the 2013 Colorado Floods — First Nine Days (9/12/13–9/20/13, y-axis 0–140); daily average tweets per minute: 51, 31, 15, 17, 11, 7, 7, 5, 3]
12. Project EPIC Software Infrastructure
‣ EPIC Collect
• Twitter data collection infrastructure capable of collecting
24/7 with 99.9% uptime (since 2010)
- Built on top of Cassandra and designed for scalability,
availability, and flexibility
‣ EPIC Analyze
• A scalable and flexible data analytics environment that
allows Project EPIC analysts to browse, search, filter,
annotate, and process EPIC Collect data sets
- Built on top of DataStax Enterprise, Redis, Rails, & Postgres
13. Project EPIC Software Architecture
Logical Arrangement of Components
Deployed across seven servers in a CU Data Center
[Architecture diagram, three layers:
- Application Layer: EPIC Event Editor, EPIC Analyze, Splunk
- Service Layer: EPIC Collect, Pig, Hadoop, Solr
- Storage Layer: Cassandra (via DataStax Enterprise), Redis, PostgreSQL
- External data source: Twitter]
15. [Diagram: Twitter streams into the Data Center, where the Twitter Collection Service (configured via the Project EPIC Event Editor) writes tweets to a four-node Cassandra cluster and a log]
16. Why Cassandra? Flexibility: immune to changes in Tweet metadata.
[Diagram: as in slide 15; each tweet's complete JSON object ({ "id" : … }) is stored as-is]
17. Why Cassandra? Availability: tweets can be written to any node in the cluster.
[Diagram: as in slide 15]
18. Why Cassandra? Scalability: need more disk space? Add more nodes!
[Diagram: as in slide 15, with additional Cassandra nodes joining the cluster]
19. Why Cassandra? Robustness: data on nodes is automatically replicated.
[Diagram: as in slide 15]
23. Cassandra Data Model
It’s hash tables all the way down…
Row Key 1 → { Column Name A: Value, …, Column Name X: Value }
…
Row Key N → { Column Name B: Value, …, Column Name Y: Value }
The design of row keys is critical.
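The "hash tables all the way down" structure can be sketched in plain Python as a dict of dicts. This is a hypothetical illustration of the pre-CQL Cassandra data model; `column_family` and `lookup` are made-up names, not Project EPIC code.

```python
# A column family as nested hash tables: row key -> {column name -> value}.
# Hypothetical sketch of the pre-CQL Cassandra data model, not Project EPIC code.
column_family = {
    "row_key_1": {"column_a": "value", "column_x": "value"},
    "row_key_n": {"column_b": "value", "column_y": "value"},
}

def lookup(row_key, column_name):
    """Fetch one value, as a Cassandra read by (row key, column name) would."""
    return column_family[row_key][column_name]

print(lookup("row_key_1", "column_a"))  # -> value
```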
24. Why?
‣ Row keys determine what you can retrieve
• They are your primary means to make a query and retrieve
relevant data; their structure determines query expressivity
• It should be easy to generate them from elements of your
problem domain
‣ Row keys determine how “wide” your rows are
• This is important because Cassandra replicates rows
‣ Row keys are partitioned across your cluster’s nodes
• A “bad” row key design can negatively impact performance
25. Row Keys Should Reflect Problem Domain
‣ You need to be able to generate row keys easily from
information in your problem domain
<region_name>:<entity_name>:<time_collected>
vs
751e8446ede178f10fd44e3a37affb6b15ed30ce
‣ The former: easily generated from domain objects
• easily reconstructed at query time
‣ The latter might be easily generated
• but not easily reconstructed
26. The Reason?
‣ No easy way to ask Cassandra for all row keys in a
column family
• If you want to get this information, you have to query
Cassandra for it, in batches, until all row keys have been
retrieved
- This is not an O(1) operation!
‣ Instead, it’s better if you can skip this step and
reconstruct from your problem domain
• US_EastCoast:Invoices:0000_01012014 to
US_EastCoast:Invoices:2359_12312014
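The reconstruction idea can be sketched as a plain generator that rebuilds candidate row keys from domain knowledge alone, with no Cassandra key scan. The key layout comes from the slide; `invoice_row_keys` and the once-per-day granularity are assumptions for illustration.

```python
from datetime import date, timedelta

def invoice_row_keys(region, start, end):
    """Yield <region>:Invoices:<HHMM>_<MMDDYYYY> row keys for each day in
    [start, end], reconstructed from the problem domain rather than by
    asking Cassandra to enumerate its keys."""
    day = start
    while day <= end:
        # One key per day at 0000 for brevity; a real domain might need
        # one key per hour or per minute.
        yield f"{region}:Invoices:0000_{day.strftime('%m%d%Y')}"
        day += timedelta(days=1)

keys = list(invoice_row_keys("US_EastCoast", date(2014, 1, 1), date(2014, 1, 3)))
print(keys[0])  # -> US_EastCoast:Invoices:0000_01012014
```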
27. Wide vs. Narrow
‣ You can design “wide” rows or “narrow” rows
• This corresponds to returning a LOT of information for a
given key, or a limited amount of information
fb_users_dk → user 1; user 2; … user 100,000; …  (wide)
ken_age_ht → age; height  (narrow)
• Wide rows can be useful, for instance, if your domain has
lots of “events” on a given day or within a given hour
28. The Rub? Rows Get Replicated
Cassandra Cassandra Cassandra Cassandra
As previously mentioned, rows get replicated
For wide rows, this can be a performance concern.
How wide is too wide?
Depends on size of cluster and network bandwidth
29. Row Keys Get Partitioned
‣ The nodes in your cluster divide up the key space
between them
• The value of a row key determines where it will get stored
‣ You have to be cognizant of this partition because often
Cassandra is being used in situations where a LOT of
data is being written to it
• You need to make sure your row key design does not
overburden any one node in your cluster
30. Imagine your row_key is a monotonically increasing integer (say, tweet ids)
[Diagram: the Twitter Collection Service writing to a four-node Cassandra cluster]
Over a single day, all tweets might be saved on just one node in the cluster; the others would remain idle!
31. Instead, you want enough variation that keys get evenly distributed across the cluster
[Diagram: a Writer distributes row_key_1, row_key_a, row_key_$, and row_key_2 across the four Cassandra nodes; a Reader retrieves from the same cluster]
32. Design of Row Key for EPIC Collect
‣ For Project EPIC, we make use of a “hybrid” row key
• The first part of the row_key is a keyword used to collect
tweets for a given event
- earthquake, flood, cowx, obama, …
• The second part of the row_key is the Julian day that a tweet
was collected on
- January 1, 2014 equals “2014001”; February 1, 2014 equals
“2014032”; etc.
• The third part of the row_key is the last digit of an MD5
hash of the entire Tweet JSON object
- i.e. 0-9, a-f; This is used to distribute tweets across the cluster
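Putting the three parts together, the hybrid row key might be built as follows. This is a sketch: `epic_row_key` is a hypothetical name and the exact formatting is an assumption, but the keyword, YYYYDDD Julian day, and last MD5 hex digit follow the slide.

```python
import hashlib
from datetime import date

def epic_row_key(keyword, collected_on, tweet_json):
    """Build a keyword:julian_day:tag hybrid row key (sketch, not EPIC code).

    keyword:    the collection keyword, e.g. "flood"
    julian_day: YYYYDDD, e.g. 2014-02-01 -> "2014032"
    tag:        last hex digit of the MD5 of the full tweet JSON,
                used to spread a day's tweets across the cluster
    """
    julian = f"{collected_on.year}{collected_on.timetuple().tm_yday:03d}"
    tag = hashlib.md5(tweet_json.encode("utf-8")).hexdigest()[-1]
    return f"{keyword}:{julian}:{tag}"

key = epic_row_key("flood", date(2014, 2, 1), '{"id": 123, "text": "This flood is ..."}')
# key has the form "flood:2014032:<hex digit>"
```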
33. Tweets Column Family
keyword:julian_day:tag → { Tweet Id 1: JSON, …, Tweet Id N: JSON }
keyword2:julian_day:tag → { Tweet Id 1: JSON, …, Tweet Id M: JSON }
…
‣ keyword: a word of interest for an event; e.g. “flood”
‣ julian_day: the day of the year a tweet was collected
‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
36. EPIC Analyze
A data analytics environment for large Twitter Data Sets
‣ Provides a scalable and extensible analysis environment
• Aims to partially automate Project EPIC’s analysis work
- Automatically calculate common metrics on all data sets
- Apply new analysis algorithms to entire data sets at once
- Support filtering/sampling on large data sets
- Support shared data set annotation by a team of analysts
• Provide these features while
- supporting data sets of millions of tweets
- with fast performance so as not to interrupt analysis work
37. [Diagram: Project EPIC Web Apps and 3rd Party Analytics Apps sit on top of DataStax Enterprise (Hadoop, Cassandra, Solr, Pig) and Redis, with Twitter and Facebook as external data sources]
38. Challenges
‣ Recall: goal of EPIC Collect is to store events in a reliable,
scalable fashion
‣ Data not necessarily structured to support analysis
• Implication: Need for Migration/Duplication to enable
features such as searching, filtering, analysis, etc.
39. Data Migration and Duplication
‣ With EPIC Collect, we chose to have fairly “wide” rows
• Each row stores the tweets that contain a given keyword for
a given day
- “All tweets that contain the word ‘flood’ collected on 01/01/14”
- We use the “tag” to keep the row from growing too large, but
there can still be 100s of 1000s of tweets in each row
‣ To support searching/filtering, we want to use Solr
• however, Solr requires “narrow” structured rows
- one tweet per row, each column defined by a schema
40. We go from…
flood:2014002:a → tweet_1: {“text” : “This flood is …” …}, tweet_2, tweet_3, …, tweet_999999, …, tweet_N  (one wide row)
To this…
row_key_for_tweet_1 → tweet_1_attributes
row_key_for_tweet_2 → tweet_2_attributes
row_key_for_tweet_3 → tweet_3_attributes
…
row_key_for_tweet_N → tweet_N_attributes  (one narrow row per tweet)
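The core of that reshaping can be sketched with plain dictionaries. This is a hypothetical illustration: `widen_to_narrow` and the attribute whitelist are made up, and the real import script reads from Cassandra and writes rows that Solr can index.

```python
import json

def widen_to_narrow(row_key, wide_row, keep=("id", "text", "created_at")):
    """Explode one wide row {tweet_id: raw_json} into narrow rows, one per
    tweet, retaining only the attributes we want to search on."""
    narrow = {}
    for tweet_id, raw in wide_row.items():
        tweet = json.loads(raw)
        narrow[f"{row_key}:{tweet_id}"] = {k: tweet[k] for k in keep if k in tweet}
    return narrow

wide = {"t1": json.dumps({"id": "t1", "text": "This flood is ...", "lang": "en"})}
rows = widen_to_narrow("flood:2014002:a", wide)
print(rows)  # -> {'flood:2014002:a:t1': {'id': 't1', 'text': 'This flood is ...'}}
```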
41. Implications
‣ Each time a data set is “imported” into EPIC Analyze
• we must launch a script that reformats each tweet into the
“narrow row” format required by Solr
- In the future, we’ll modify collection to write tweets both ways
‣ It’s not a complete duplication
• we only store those attributes that we want to search on
‣ but it’s still significant
‣ the benefit is that we can then apply all of Solr’s powerful
search capabilities to our data sets
43. Cassandra: Strong Foundation for Project EPIC
‣ With migration to Cassandra in 2012, EPIC Collect has
been running 24/7 with minimal downtime
• Downtime usually related to network outages
• Cassandra keeps right on ticking!
‣ Has provided Project EPIC with a reliable environment to
perform a wide range of crisis informatics research
• leading to new understanding of how people use Twitter to
coordinate and collaborate during times of disaster
44. Cassandra: Strong Foundation for Project EPIC
‣ An excellent NoSQL technology but you must take time to
understand Cassandra’s advantages and its data model
• Provides flexibility, availability, scalability, and robustness
• Row keys
- difficult to get right (but that’s true of all data modeling tasks!)
- design to reflect your problem domain
- to determine width of rows (and speed of replication)
- and to partition data across your cluster
45. Thank You
Ken Anderson <ken.anderson@colorado.edu>
Project EPIC: <http://epic.cs.colorado.edu>
@epiccolorado
Department of Computer Science
University of Colorado