Crisis Informatics is an area of research that investigates how members of the public make use of social media during times of crisis. The amount of social media data generated by a single event is significant: millions of tweets and status updates accompanied by gigabytes of photos and video. Investigating the types of digital behaviors that occur around these events requires a significant investment in designing, developing, and deploying large-scale software infrastructure for both data collection and analysis. Project EPIC at the University of Colorado has been making use of Cassandra since Spring 2012 as a solid foundation for its data collection and analysis activities. Project EPIC has collected terabytes of social media data associated with hundreds of disaster events; these data must be stored, processed, analyzed, and visualized. This talk covers how Project EPIC makes use of Cassandra and discusses some of the architectural, modeling, and analysis challenges encountered while developing the Project EPIC software infrastructure.
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research
1. Using Cassandra to Support Crisis
Informatics Research
Kenneth M. Anderson
Associate Professor
Department of Computer Science
Co-Director of The Center for Software and Society
Co-Director of Project EPIC
Director of CU’s Big Data Initiative
Happy Ada Lovelace Day!
2. Ken Anderson
Associate Professor; Department of Computer Science
‣ Research Interests
• Software Architecture and Software Design
• Data-Intensive Systems and Crisis Informatics
‣ Teaching Interests
• Software Engineering; OO A&D; Data Engineering
‣ Active in Broadening Participation in Computer Science
• Led the creation of the BA in CS degree at CU
- 450 new CS majors in two years; 900 CS majors on campus
3. Project EPIC
‣ Empowering the Public with Information in Crisis
• Largest NSF-Funded Project on Crisis Informatics
- ~$4M in funding since Fall 2009
‣ Results
• ~60 research publications, 2 PostDocs, 5 PhD graduates, 4
MS graduates, 13 current PhD students
• Tweak the Tweet; 100+ data sets (~1.5B tweets)
• Software: Data collection, analytics, NLP, GIS
4. Crisis Informatics
The study of how technology is changing the way
the world responds to mass emergency events
9. [Chart: "Tweets Per Minute" for the 2013 Colorado Floods — First Nine Days (9/12/13–9/20/13, y-axis 0–140); daily average tweets per minute: 51, 31, 15, 17, 11, 7, 7, 5, 3]
12. Project EPIC Software Infrastructure
‣ EPIC Collect
• Twitter data collection infrastructure capable of collecting
24/7 with 99.9% uptime (since 2010)
- Built on top of Cassandra and designed for scalability,
availability, and flexibility
‣ EPIC Analyze
• A scalable and flexible data analytics environment that
allows Project EPIC analysts to browse, search, filter,
annotate, and process EPIC Collect data sets
- Built on top of DataStax Enterprise, Redis, Rails, & Postgres
13. Project EPIC Software Architecture
Logical Arrangement of Components
Deployed across seven servers in a CU Data Center
[Architecture diagram, three layers:
- Application Layer: EPIC Event Editor, EPIC Analyze, Splunk
- Service Layer: EPIC Collect, Pig, Hadoop, Solr
- Storage Layer: Cassandra (via DataStax Enterprise), Redis, PostgreSQL
- External data source: Twitter]
15. [Diagram: Twitter streams into the Data Center, where the Twitter Collection Service (configured via the Project EPIC Event Editor) writes tweets to a four-node Cassandra cluster and a log]
16. Why Cassandra? Flexibility: immune to changes in Tweet metadata.
[Diagram: as in slide 15; each tweet's complete JSON object ({ "id" : … }) is stored as-is]
17. Why Cassandra? Availability: tweets can be written to any node in the cluster.
[Diagram: as in slide 15]
18. Why Cassandra? Scalability: need more disk space? Add more nodes!
[Diagram: as in slide 15, with additional Cassandra nodes joining the cluster]
19. Why Cassandra? Robustness: data on nodes is automatically replicated.
[Diagram: as in slide 15]
23. Cassandra Data Model
It’s hash tables all the way down…
Row Key 1 → { Column Name A: Value, …, Column Name X: Value }
…
Row Key N → { Column Name B: Value, …, Column Name Y: Value }
The design of row keys is critical.
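The "hash tables all the way down" structure can be sketched in plain Python as a dict of dicts. This is a hypothetical illustration of the pre-CQL Cassandra data model; `column_family` and `lookup` are made-up names, not Project EPIC code.

```python
# A column family as nested hash tables: row key -> {column name -> value}.
# Hypothetical sketch of the pre-CQL Cassandra data model, not Project EPIC code.
column_family = {
    "row_key_1": {"column_a": "value", "column_x": "value"},
    "row_key_n": {"column_b": "value", "column_y": "value"},
}

def lookup(row_key, column_name):
    """Fetch one value, as a Cassandra read by (row key, column name) would."""
    return column_family[row_key][column_name]

print(lookup("row_key_1", "column_a"))  # -> value
```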
24. Why?
‣ Row keys determine what you can retrieve
• They are your primary means to make a query and retrieve
relevant data; their structure determines query expressivity
• It should be easy to generate them from elements of your
problem domain
‣ Row keys determine how “wide” your rows are
• This is important because Cassandra replicates rows
‣ Row keys are partitioned across your cluster’s nodes
• A “bad” row key design can negatively impact performance
25. Row Keys Should Reflect Problem Domain
‣ You need to be able to generate row keys easily from
information in your problem domain
<region_name>:<entity_name>:<time_collected>
vs
751e8446ede178f10fd44e3a37affb6b15ed30ce
‣ The former: easily generated from domain objects
• easily reconstructed at query time
‣ The latter might be easily generated
• but not easily reconstructed
26. The Reason?
‣ No easy way to ask Cassandra for all row keys in a
column family
• If you want to get this information, you have to query
Cassandra for it, in batches, until all row keys have been
retrieved
- This is not an O(1) operation!
‣ Instead, it’s better if you can skip this step and
reconstruct from your problem domain
• US_EastCoast:Invoices:0000_01012014 to
US_EastCoast:Invoices:2359_12312014
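The reconstruction idea can be sketched as a plain generator that rebuilds candidate row keys from domain knowledge alone, with no Cassandra key scan. The key layout comes from the slide; `invoice_row_keys` and the once-per-day granularity are assumptions for illustration.

```python
from datetime import date, timedelta

def invoice_row_keys(region, start, end):
    """Yield <region>:Invoices:<HHMM>_<MMDDYYYY> row keys for each day in
    [start, end], reconstructed from the problem domain rather than by
    asking Cassandra to enumerate its keys."""
    day = start
    while day <= end:
        # One key per day at 0000 for brevity; a real domain might need
        # one key per hour or per minute.
        yield f"{region}:Invoices:0000_{day.strftime('%m%d%Y')}"
        day += timedelta(days=1)

keys = list(invoice_row_keys("US_EastCoast", date(2014, 1, 1), date(2014, 1, 3)))
print(keys[0])  # -> US_EastCoast:Invoices:0000_01012014
```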
27. Wide vs. Narrow
‣ You can design “wide” rows or “narrow” rows
• This corresponds to returning a LOT of information for a
given key, or a limited amount of information
fb_users_dk → user 1; user 2; … user 100,000; …  (wide)
ken_age_ht → age; height  (narrow)
• Wide rows can be useful, for instance, if your domain has
lots of “events” on a given day or within a given hour
28. The Rub? Rows Get Replicated
Cassandra Cassandra Cassandra Cassandra
As previously mentioned, rows get replicated
For wide rows, this can be a performance concern.
How wide is too wide?
Depends on size of cluster and network bandwidth
29. Row Keys Get Partitioned
‣ The nodes in your cluster divide up the key space
between them
• The value of a row key determines where it will get stored
‣ You have to be cognizant of this partition because often
Cassandra is being used in situations where a LOT of
data is being written to it
• You need to make sure your row key design does not
overburden any one node in your cluster
30. Imagine your row_key is a monotonically increasing integer (say, tweet ids)
[Diagram: the Twitter Collection Service writing to a four-node Cassandra cluster]
Over a single day, all tweets might be saved on just one node in the cluster; the others would remain idle!
31. Instead, you want enough variation that keys get evenly distributed across the cluster
[Diagram: a Writer distributes row_key_1, row_key_a, row_key_$, and row_key_2 across the four Cassandra nodes; a Reader retrieves from the same cluster]
32. Design of Row Key for EPIC Collect
‣ For Project EPIC, we make use of a “hybrid” row key
• The first part of the row_key is a keyword used to collect
tweets for a given event
- earthquake, flood, cowx, obama, …
• The second part of the row_key is the Julian day that a tweet
was collected on
- January 1, 2014 equals “2014001”; February 1, 2014 equals
“2014032”; etc.
• The third part of the row_key is the last digit of an MD5
hash of the entire Tweet JSON object
- i.e. 0-9, a-f; This is used to distribute tweets across the cluster
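Putting the three parts together, the hybrid row key might be built as follows. This is a sketch: `epic_row_key` is a hypothetical name and the exact formatting is an assumption, but the keyword, YYYYDDD Julian day, and last MD5 hex digit follow the slide.

```python
import hashlib
from datetime import date

def epic_row_key(keyword, collected_on, tweet_json):
    """Build a keyword:julian_day:tag hybrid row key (sketch, not EPIC code).

    keyword:    the collection keyword, e.g. "flood"
    julian_day: YYYYDDD, e.g. 2014-02-01 -> "2014032"
    tag:        last hex digit of the MD5 of the full tweet JSON,
                used to spread a day's tweets across the cluster
    """
    julian = f"{collected_on.year}{collected_on.timetuple().tm_yday:03d}"
    tag = hashlib.md5(tweet_json.encode("utf-8")).hexdigest()[-1]
    return f"{keyword}:{julian}:{tag}"

key = epic_row_key("flood", date(2014, 2, 1), '{"id": 123, "text": "This flood is ..."}')
# key has the form "flood:2014032:<hex digit>"
```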
33. Tweets Column Family
keyword:julian_day:tag → { Tweet Id 1: JSON, …, Tweet Id N: JSON }
keyword2:julian_day:tag → { Tweet Id 1: JSON, …, Tweet Id M: JSON }
…
‣ keyword: a word of interest for an event; e.g. “flood”
‣ julian_day: the day of the year a tweet was collected
‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
36. EPIC Analyze
A data analytics environment for large Twitter Data Sets
‣ Provides a scalable and extensible analysis environment
• Aims to partially automate Project EPIC’s analysis work
- Automatically calculate common metrics on all data sets
- Apply new analysis algorithms to entire data sets at once
- Support filtering/sampling on large data sets
- Support shared data set annotation by a team of analysts
• Provide these features while
- supporting data sets of millions of tweets
- with fast performance so as not to interrupt analysis work
37. [Diagram: Project EPIC Web Apps and 3rd Party Analytics Apps sit on top of DataStax Enterprise (Hadoop, Cassandra, Solr, Pig) and Redis, with Twitter and Facebook as external data sources]
38. Challenges
‣ Recall: goal of EPIC Collect is to store events in a reliable,
scalable fashion
‣ Data not necessarily structured to support analysis
• Implication: Need for Migration/Duplication to enable
features such as searching, filtering, analysis, etc.
39. Data Migration and Duplication
‣ With EPIC Collect, we chose to have fairly “wide” rows
• Each row stores the tweets that contain a given keyword for
a given day
- “All tweets that contain the word ‘flood’ collected on 01/01/14”
- We use the “tag” to keep the row from growing too large, but
there can still be 100s of 1000s of tweets in each row
‣ To support searching/filtering, we want to use Solr
• however, Solr requires “narrow” structured rows
- one tweet per row, each column defined by a schema
40. We go from…
flood:2014002:a → tweet_1: {“text” : “This flood is …” …}, tweet_2, tweet_3, …, tweet_999999, …, tweet_N  (one wide row)
To this…
row_key_for_tweet_1 → tweet_1_attributes
row_key_for_tweet_2 → tweet_2_attributes
row_key_for_tweet_3 → tweet_3_attributes
…
row_key_for_tweet_N → tweet_N_attributes  (one narrow row per tweet)
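The core of that reshaping can be sketched with plain dictionaries. This is a hypothetical illustration: `widen_to_narrow` and the attribute whitelist are made up, and the real import script reads from Cassandra and writes rows that Solr can index.

```python
import json

def widen_to_narrow(row_key, wide_row, keep=("id", "text", "created_at")):
    """Explode one wide row {tweet_id: raw_json} into narrow rows, one per
    tweet, retaining only the attributes we want to search on."""
    narrow = {}
    for tweet_id, raw in wide_row.items():
        tweet = json.loads(raw)
        narrow[f"{row_key}:{tweet_id}"] = {k: tweet[k] for k in keep if k in tweet}
    return narrow

wide = {"t1": json.dumps({"id": "t1", "text": "This flood is ...", "lang": "en"})}
rows = widen_to_narrow("flood:2014002:a", wide)
print(rows)  # -> {'flood:2014002:a:t1': {'id': 't1', 'text': 'This flood is ...'}}
```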
41. Implications
‣ Each time a data set is “imported” into EPIC Analyze
• we must launch a script that reformats each tweet into the
“narrow row” format required by Solr
- In the future, we’ll modify collection to write tweets both ways
‣ It’s not a complete duplication
• we only store those attributes that we want to search on
‣ but it’s still significant
‣ the benefit is that we can then apply all of Solr’s powerful
search capabilities to our data sets
43. Cassandra: Strong Foundation for Project EPIC
‣ With migration to Cassandra in 2012, EPIC Collect has
been running 24/7 with minimal downtime
• Downtime usually related to network outages
• Cassandra keeps right on ticking!
‣ Has provided Project EPIC with a reliable environment to
perform a wide range of crisis informatics research
• leading to new understanding of how people use Twitter to
coordinate and collaborate during times of disaster
44. Cassandra: Strong Foundation for Project EPIC
‣ An excellent NoSQL technology but you must take time to
understand Cassandra’s advantages and its data model
• Provides flexibility, availability, scalability, and robustness
• Row keys
- difficult to get right (but that’s true of all data modeling tasks!)
- design to reflect your problem domain
- to determine width of rows (and speed of replication)
- and to partition data across your cluster
45. Thank You
Ken Anderson <ken.anderson@colorado.edu>
Project EPIC: <http://epic.cs.colorado.edu>
@epiccolorado
Department of Computer Science
University of Colorado