Using Cassandra to Support Crisis 
Informatics Research 
Kenneth M. Anderson 
Associate Professor 
Department of Computer Science 
Co-Director of The Center for Software and Society 
Co-Director of Project EPIC 
Director of CU’s Big Data Initiative 
Happy Ada Lovelace Day!
Ken Anderson 
Associate Professor; Department of Computer Science 
‣ Research Interests 
• Software Architecture and Software Design 
• Data-Intensive Systems and Crisis Informatics 
‣ Teaching Interests 
• Software Engineering; OO A&D; Data Engineering 
‣ Active in Broadening Participation in Computer Science 
• Led the creation of the BA in CS degree at CU 
- 450 new CS majors in two years; 900 CS majors on campus
Project EPIC 
‣ Empowering the Public with Information in Crisis 
• Largest NSF-Funded Project on Crisis Informatics 
- ~$4M in funding since Fall 2009 
‣ Results 
• ~60 research publications, 2 PostDocs, 5 PhD graduates, 4 
MS graduates, 13 current PhD students 
• Tweak the Tweet; 100+ data sets (~1.5B tweets) 
• Software: Data collection, analytics, NLP, GIS
Crisis Informatics 
The study of how technology is changing the way 
the world responds to mass emergency events
70K Geotagged Tweets 
prior/during/after 
Hurricane Sandy Landfall
2013 Colorado Floods: First Nine Days 
[Figure: Tweets Per Minute (y-axis 0 to 140), 9/12/13 12:00 AM through 9/20/13 12:00 PM] 
Average Tweets Per Minute: 51, 31, 15, 17, 11, 7, 7, 5, 3
Project EPIC Software Infrastructure 
‣ EPIC Collect 
• Twitter data collection infrastructure capable of collecting 
24/7 with 99.9% uptime (since 2010) 
- Built on top of Cassandra and designed for scalability, 
availability, and flexibility 
‣ EPIC Analyze 
• A scalable and flexible data analytics environment that 
allows Project EPIC analysts to browse, search, filter, 
annotate, and process EPIC Collect data sets 
- Built on top of DataStax Enterprise, Redis, Rails, & Postgres
Project EPIC Software Architecture 
Logical Arrangement of Components 
Deployed across seven servers in a CU Data Center 
[Diagram: three layers (Application, Service, Storage). Application layer: EPIC Event Editor, EPIC Analyze, Splunk. Other components shown: Twitter, EPIC Collect, Redis, PostgreSQL, Pig, Hadoop, Solr, Cassandra, DataStax Enterprise]
EPIC Collect
[Diagram components: Twitter, Data Center, Twitter Collection Service, Project EPIC Event Editor, four Cassandra nodes, Log]
Why Cassandra? 
‣ Flexibility. Immune to changes in Tweet metadata (tweets stored as JSON: { “id” : … }). 
‣ Availability. Tweets can be written to any node in the cluster. 
‣ Scalability. Need more disk space? Add more nodes! 
‣ Robustness. Data on nodes automatically replicated.
However…
Data Modeling is Wicked Hard
Getting Row Keys Right
Cassandra Data Model 
It’s hash tables all the way down… 
Row Key 1 Column Name A ••• Column Name X 
Value ••• Value 
••• 
Row Key N Column Name B ••• Column Name Y 
Value ••• Value 
The design of row keys is critical.
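The "hash tables all the way down" picture can be sketched with nested Python dictionaries. This is only an illustration of the shape of the data (row key, then column name, then value), not Cassandra's API; the keys, column names, and helper functions below are made up for the example.

```python
# Illustration only: a column family modeled as a dict of dicts.
# Outer dict: row key -> row; inner dict: column name -> value.
column_family = {
    "row_key_1": {"column_a": "value_1", "column_x": "value_2"},
    "row_key_n": {"column_b": "value_3", "column_y": "value_4"},
}


def get_column(cf, row_key, column_name):
    """Two hash lookups: first the row, then the column within it."""
    return cf[row_key][column_name]


def get_row(cf, row_key):
    """Fetching by row key returns every column stored in that row."""
    return cf[row_key]


print(get_column(column_family, "row_key_1", "column_a"))
```

Note that every lookup starts from a row key, which is why the row key design matters so much.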
Why? 
‣ Row keys determine what you can retrieve 
• They are your primary means to make a query and retrieve 
relevant data; their structure determines query expressivity 
• It should be easy to generate them from elements of your 
problem domain 
‣ Row keys determine how “wide” your rows are 
• This is important because Cassandra replicates rows 
‣ Row keys are partitioned across your cluster’s nodes 
• A “bad” row key design can negatively impact performance
Row Keys Should Reflect Problem Domain 
‣ You need to easily be able to generate row keys based on 
information in your problem domain 
<region_name>:<entity_name>:<time_collected> 
vs 
751e8446ede178f10fd44e3a37affb6b15ed30ce 
‣ The former: easily generated from domain objects 
• easily reconstructed at query time 
‣ The latter might be easily generated 
• but not easily reconstructed
The Reason? 
‣ No easy way to ask Cassandra for all row keys in a 
column family 
• If you want to get this information, you have to query 
Cassandra for it, in batches, until all row keys have been 
retrieved 
- This is not an O(1) operation! 
‣ Instead, it’s better if you can skip this step and 
reconstruct from your problem domain 
• US_EastCoast:Invoices:0000_01012014 to 
US_EastCoast:Invoices:2359_12312014
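A minimal sketch of reconstructing row keys from the problem domain, using the `<region_name>:<entity_name>:<time_collected>` layout from the example above. The function name `row_keys` and the one-hour step are assumptions for illustration; the slides do not say how finely the time component is bucketed.

```python
from datetime import datetime, timedelta


def row_keys(region, entity, start, end, step=timedelta(hours=1)):
    """Yield keys of the form <region>:<entity>:<HHMM>_<MMDDYYYY>.

    The one-hour step is an assumption; pick whatever granularity
    the writer used when it created the keys.
    """
    current = start
    while current <= end:
        yield f"{region}:{entity}:{current.strftime('%H%M_%m%d%Y')}"
        current += step


keys = list(row_keys("US_EastCoast", "Invoices",
                     datetime(2014, 1, 1, 0, 0),
                     datetime(2014, 1, 1, 3, 0)))
print(keys)
```

Because the keys are computed rather than fetched, a reader can go straight to the rows it needs without first paging through every row key in the column family.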
Wide vs. Narrow 
‣ You can design “wide” rows or “narrow” rows 
• This corresponds to returning a LOT of information for a 
given key or a limited amount of information 
   fb_users_dk: user 1; user 2; … user 100,000; …   (wide) 
   ken_age_ht: age; height   (narrow) 
• Wide rows can be useful, for instance, if your domain has 
lots of “events” on a given day or within a given hour
The Rub? Rows Get Replicated 
Cassandra Cassandra Cassandra Cassandra 
As previously mentioned, rows get replicated 
For wide rows, this can be a performance concern. 
How wide is too wide? 
Depends on size of cluster and network bandwidth
Row Keys Get Partitioned 
‣ The nodes in your cluster divide up the key space 
between them 
• The value of a row key determines where it will get stored 
‣ You have to be cognizant of this partition because often 
Cassandra is being used in situations where a LOT of 
data is being written to it 
• You need to make sure your row key design does not 
overburden any one node in your cluster
Imagine your row_key is a monotonically increasing integer 
Say, for instance, tweet ids 
If keys are placed on nodes in sorted order (an order-preserving partitioner), 
then over a single day, all tweets might be saved on just one 
node in the cluster; the others would remain idle!
Instead, you want enough variation that keys get 
evenly distributed across the cluster 
[Diagram: a Writer and a Reader working against four Cassandra nodes, with row_key_1, row_key_a, row_key_$, and row_key_2 each on a different node]
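The contrast can be sketched in a few lines. This is a toy model, not Cassandra's placement logic: `BOUNDARIES` is a hypothetical split of an integer key space among four nodes, `range_node` stands in for order-preserving placement, and `hashed_node` stands in for hash-based placement (a real cluster assigns token ranges rather than taking a modulo).

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split of an integer key space among four nodes.
BOUNDARIES = [250_000, 500_000, 750_000]


def range_node(tweet_id):
    """Order-preserving placement: contiguous id ranges map to one node."""
    return bisect.bisect_right(BOUNDARIES, tweet_id)


def hashed_node(row_key, num_nodes=4):
    """Toy hash placement: MD5 of the key modulo the node count."""
    digest = hashlib.md5(str(row_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


# One "day" of monotonically increasing ids.
day_ids = range(600_000, 601_000)
by_range = Counter(range_node(i) for i in day_ids)
by_hash = Counter(hashed_node(i) for i in day_ids)
print("range placement:", dict(by_range))
print("hash placement: ", dict(by_hash))
```

Under range placement every id in the run lands on the same node, while hashing spreads the same ids over several nodes.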
Design of Row Key for EPIC Collect 
‣ For Project EPIC, we make use of a “hybrid” row key 
• The first part of the row_key is a keyword used to collect 
tweets for a given event 
- earthquake, flood, cowx, obama, … 
• The second part of the row_key is the Julian day that a tweet 
was collected on 
- January 1, 2014 equals “2014001”; February 1, 2014 equals 
“2014032”; etc. 
• The third part of the row_key is the last digit of an MD5 
hash of the entire Tweet JSON object 
- i.e. 0-9, a-f; This is used to distribute tweets across the cluster
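The three-part key described above can be sketched as follows. The function names are hypothetical, and the exact bytes that EPIC Collect hashes are not given in the slides, so hashing the UTF-8 encoding of the JSON text is an assumption.

```python
import hashlib
import json
from datetime import date


def julian_day(day):
    """Format a date as year plus day-of-year, e.g. 2014-01-01 -> '2014001'."""
    return day.strftime("%Y%j")


def tag_for(tweet_json):
    """Last hex digit (0-9, a-f) of the MD5 hash of the tweet's JSON text."""
    return hashlib.md5(tweet_json.encode("utf-8")).hexdigest()[-1]


def row_key(keyword, collected_on, tweet_json):
    """Build keyword:julian_day:tag."""
    return f"{keyword}:{julian_day(collected_on)}:{tag_for(tweet_json)}"


tweet = json.dumps({"id": 1, "text": "This flood is ..."})  # made-up tweet
print(row_key("flood", date(2014, 2, 1), tweet))
```

The keyword and day keep the key meaningful in the problem domain, while the hash-derived tag splits each keyword/day across 16 rows.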
Tweets Column Family 
keyword:day:tag Tweet Id 1 ••• Tweet Id N 
JSON ••• JSON 
••• 
keyword2:day:tag Tweet Id 1 ••• Tweet Id M 
JSON ••• JSON 
••• 
‣ keyword: a word of interest for an event; e.g. “flood” 
‣ julian_day: the day of the year a tweet was collected 
‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
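Because the tag takes only 16 values, a reader can reconstruct every row key for a keyword and day without asking Cassandra for a key listing. A small sketch, assuming the year-plus-day-of-year format shown earlier (“2014001”); `keys_for_day` is a hypothetical helper name.

```python
from datetime import date

HEX_TAGS = "0123456789abcdef"


def keys_for_day(keyword, day):
    """All 16 row keys that can hold tweets for one keyword on one day."""
    julian = day.strftime("%Y%j")
    return [f"{keyword}:{julian}:{tag}" for tag in HEX_TAGS]


keys = keys_for_day("flood", date(2014, 1, 2))
print(keys[0], "...", keys[-1])
```

Reading a full day for one keyword then means fetching these 16 rows and merging their columns.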
Row Key Distribution 
[Diagram: four Cassandra nodes; row keys flood:002:0 through flood:002:f distributed among them; replication of flood:002:0, flood:002:1, …]
EPIC Analyze
EPIC Analyze 
A data analytics environment for large Twitter Data Sets 
‣ Provides a scalable and extensible analysis environment 
• Aims to partially automate Project EPIC’s analysis work 
- Automatically calculate common metrics on all data sets 
- Apply new analysis algorithms to entire data sets at once 
- Support filtering/sampling on large data sets 
- Support shared data set annotation by a team of analysts 
• Provide these features while 
- supporting data sets of millions of tweets 
- with fast performance so as not to interrupt analysis work
[Diagram components: Project EPIC Web Apps, 3rd Party Analytics Apps, DataStax Enterprise, Hadoop, Cassandra, Solr, Pig, Redis, Twitter, Facebook]
Challenges 
‣ Recall: goal of EPIC Collect is to store events in a reliable, 
scalable fashion 
‣ Data not necessarily structured to support analysis 
• Implication: Need for Migration/Duplication to enable 
features such as searching, filtering, analysis, etc.
Data Migration and Duplication 
‣ With EPIC Collect, we chose to have fairly “wide” rows 
• Each row stores the tweets that contain a given keyword for 
a given day 
- “All tweets that contain the word ‘flood’ collected on 01/01/14” 
- We use the “tag” to keep the row from growing too large, but 
there can still be 100s of 1000s of tweets in each row 
‣ To support searching/filtering, we want to use Solr 
• however, Solr requires “narrow” structured rows 
- one tweet per row, each column defined by a schema
We go from… (one wide row) 
row_key: flood:2014002:a 
columns: tweet_1, tweet_2, tweet_3, … , tweet_999999, …, tweet_N 
each value: {“text” : “This flood is …” …} 
To this… (one narrow row per tweet) 
row_key_for_tweet_1: tweet_1_attributes 
row_key_for_tweet_2: tweet_2_attributes 
row_key_for_tweet_3: tweet_3_attributes 
… 
row_key_for_tweet_N: tweet_N_attributes
Implications 
‣ Each time a data set is “imported” into EPIC Analyze 
• we must launch a script that reformats each tweet into the 
“narrow row” format required by Solr 
- In the future, we’ll modify collection to write tweets both ways 
‣ It’s not a complete duplication 
• we only store those attributes that we want to search on 
‣ but it’s still significant 
‣ the benefit is that we can then apply all of Solr’s powerful 
search capabilities to our data sets
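The reformatting step can be sketched roughly as below. This is not the Project EPIC import script: `narrow_rows` and `SEARCH_FIELDS` are hypothetical names, the slides do not list which attributes are kept, and writing the resulting rows to Solr is left out.

```python
import json

# Hypothetical choice of searchable attributes; the slides do not list them.
SEARCH_FIELDS = ("text", "lang", "created_at")


def narrow_rows(wide_row, fields=SEARCH_FIELDS):
    """Turn one wide row {tweet_id: json_string} into one dict per tweet."""
    for tweet_id, raw_json in wide_row.items():
        tweet = json.loads(raw_json)
        row = {"id": tweet_id}
        # Copy only the attributes we want to search on, skipping absent ones.
        row.update({f: tweet[f] for f in fields if f in tweet})
        yield row


wide = {
    "tweet_1": json.dumps({"text": "This flood is ...", "lang": "en", "source": "web"}),
    "tweet_2": json.dumps({"text": "Stay safe", "created_at": "2014-01-02"}),
}
rows = list(narrow_rows(wide))
print(rows)
```

Dropping attributes that will never be searched is what keeps the copy from being a complete duplication of the collected data.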
Conclusions
Cassandra: Strong Foundation for Project EPIC 
‣ With migration to Cassandra in 2012, EPIC Collect has 
been running 24/7 with minimal downtime 
• Downtime usually related to network outages 
• Cassandra keeps right on ticking! 
‣ Has provided Project EPIC with a reliable environment to 
perform a wide range of crisis informatics research 
• leading to new understanding of how people use Twitter to 
coordinate and collaborate during times of disaster
Cassandra: Strong Foundation for Project EPIC 
‣ An excellent NoSQL technology but you must take time to 
understand Cassandra’s advantages and its data model 
• Provides flexibility, availability, scalability, and robustness 
• Row keys 
- difficult to get right (but that’s true of all data modeling tasks!) 
- design to reflect your problem domain 
- to determine width of rows (and speed of replication) 
- and to partition data across your cluster
Thank You 
Ken Anderson <ken.anderson@colorado.edu> 
Project Epic: <http://epic.cs.colorado.edu> 
@epiccolorado 
Department of Computer Science 
University of Colorado

Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and DriversDataStax Academy
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph DatabasesDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 

Recently uploaded

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Research

  • 1. Using Cassandra to Support Crisis Informatics Research Kenneth M. Anderson Associate Professor Department of Computer Science Co-Director of The Center for Software and Society Co-Director of Project EPIC Director of CU’s Big Data Initiative Happy Ada Lovelace Day!
  • 2. Ken Anderson Associate Professor; Department of Computer Science ‣ Research Interests • Software Architecture and Software Design • Data-Intensive Systems and Crisis Informatics ‣ Teaching Interests • Software Engineering; OO A&D; Data Engineering ‣ Active in Broadening Participation in Computer Science • Led the creation of the BA in CS degree at CU - 450 new CS majors in two years; 900 CS majors on campus
  • 3. Project EPIC ‣ Empowering the Public with Information in Crisis • Largest NSF-Funded Project on Crisis Informatics - ~4M since Fall 2009 ‣ Results • ~60 research publications, 2 PostDocs, 5 PhD graduates, 4 MS graduates, 13 current PhD students • Tweak the Tweet; 100+ data sets (~1.5B tweets) • Software: Data collection, analytics, NLP, GIS
  • 4. Crisis Informatics The study of how technology is changing the way the world responds to mass emergency events
  • 8. 70K Geotagged Tweets prior/during/after Hurricane Sandy Landfall
  • 9. [Chart: Tweets Per Minute, 2013 Colorado Floods, First Nine Days (9/12/13 to 9/20/13; vertical axis 0 to 140). Average Tweets Per Minute for each of the nine days: 51, 31, 15, 17, 11, 7, 7, 5, 3]
  • 12. Project EPIC Software Infrastructure ‣ EPIC Collect • Twitter data collection infrastructure capable of collecting 24/7 with 99.9% uptime (since 2010) - Built on top of Cassandra and designed for scalability, availability, and flexibility ‣ EPIC Analyze • A scalable and flexible data analytics environment that allows Project EPIC analysts to browse, search, filter, annotate, and process EPIC Collect data sets - Built on top of DataStax Enterprise, Redis, Rails, & Postgres
  • 13. Project EPIC Software Architecture: Logical Arrangement of Components. Deployed across seven servers in a CU Data Center. [Diagram labels: Application Layer, Service Layer, Storage Layer; EPIC Event Editor, EPIC Analyze, Splunk, Twitter, Redis, PostgreSQL, Pig, Hadoop, Solr, Cassandra, EPIC Collect, DataStax Enterprise]
  • 15. [Diagram labels: Twitter Data Center, Twitter Collection Service, Project EPIC Event Editor, Cassandra (four nodes), Log]
  • 16. Why Cassandra? Flexibility. Immune to changes in Tweet metadata. [Diagram as on slide 15, with a tweet JSON object { “id” : … }]
  • 17. Why Cassandra? Availability. Tweets can be written to any node in the cluster. [Diagram as on slide 15]
  • 18. Why Cassandra? Scalability. Need more disk space? Add more nodes! [Diagram as on slide 15, with additional Cassandra nodes]
  • 19. Why Cassandra? Robustness. Data on nodes automatically replicated. [Diagram as on slide 15]
  • 21. Data Modeling is Wicked Hard
  • 23. Cassandra Data Model: It’s hash tables all the way down… [Diagram: Row Key 1 maps to Column Name A … Column Name X, each with a Value; … ; Row Key N maps to Column Name B … Column Name Y, each with a Value] The design of row keys is critical.
  • 24. Why? ‣ Row keys determine what you can retrieve • They are your primary means to make a query and retrieve relevant data; their structure determines query expressivity • It should be easy to generate them from elements of your problem domain ‣ Row keys determine how “wide” your rows are • This is important because Cassandra replicates rows ‣ Row keys are partitioned across your cluster’s nodes • A “bad” row key design can negatively impact performance
  • 25. Row Keys Should Reflect Problem Domain ‣ You need to easily be able to generate row keys based on information in your problem domain <region_name>:<entity_name>:<time_collected> vs 751e8446ede178f10fd44e3a37affb6b15ed30ce ‣ The former: easily generated from domain objects • easily reconstructed at query time ‣ The latter might be easily generated • but not easily reconstructed
  • 26. The Reason? ‣ No easy way to ask Cassandra for all row keys in a column family • If you want to get this information, you have to query Cassandra for it, in batches, until all row keys have been retrieved - This is not an O(1) operation! ‣ Instead, it’s better if you can skip this step and reconstruct from your problem domain • US_EastCoast:Invoices:0000_01012014 to US_EastCoast:Invoices:2359_12312014
  • 27. Wide vs. Narrow ‣ You can design “wide” rows or “narrow” rows • This corresponds to returning a LOT of information for a given key or a limited amount of information [Example: wide row fb_users_dk holds user 1; user 2; … user 100,000; …, while narrow row ken_age_ht holds age; height] • Wide rows can be useful, for instance, if your domain has lots of “events” on a given day or within a given hour
  • 28. The Rub? Rows Get Replicated. [Diagram: four Cassandra nodes] As previously mentioned, rows get replicated. For wide rows, this can be a performance concern. How wide is too wide? Depends on size of cluster and network bandwidth.
  • 29. Row Keys Get Partitioned ‣ The nodes in your cluster divide up the key space between them • The value of a row key determines where it will get stored ‣ You have to be cognizant of this partition because often Cassandra is being used in situations where a LOT of data is being written to it • You need to make sure your row key design does not overburden any one node in your cluster
  • 30. Imagine your row_key is a monotonically increasing integer. Say, for instance, tweet ids. [Diagram: Twitter Collection Service writing to four Cassandra nodes] Over a single day, all tweets might be saved on just one node in the cluster; the others would remain idle!
  • 31. Instead, you want enough variation that keys get evenly distributed across the cluster. [Diagram: Writer and Reader with four Cassandra nodes holding row_key_1, row_key_a, row_key_$, row_key_2]
  • 32. Design of Row Key for EPIC Collect ‣ For Project EPIC, we make use of a “hybrid” row key • The first part of the row_key is a keyword used to collect tweets for a given event - earthquake, flood, cowx, obama, … • The second part of the row_key is the Julian day that a tweet was collected on - January 1, 2014 equals “2014001”; February 1, 2014 equals “2014032”; etc. • The third part of the row_key is the last digit of an MD5 hash of the entire Tweet JSON object - i.e. 0-9, a-f; This is used to distribute tweets across the cluster
  • 33. Tweets Column Family [Diagram: row key keyword:day:tag maps to columns Tweet Id 1 … Tweet Id N, each holding tweet JSON; row key keyword2:day:tag maps to columns Tweet Id 1 … Tweet Id M, each holding tweet JSON] ‣ keyword: a word of interest for an event; e.g. “flood” ‣ julian_day: the day of the year a tweet was collected ‣ tag: a hexadecimal character “0-9, a, b, c, d, e, f”
  • 34. Row Key Distribution and Replication [Diagram: four Cassandra nodes; row keys flood:002:0 through flood:002:f distributed across them; replication shown for flood:002:0, flood:002:1, …]
  • 36. EPIC Analyze A data analytics environment for large Twitter Data Sets ‣ Provides a scalable and extensible analysis environment • Aims to partially automate Project EPIC’s analysis work - Automatically calculate common metrics on all data sets - Apply new analysis algorithms to entire data sets at once - Support filtering/sampling on large data sets - Support shared data set annotation by a team of analysts • Provide these features while - supporting data sets of millions of tweets - with fast performance so as not to interrupt analysis work
  • 37. [Diagram labels: Project EPIC Web Apps, 3rd Party Analytics Apps, DataStax Enterprise, Twitter, Facebook, Hadoop, Cassandra, Solr, Pig, Redis]
  • 38. Challenges ‣ Recall: goal of EPIC Collect is to store events in a reliable, scalable fashion ‣ Data not necessarily structured to support analysis • Implication: Need for Migration/Duplication to enable features such as searching, filtering, analysis, etc.
  • 39. Data Migration and Duplication ‣ With EPIC Collect, we chose to have fairly “wide” rows • Each row stores the tweets that contain a given keyword for a given day - “All tweets that contain the word “flood” collected on 01/01/14” - We use the “tag” to keep the row from growing too large, but there can still be 100s of 1000s of tweets in each row ‣ To support searching/filtering, we want to use Solr • however, Solr requires “narrow” structured rows - one tweet per row, each column defined by a schema
  • 40. We go from… [one wide row: row_key flood:2014002:a holds tweet_1, tweet_2, tweet_3, … , tweet_999999, …, tweet_N, each column storing tweet JSON such as {“text” : “This flood is …” …}] To this… [one narrow row per tweet: row_key_for_tweet_1 holds tweet_1_attributes; row_key_for_tweet_2 holds tweet_2_attributes; row_key_for_tweet_3 holds tweet_3_attributes; … ; row_key_for_tweet_N holds tweet_N_attributes]
  • 41. Implications ‣ Each time a data set is “imported” into EPIC Analyze • we must launch a script that reformats each tweet into the “narrow row” format required by Solr - In the future, we’ll modify collection to write tweets both ways ‣ It’s not a complete duplication • we only store those attributes that we want to search on ‣ but it’s still significant ‣ the benefit is that we can then apply all of Solr’s powerful search capabilities to our data sets
  • 43. Cassandra: Strong Foundation for Project EPIC ‣ With migration to Cassandra in 2012, EPIC Collect has been running 24/7 with minimal downtime • Downtime usually related to network outages • Cassandra keeps right on ticking! ‣ Has provided Project EPIC with a reliable environment to perform a wide range of crisis informatics research • leading to new understanding of how people use Twitter to coordinate and collaborate during times of disaster
  • 44. Cassandra: Strong Foundation for Project EPIC ‣ An excellent NoSQL technology but you must take time to understand Cassandra’s advantages and its data model • Provides flexibility, availability, scalability, and robustness • Row keys - difficult to get right (but that’s true of all data modeling tasks!) - design to reflect your problem domain - to determine width of rows (and speed of replication) - and to partition data across your cluster
  • 45. Thank You Ken Anderson <ken.anderson@colorado.edu> Project Epic: <http://epic.cs.colorado.edu> @epiccolorado Department of Computer Science University of Colorado
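
The hybrid row key described on slide 32, and the idea from slide 26 of reconstructing row keys from the problem domain instead of asking Cassandra for them, can be sketched roughly as follows. This is a minimal illustration, not Project EPIC's code: the function names, the JSON serialization used before hashing (sorted keys, UTF-8), and the use of full "YYYYDDD" day strings are all assumptions made for the example.

```python
import hashlib
import json
from datetime import date, timedelta

# The sixteen possible tags: the last hex digit of an MD5 digest.
HEX_TAGS = "0123456789abcdef"


def julian_day(d: date) -> str:
    """Year followed by zero-padded day of year, e.g. 2014-02-01 -> '2014032'."""
    return f"{d.year}{d.timetuple().tm_yday:03d}"


def row_key(keyword: str, collected_on: date, tweet: dict) -> str:
    """Build a keyword:day:tag row key for one tweet (hypothetical helper)."""
    # Assumption: the tweet is serialized with sorted keys before hashing;
    # the slides only say the MD5 hash covers the entire tweet JSON object.
    payload = json.dumps(tweet, sort_keys=True).encode("utf-8")
    tag = hashlib.md5(payload).hexdigest()[-1]
    return f"{keyword}:{julian_day(collected_on)}:{tag}"


def keys_for_range(keyword: str, start: date, end: date) -> list:
    """Reconstruct every row key that may hold tweets for a keyword
    between start and end (inclusive), without querying the cluster."""
    keys = []
    day = start
    while day <= end:
        keys.extend(f"{keyword}:{julian_day(day)}:{t}" for t in HEX_TAGS)
        day += timedelta(days=1)
    return keys
```

Because the keyword and the collection date come straight from the problem domain, a reader only has to enumerate sixteen keys per keyword per day to find all candidate rows.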
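
The wide-to-narrow reformatting described on slides 39 to 41 might look something like the sketch below. It assumes a wide row has already been read into a plain dictionary mapping tweet ids to raw tweet JSON strings; the function name and the choice of searchable fields are hypothetical, and the step that would write each narrow row to a Solr-backed table is deliberately left out.

```python
import json


def narrow_rows(wide_row: dict, fields: tuple) -> list:
    """Turn one wide row (tweet id -> raw tweet JSON string) into one flat
    record per tweet, keeping only the attributes we want to search on."""
    rows = []
    for tweet_id, raw in wide_row.items():
        tweet = json.loads(raw)
        row = {"id": str(tweet_id)}
        for name in fields:
            # Skip attributes that are absent so changes in tweet metadata
            # do not break the import.
            if name in tweet:
                row[name] = tweet[name]
        rows.append(row)
    return rows


# Usage: a tiny stand-in for the row stored under flood:2014002:a.
wide = {
    "1": '{"text": "This flood is rising", "lang": "en", "source": "web"}',
    "2": '{"text": "Stay safe everyone"}',
}
for record in narrow_rows(wide, ("text", "lang")):
    print(record)
```

Keeping only a chosen subset of attributes reflects the point on slide 41 that the duplication is partial: only the fields intended for search are copied.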