Advanced search and 
Top-K queries in Cassandra 
Daniel Higuero 
dhiguero@stratio.com 
@dhiguero 
Andrés de la Peña 
andres@stratio.com 
@a_de_la_pena
Who are we? 
• Stratio is a Big Data Company 
• Founded in 2013 
• Commercially launched in 2014 
• 70+ employees in Madrid 
• Office in San Francisco 
• Certified Spark distribution 
#CassandraSummit 2014
CONTENTS
1. Cassandra query methods
2. Stratio Lucene based 2i implementation
3. Integrating Lucene 2i with Apache Spark
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges compared by throughput and expressiveness]
Primary key queries 
• O(1) node lookup for partition key 
• Range slices for clustering key 
• Usually requires denormalization 
[Figure: the CLIENT reaches one of three nodes by partition key (apena, aagea, dhiguero) and reads a clustering key range of dated columns such as 2014-04-10:body]
Secondary indexes queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
[Figure: the CLIENT queries three C* nodes, each holding its own 2i local column family]
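The bullets above can be pictured with a small sketch: each node keeps an inverted index over its own rows for one column, and a query has to ask every node. This is a simplified model, not Cassandra's implementation; the class and method names (LocalColumnIndex, put, lookup, queryAllNodes) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical, simplified model of a per-node secondary index on a single column. */
final class LocalColumnIndex {
    // Inverted index: indexed value -> keys of the local rows holding that value.
    private final Map<String, SortedSet<String>> postings = new HashMap<>();

    /** Indexes one local row under the value of the indexed column. */
    void put(String rowKey, String value) {
        postings.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
    }

    /** Returns the local row keys whose indexed column equals the given value. */
    SortedSet<String> lookup(String value) {
        return postings.getOrDefault(value, Collections.emptySortedSet());
    }

    /** The index is local to each node, so a query may have to visit all of them. */
    static List<String> queryAllNodes(List<LocalColumnIndex> nodes, String value) {
        List<String> result = new ArrayList<>();
        for (LocalColumnIndex node : nodes) {
            result.addAll(node.lookup(value));
        }
        return result;
    }
}
```

Because the map is keyed by the value of one column, a query over two columns cannot be answered from a single lookup, which is the limitation the last bullet points at.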
Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
[Figure: the CLIENT submits query = function(all data) to the Spark master, which reads from three C* nodes]
Combining 2i with MapReduce
• Expressiveness avoiding full scans
• Still limited by one indexed column per query
[Figure: the CLIENT talks to the Spark master, which reads from three C* nodes through their secondary indexes]
What do we miss from 2i indexes? 
MORE EXPRESSIVENESS
• Range queries 
• Multivariable search 
• Full text search 
• Sorting by fields 
• Top-k queries 
What do we like from the existing 2i? 
ITS ARCHITECTURE
• Each node indexes its own data 
• The index implementations do not need to be distributed 
• Natural extension point 
• Can be created after design and ingestion 
Thinking of a custom secondary index implementation…
WHY NOT USE LUCENE?
Why we like Lucene 
• Proven stable and fast indexing solution 
• Expressive queries 
- Multivariable, ranges, full text, sorting, top-k, etc. 
• Mature distributed search solutions built on top of it 
- Solr, ElasticSearch 
• Can be fully embedded in application code 
• Published under the Apache License 
HOW IT WORKS
Create index
• Built in the background at any moment
• Real time updates
• Mapping eases ETL
• Language aware
CREATE TABLE tweets (
  id bigint,
  createdAt timestamp,
  message text,
  userid bigint,
  username text,
  PRIMARY KEY (userid, createdAt, id) );
ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
  'refresh_seconds' : '60',
  'schema' : '{ default_analyzer : "EnglishAnalyzer",
    fields : {
      createdat : {type : "date", pattern : "yyyy-MM-dd"},
      message : {type : "text", analyzer : "EnglishAnalyzer"},
      userid : {type : "string"},
      username : {type : "string"}
    }} '};
Filtering query
SELECT * FROM tweets WHERE lucene
= '{
  filter : {type : "match",
            field : "text",
            value : "cassandra"}
}' LIMIT 10;
[Figure: the CLIENT searches for 10 rows across three C* nodes, each with a local Lucene index; one node finds 6, the next finds 4, and the search is done]
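The "search 10, found 6, found 4, we are done" flow can be sketched as a loop that stops as soon as the LIMIT is satisfied. This is only an illustration under assumptions: nodes are modeled as functions from a row limit to a list of rows, the coordinator is assumed to ask each node only for the rows still missing, and FilterCoordinator and collect are hypothetical names, not Cassandra APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical sketch of gathering a filtered (unsorted) result up to a LIMIT. */
final class FilterCoordinator {
    /**
     * Visits the nodes in order and stops once enough rows were collected.
     * Each node is modeled as a function: max rows wanted -> matching rows.
     */
    static <R> List<R> collect(List<IntFunction<List<R>>> nodes, int limit) {
        List<R> rows = new ArrayList<>();
        for (IntFunction<List<R>> node : nodes) {
            int missing = limit - rows.size();
            if (missing <= 0) {
                break; // LIMIT reached: the remaining nodes are never contacted
            }
            List<R> found = node.apply(missing);
            // Guard against a node returning more rows than requested.
            rows.addAll(found.size() > missing ? found.subList(0, missing) : found);
        }
        return rows;
    }
}
```

Early termination is only valid because a filter carries no ordering: any 10 matching rows are an acceptable answer.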
Top-k query
SELECT * FROM tweets WHERE lucene
= '{
  query : {type : "match",
           field : "text",
           value : "cassandra"}
}' LIMIT 5;
[Figure: the CLIENT asks each of three C* nodes to search its Lucene index for the top 5; the nodes find 5, 4 and 5 rows, and the 14 candidates are merged into the best 5]
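The "merge 14 to best 5" step can be sketched as follows. It assumes every row comes back with a relevance score and that a higher score is better; TopKMerge and Scored are hypothetical names for illustration, not classes of the Stratio fork.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of merging per-node top-k candidates into a global top-k. */
final class TopKMerge {
    /** A row key with its relevance score; stands in for a scored Cassandra row. */
    static final class Scored {
        final String key;
        final double score;

        Scored(String key, double score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Flattens the per-node candidates, sorts by descending score and keeps the best k. */
    static List<Scored> merge(List<List<Scored>> perNode, int k) {
        return perNode.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparingDouble((Scored s) -> s.score).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Unlike the filtering case, every node has to be asked for its own top k, since the k best rows overall could all live on a single node.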
Modifying Cassandra for generic top-k queries 
Two new methods in SecondaryIndexSearcher: 
boolean requiresFullScan(List<IndexExpression> clause);
List<Row> sort(List<IndexExpression> clause, List<Row> rows);
Two new methods in AbstractRangeCommand:
boolean requiresFullScan();
List<Row> combine(List<Row> rows);
And some changes in StorageProxy#getRangeSlice…
Queries can be as complex as you want 
SELECT * FROM tweets WHERE lucene = '{
  filter :
  {
    type : "boolean", must :
    [
      {type : "range", field : "time", lower : "2014/04/25"},
      {type : "boolean", should :
        [
          {type : "prefix", field : "user", value : "a"},
          {type : "wildcard", field : "user", value : "*b*"},
          {type : "match", field : "user", value : "fast"}
        ]
      }
    ]
  },
  sort :
  {
    fields : [ {field : "time", reverse : true},
               {field : "user", reverse : false} ]
  }
}' LIMIT 10000;
Some implementation details 
• A Lucene document per CQL row, and a Lucene field per indexed column 
• SortingMergePolicy keeps index sorted in the same way that C* does 
• Index commits synchronized with column family flushes 
• Segments merge synchronized with column family compactions 
NO MAINTENANCE REQUIRED 
QUERY BUILDER
FLUENT QUERY BUILDER 
Query builder 
• Facilitates writing index-related clauses 
• Compatible with the existing C* Query Builder 
Query builder example 
SELECT * FROM tweets WHERE lucene =
'{
  query : {type : "range",
           field : "time",
           lower : "2014/04/25",
           upper : "2014/04/30"},
  filter : {type : "match",
            field : "text",
            value : "cassandra"},
  sort :
  {
    fields : [ {field : "time",
                reverse : true} ]
  }
}' LIMIT 10;

String filter = SearchBuilders
   .filter(range("time")
             .lower("2014/04/25")
             .upper("2014/04/30"))
   .query(match("text", "cassandra"))
   .sort(sorting("time", true))
   .toJson();

QueryBuilder.select()
   .from(KEYSPACE, TABLE)
   .where(eq("lucene", filter))
   .limit(10);
LUCENE AND SPARK
Integrating Lucene & Spark 
Split friendly: it supports searches within a token range
SELECT * FROM tweets WHERE lucene = '{
  filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid, createdAt, id) > 253653456456
AND TOKEN(userid, createdAt, id) <= 3456467456756
LIMIT 10000;
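A sketch of how a Spark-style job might turn this into one query per split. It is an assumption-laden illustration: the token space is taken to be that of the Murmur3Partitioner (signed 64-bit values), the token expression is passed in as plain text, and TokenSplits, split and splitQuery are hypothetical names rather than part of Stratio Deep or Cassandra.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: cut the token space into splits and build one CQL query per split. */
final class TokenSplits {
    // Assumption: Murmur3Partitioner, whose tokens are signed 64-bit values.
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /** Returns n contiguous {start, end} pairs, each read as the range (start, end]. */
    static List<long[]> split(int n) {
        BigInteger width = MAX.subtract(MIN); // BigInteger avoids long overflow
        List<long[]> ranges = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 1; i <= n; i++) {
            BigInteger end = (i == n)
                    ? MAX
                    : MIN.add(width.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            ranges.add(new long[] { start.longValueExact(), end.longValueExact() });
            start = end;
        }
        return ranges;
    }

    /**
     * Builds the query for one split, following the shape of the slide's example.
     * Assumes luceneJson contains no single quotes, so no escaping is done here.
     */
    static String splitQuery(String luceneJson, String tokenExpr, long[] range, int limit) {
        return "SELECT * FROM tweets WHERE lucene = '" + luceneJson + "'"
                + " AND " + tokenExpr + " > " + range[0]
                + " AND " + tokenExpr + " <= " + range[1]
                + " LIMIT " + limit + ";";
    }
}
```

Each worker would then run its own splitQuery result against the nodes owning that range, so the Lucene filter is applied before any data reaches Spark.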
Integrating Lucene & Spark 
Paging friendly: it supports starting queries from a certain point
SELECT * FROM tweets WHERE lucene = '{
  filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
Integrating Lucene & Spark 
• Computes large amounts of data
• Avoids systematic full scans
• Reduces the amount of data to be processed
• Filtering push-down
[Figure: the CLIENT talks to the Spark master, which queries three C* nodes, each with a local Lucene index]
WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
Index performance in Spark 
[Figure: Time plotted against Records returned for Full scan and Lucene 2i]
Lucene indexes in C* DEMO
OTHER TOOLS
Stratio Deep 
INTEGRATING SPARK WITH DIFFERENT DATASTORES 
• Common Cell abstraction in the RDD 
• Maintain compatibility with Spark operations 
• Compatible with multiple datastore technologies 
• DeepSparkContext 
• DeepJobConfig 
• Compatible with Lucene indexes 
Stratio Crossdata 
UNIFYING BATCH AND STREAMING QUERIES 
• Single SQL-like language 
• Compatible with multiple datastore technologies 
• Connector-based architecture 
• Ability to combine data from different datastores
• Complements non-native operations with Spark
  - E.g., JOIN in Cassandra
• Custom support for Lucene-based secondary indexes
Stratio Crossdata
CREATING INDEXES
CREATE FULLTEXT INDEX ON app.users(name, email);
QUERYING THE INDEXES
SELECT * FROM app.users
WHERE email MATCH '*@stratio.com';
Conclusions 
• Added new query methods 
- Multivariable queries (AND, OR, NOT) 
- Range queries (>, >=, <, <=) and regular expressions 
- Full text queries (match, phrase, fuzzy...) 
• Top-k query support 
- Lucene scoring formula 
- Sort by field values 
• Compatible with MapReduce frameworks 
• Preserves Cassandra’s functionality 
It's open source
github.com/stratio/stratio-cassandra
• Published as a fork of Apache Cassandra
• Apache License Version 2.0
stratio.github.io/crossdata
• Apache License Version 2.0

Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014

  • 1.
    Advanced search and Top-K queries in Cassandra 1 Daniel Higuero dhiguero@stratio.com @dhiguero Andrés de la Peña andres@stratio.com @a_de_la_pena
  • 2.
    Who are we? • Stratio is a Big Data Company • Founded in 2013 • Commercially launched in 2014 • 70+ employees in Madrid • Office in San Francisco • Certified Spark distribution #CassandraSummit 2014
  • 3.
    Cassandra query methods Stratio Lucene based 2i implementation Integrating Lucene 2i with Apache Spark 1 2 3 CONTENTS
  • 4.
    primary key secondaryindexes token ranges Throughput Expressiveness Cassandra query methods #CassandraSummit 2014 4
  • 5.
    Primary key queries • O(1) node lookup for partition key • Range slices for clustering key • Usually requires denormalization Partition key CLIENT Clustering key range Node 3 Node 1 Node 2 apena 2014-04-10:body When you.. aagea dhiguero apena 2014-04-06:body 2014-04-07:body 2014-04-08:body To study and… To think and... If you see what.. 2014-04-06:body The cautious… 2014-04-10:body When you.. 2014-04-11:body When you do… #CassandraSummit 2014 5
  • 6.
    primary key secondaryindexes token ranges Throughput Expressiveness Cassandra query methods #CassandraSummit 2014 6
  • 7.
    CLIENT C* node C* node 2i local column family C* node 2i local column family 2i local column family Secondary indexes queries • Inverted index • Mitigates denormalization • Queries may involve all C* nodes • Queries limited to a single column #CassandraSummit 2014 7
  • 8.
    primary key secondaryindexes token ranges Throughput Expressiveness Cassandra query methods #CassandraSummit 2014 8
  • 9.
    C*# node# C*# node# C*# node# Spark master Token range queries • Used by MapReduce frameworks as Hadoop or Spark • All kinds of queries are possible • Low throughput • Ad-hoc queries • Batch processing • Materialized views CLIENT query= function (all data) #CassandraSummit 2014 9
  • 10.
    C*# node# C*# node# C*# node# Combining 2i with MapReduce • Expressiveness avoiding full scans • Still limited by one indexed column per query Spark CLIENT master Secondary index Secondary index Secondary index #CassandraSummit 2014 10
  • 11.
    What do wemiss from 2i indexes? MORE EXPRESIVENESS • Range queries • Multivariable search • Full text search • Sorting by fields • Top-k queries #CassandraSummit 2014 11
  • 12.
    What do welike from the existing 2i? IT’S ARCHITECTURE • Each node indexes its own data • The index implementations do not need to be distributed • Natural extension point • Can be created after design and ingestion #CassandraSummit 2014 12
  • 13.
    Thinking in acustom secondary index implementation… WHY NOT USE ? #CassandraSummit 2014 13
  • 14.
    Why we likeLucene • Proven stable and fast indexing solution • Expressive queries - Multivariable, ranges, full text, sorting, top-k, etc. • Mature distributed search solutions built on top of it - Solr, ElasticSearch • Can be fully embedded in application code • Published under the Apache License #CassandraSummit 2014 14
  • 15.
  • 16.
    ALTER TABLE tweetsADD lucene TEXT; CREATE TABLE tweets ( id bigint, createdAt timestamp, message text, userid bigint, username text, PRIMARY KEY (userid, createdAt, id) ); Create index • Built in the background in any moment • Real time updates • Mapping eases ETL • Language aware CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene) USING 'com.stratio.index.RowIndex' WITH OPTIONS = { 'refresh_seconds' : '60', 'schema' : '{ default_analyzer : "EnglishAnalyzer", fields : { createdat : {type : "date", pattern : "yyyy-MM-dd"}, message : {type : "text", analyzer : ”EnglishAnalyzer"}, userid : {type : "string"}, username : {type : "string"} }} '}; #CassandraSummit 2014 16
  • 17.
    SELECT * FROMtweets WHERE lucene = ‘{ filter : {type : "match", field : "text", value : "cassandra"} }’ LIMIT 10; search 10 found 6 found 4 We are done ! Filtering query CLIENT C* node C* node C* node Lucene index Lucene index Lucene index #CassandraSummit 2014 17
  • 18.
    Found 5 Found4 Found 5 Top-k query SELECT * FROM tweets WHERE lucene = ‘{ query: {type:”match", field : ”text”, value : “cassandra”} }’ LIMIT 5; C* node Search top-5 CLIENT Search top-5 C* node C* node Lucene index Lucene index Lucene index Merge 14 to best 5 #CassandraSummit 2014 18
  • 19.
    Modifying Cassandra forgeneric top-k queries Two new methods in SecondaryIndexSearcher: boolean'requiresFullScan(List<IndexExpression>'clause);' List<Row>'sort(List<IndexExpression>'clause,'List<Row>'rows);' Two new methods in AbstractRangeCommand: boolean'requiresFullScan();' List<Row>'combine(List<Row>'rows);' And some changes in StorageProxy#getRangeSlice… #CassandraSummit 2014 19
  • 20.
    Queries can beas complex as you want SELECT * FROM tweets WHERE lucene = ‘{ filter : { type : "boolean", must : [ {type : "range", field : "time" lower : "2014/04/25”}, {type : "boolean", should : [ {type : "prefix", field : "user", value : "a"} , {type : "wildcard", field : "user", value : "*b*"} , {type : "match", field : "user", value : "fast"} ] } ] }, sort : { fields: [ {field :"time", reverse : true}, {field : "user", reverse : false} ] } }’ LIMIT 10000; #CassandraSummit 2014 20
  • 21.
    Some implementation details • A Lucene document per CQL row, and a Lucene field per indexed column • SortingMergePolicy keeps index sorted in the same way that C* does • Index commits synchronized with column family flushes • Segments merge synchronized with column family compactions NO MAINTENANCE REQUIRED #CassandraSummit 2014 21
  • 22.
  • 23.
    FLUENT QUERY BUILDER Query builder • Facilitates writing index-related clauses • Compatible with the existing C* Query Builder #CassandraSummit 2014 23
  • 24.
    Query builder example SELECT * FROM tweets WHERE lucene = ‘{ query : {type : "range", field : "time”, lower : "2014/04/25”, upper : "2014/04/30”}, filter : {type : "match", field : "text", value : "cassandra"}, sort : { fields: [ {field :"time", reverse : true}} ] } }’ LIMIT 10; String'filter'='SearchBuilders' '''.filter(range("time")' '''''''''''''.lower("2014/04/25")' '''''''''''''.upper("2014/04/30"))' '''.query(match("text",'"cassandra")' '''.sort(sorting("time",'true))' '''.toJson();' ' QueryBuilder.select()'''''''''''''''''''''''''' '''.from(KEYSPACE,'TABLE)' '''.where(eq("lucene",'filter))' '''.limit(10) #CassandraSummit 2014 24
  • 25.
  • 26.
    Integrating Lucene &Spark Split friendly. It supports searches within a token range SELECT * FROM tweets WHERE lucene = ‘{ filter : {type:"match", field:”text", value:"cassandra"} }’ AND TOKEN(userid, createdAt, id) > 253653456456 AND TOKEN(userid, createdAt, id) <= 3456467456756 LIMIT 10000; #CassandraSummit 2014 26
  • 27.
    Integrating Lucene &Spark Paging friendly: It supports starting queries in a certain point SELECT * FROM tweets WHERE lucene = ‘{ filter : {type:”match", field:”text", value:”cassandra”} }’ AND userid = 3543534 AND createdAt > 2011-02-03 04:05+0000 LIMIT 5000; #CassandraSummit 2014 27
  • 28.
    Integrating Lucene &Spark CLIENT Spark master C* node C* node C* node Lucene Lucene Lucene • Compute large amounts of data • Avoid systematic full scan • Reduces the amount of data to be processed • Filtering push-down #CassandraSummit 2014 28
  • 29.
    WHEN TO USEINDEXES AND WHEN TO USE FULL SCAN
  • 30.
    Index performance inSpark Time Full scan Lucene 2i Records returned #CassandraSummit 2014 30
  • 31.
  • 32.
  • 33.
    Stratio Deep INTEGRATINGSPARK WITH DIFFERENT DATASTORES • Common Cell abstraction in the RDD • Maintain compatibility with Spark operations • Compatible with multiple datastore technologies • DeepSparkContext • DeepJobConfig • Compatible with Lucene indexes #CassandraSummit 2014 33
  • 34.
    Stratio Crossdata UNIFYINGBATCH AND STREAMING QUERIES • Single SQL-like language • Compatible with multiple datastore technologies • Connector-based architecture • Ability to combine data from different datastore • Complement non-native operation with Spark • E.g., JOIN in Cassandra • Custom support for Lucene-based secondary indexes #CassandraSummit 2014 34
  • 35.
    CREATING INDEXES StratioCrossdata CREATE'FULLTEXT'INDEX'ON'app.users(name,'email);' QUERYING THE INDEXES SELECT'*'FROM'app.users'' where'email'MATCH'‘*@stratio.com’;' #CassandraSummit 2014 35
  • 36.
    Conclusions • Addednew query methods - Multivariable queries (AND, OR, NOT) - Range queries (>, >=, <, <=) and regular expressions - Full text queries (match, phrase, fuzzy...) • Top-k query support - Lucene scoring formula - Sort by field values • Compatible with MapReduce frameworks • Preserves Cassandra’s functionality #CassandraSummit 2014 36
  • 37.
    github.com/stratio/stratio-cassandra • Publishedas fork of Apache Cassandra • Apache License Version 2.0 stratio.github.io/crossdata Its open source • Apache License Version 2.0 #CassandraSummit 2014 37
  • 38.
    Advanced search and Top-K queries in Cassandra 38 Daniel Higuero dhiguero@stratio.com @dhiguero Andrés de la Peña andres@stratio.com @a_de_la_pena