Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014

Daniel Higuero
dhiguero@stratio.com
@dhiguero
Andrés de la Peña
andres@stratio.com
@a_de_la_pena

Who are we?
• Stratio is a Big Data Company
• Founded in 2013
• Commercially launched in 2014
• 70+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
#CassandraSummit 2014

CONTENTS
1. Cassandra query methods
2. Stratio Lucene-based 2i implementation
3. Integrating Lucene 2i with Apache Spark

Figure: Cassandra query methods (primary key, secondary indexes, token ranges) positioned by throughput versus expressiveness.

Primary key queries
• O(1) node lookup for partition key
• Range slices for clustering key
• Usually requires denormalization
Figure: the client uses the partition key (aagea, dhiguero, apena) to locate one of Node 1, Node 2 and Node 3, then reads a clustering key range of dated columns (2014-04-06:body, 2014-04-07:body, 2014-04-08:body, ...) within that partition.


Secondary index queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
Figure: the client queries the C* nodes, each of which keeps its own 2i local column family.
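The inverted-index idea behind the bullets above can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra code: the class `LocalSecondaryIndex` and the function `query_cluster` are hypothetical names, each "node" is just an in-memory object, and the sample rows are made up. It shows why a lookup on an indexed column may have to visit every node.

```python
from collections import defaultdict


class LocalSecondaryIndex:
    """Toy inverted index for ONE column on ONE node (hypothetical)."""

    def __init__(self):
        # indexed value -> set of primary keys of the local rows holding it
        self._postings = defaultdict(set)

    def put(self, primary_key, value):
        self._postings[value].add(primary_key)

    def lookup(self, value):
        # .get avoids creating empty posting lists for unseen values
        return sorted(self._postings.get(value, set()))


def query_cluster(nodes, value):
    """Ask every node's local index, since matching rows may live anywhere."""
    hits = []
    for index in nodes:
        hits.extend(index.lookup(value))
    return sorted(hits)


# Three nodes, each indexing only the rows it owns (sample data is made up).
nodes = [LocalSecondaryIndex() for _ in range(3)]
nodes[0].put(("apena", "2014-04-10"), "cassandra")
nodes[1].put(("dhiguero", "2014-04-06"), "lucene")
nodes[2].put(("aagea", "2014-04-06"), "cassandra")

result = query_cluster(nodes, "cassandra")
```

Because each index only knows about its own node's rows, the coordinator cannot tell in advance which nodes hold matches, which is the throughput cost noted on the slide.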


Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
Figure: the client submits a query to the Spark master, which reads from every C* node; query = function(all data).

Combining 2i with MapReduce
• Expressiveness while avoiding full scans
• Still limited to one indexed column per query
Figure: the client submits a query to the Spark master, which reads from each C* node through its secondary index.

What do we miss from 2i indexes?
MORE EXPRESSIVENESS
• Range queries
• Multivariable search
• Full text search
• Sorting by fields
• Top-k queries

What do we like from the existing 2i?
ITS ARCHITECTURE
• Each node indexes its own data
• The index implementations do not need to be distributed
• Natural extension point
• Can be created after design and ingestion

Thinking about a custom secondary index implementation…
WHY NOT USE LUCENE?

Why we like Lucene
• Proven stable and fast indexing solution
• Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
• Mature distributed search solutions built on top of it
- Solr, ElasticSearch
• Can be fully embedded in application code
• Published under the Apache License

Create index
• Built in the background at any moment
• Real-time updates
• Mapping eases ETL
• Language aware
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id) );
ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{ default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message : {type : "text", analyzer : "EnglishAnalyzer"},
            userid : {type : "string"},
            username : {type : "string"}
        }} '};

Filtering query
SELECT * FROM tweets WHERE lucene
= '{
    filter : {type : "match",
              field : "text",
              value : "cassandra"}
}' LIMIT 10;
Figure: the client searches for 10 rows across the C* nodes, each with its own Lucene index: search 10, found 6, found 4, "We are done!".
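One way to read the filtering diagram (search 10, found 6, found 4, done) is that matches are collected node by node until the LIMIT is satisfied, because a filter has no ranking to respect. The sketch below models that reading only; it is an assumption about the flow, not the fork's actual code, and `filtered_search` and the sample rows are hypothetical.

```python
def filtered_search(node_results, limit):
    """Collect matches node by node and stop once `limit` rows are gathered.

    node_results: one list of matching rows per node, a hypothetical
    stand-in for each node's local index search.
    Returns (rows, nodes_contacted).
    """
    rows = []
    contacted = 0
    for matches in node_results:
        contacted += 1
        remaining = limit - len(rows)
        rows.extend(matches[:remaining])  # never take more than still needed
        if len(rows) >= limit:
            break  # limit satisfied: remaining nodes are not asked
    return rows, contacted


# Mirrors the slide: the first node finds 6 rows, the second finds 4.
node_results = [
    [f"a{i}" for i in range(6)],
    [f"b{i}" for i in range(4)],
    [f"c{i}" for i in range(7)],
]
rows, contacted = filtered_search(node_results, limit=10)
```

In this run the third node is never contacted, which is what makes an unranked filter cheaper than a top-k query.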

Top-k query
SELECT * FROM tweets WHERE lucene
= '{
    query : {type : "match",
             field : "text",
             value : "cassandra"}
}' LIMIT 5;
Figure: the client sends "search top-5" to every C* node; their Lucene indexes find 5, 4 and 5 rows, and the 14 candidates are merged to the best 5.
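The merge step in the top-k diagram can be sketched as follows. This is a minimal illustration, assuming each node returns (score, row id) pairs for its local best k; the function names `local_top_k` and `merge_top_k` and all scores are made up, and the real scoring comes from Lucene.

```python
import heapq


def local_top_k(scored_rows, k):
    """Best k (score, row_id) pairs on one node, highest score first."""
    return heapq.nlargest(k, scored_rows, key=lambda row: row[0])


def merge_top_k(per_node, k):
    """Merge every node's local top-k into the global top-k."""
    candidates = [row for rows in per_node for row in rows]
    return heapq.nlargest(k, candidates, key=lambda row: row[0])


# Hypothetical relevance scores; the second node only has 4 matches.
node_a = [(0.91, "a1"), (0.80, "a2"), (0.75, "a3"), (0.60, "a4"), (0.55, "a5"), (0.10, "a6")]
node_b = [(0.95, "b1"), (0.40, "b2"), (0.35, "b3"), (0.20, "b4")]
node_c = [(0.88, "c1"), (0.85, "c2"), (0.50, "c3"), (0.45, "c4"), (0.30, "c5"), (0.05, "c6")]

per_node = [local_top_k(node, 5) for node in (node_a, node_b, node_c)]
candidate_count = sum(len(rows) for rows in per_node)
best = merge_top_k(per_node, 5)
```

Every node must be asked for its own top k, because the globally best rows could all sit on a single node; only the merge can discard the rest.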

Modifying Cassandra for generic top-k queries
Two new methods in SecondaryIndexSearcher:
boolean requiresFullScan(List<IndexExpression> clause);
List<Row> sort(List<IndexExpression> clause, List<Row> rows);
Two new methods in AbstractRangeCommand:
boolean requiresFullScan();
List<Row> combine(List<Row> rows);
And some changes in StorageProxy#getRangeSlice…

Queries can be as complex as you want
SELECT * FROM tweets WHERE lucene = '{
    filter :
    {
        type : "boolean", must :
        [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should :
                [
                    {type : "prefix", field : "user", value : "a"} ,
                    {type : "wildcard", field : "user", value : "*b*"} ,
                    {type : "match", field : "user", value : "fast"}
                ]
            }
        ]
    },
    sort :
    {
        fields: [ {field : "time", reverse : true},
                  {field : "user", reverse : false} ]
    }
}' LIMIT 10000;

Some implementation details
• A Lucene document per CQL row, and a Lucene field per indexed column
• SortingMergePolicy keeps index sorted in the same way that C* does
• Index commits synchronized with column family flushes
• Segment merges synchronized with column family compactions
NO MAINTENANCE REQUIRED

Query builder
FLUENT QUERY BUILDER
• Facilitates writing index-related clauses
• Compatible with the existing C* Query Builder

Query builder example
SELECT * FROM tweets WHERE lucene =
'{
    query : {type : "range",
             field : "time",
             lower : "2014/04/25",
             upper : "2014/04/30"},
    filter : {type : "match",
              field : "text",
              value : "cassandra"},
    sort :
    {
        fields: [ {field : "time",
                   reverse : true} ]
    }
}' LIMIT 10;
String filter = SearchBuilders
    .filter(range("time")
        .lower("2014/04/25")
        .upper("2014/04/30"))
    .query(match("text", "cassandra"))
    .sort(sorting("time", true))
    .toJson();

QueryBuilder.select()
    .from(KEYSPACE, TABLE)
    .where(eq("lucene", filter))
    .limit(10);

Integrating Lucene & Spark
Split friendly: it supports searches within a token range
filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid, createdAt, id) > 253653456456
AND TOKEN(userid, createdAt, id) <= 3456467456756
LIMIT 10000;
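To illustrate how a MapReduce-style job could use such token-bounded searches, the sketch below cuts one token range into contiguous sub-ranges and renders the matching predicate for each split. It is only an illustration: `split_token_range` and `token_predicate` are hypothetical helpers, the split count is arbitrary, and real Spark or Hadoop input formats derive their splits from the cluster's ring metadata rather than by even division.

```python
def split_token_range(start, end, splits):
    """Cut the half-open range (start, end] into contiguous sub-ranges."""
    if splits < 1 or end - start < splits:
        raise ValueError("need 1 <= splits <= end - start")
    width = end - start
    # Integer arithmetic keeps the bounds exact for 64-bit tokens.
    bounds = [start + (width * i) // splits for i in range(splits)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def token_predicate(columns, lower, upper):
    """Render the token restriction appended to the index search."""
    cols = ", ".join(columns)
    return f"TOKEN({cols}) > {lower} AND TOKEN({cols}) <= {upper}"


# Token bounds and column list taken from the slide; 4 splits is arbitrary.
columns = ["userid", "createdAt", "id"]
ranges = split_token_range(253653456456, 3456467456756, 4)
predicates = [token_predicate(columns, lo, hi) for lo, hi in ranges]
```

Each predicate can be handed to a separate task, so every task runs the same Lucene filter but only over its own slice of the ring.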

Paging friendly: it supports starting queries at a certain point
filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;

Figure: the client submits a job to the Spark master, which reads from each C* node through its Lucene index.
• Compute large amounts of data
• Avoid systematic full scan
• Reduces the amount of data to be processed
• Filtering push-down

WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN

Index performance in Spark
Figure: time versus records returned, comparing full scan with Lucene 2i.

Stratio Deep
INTEGRATING SPARK WITH DIFFERENT DATASTORES
• Common Cell abstraction in the RDD
• Maintain compatibility with Spark operations
• Compatible with multiple datastore technologies
• DeepSparkContext
• DeepJobConfig
• Compatible with Lucene indexes

Stratio Crossdata
UNIFYING BATCH AND STREAMING QUERIES
• Single SQL-like language
• Compatible with multiple datastore technologies
• Connector-based architecture
• Ability to combine data from different datastores
• Complements non-native operations with Spark
• E.g., JOIN in Cassandra
• Custom support for Lucene-based secondary indexes

Stratio Crossdata
CREATING INDEXES
CREATE FULLTEXT INDEX ON app.users(name, email);
QUERYING THE INDEXES
SELECT * FROM app.users
WHERE email MATCH '*@stratio.com';

Conclusions
• Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
• Top-k query support
- Lucene scoring formula
- Sort by field values
• Compatible with MapReduce frameworks
• Preserves Cassandra’s functionality

It's open source
github.com/stratio/stratio-cassandra
• Published as a fork of Apache Cassandra
• Apache License Version 2.0
stratio.github.io/crossdata
• Apache License Version 2.0

