1. Advanced search and
Top-K queries in Cassandra
Andrés de la Peña
andres@stratio.com
@a_de_la_pena
Apache Cassandra Meetup 2015
2. Who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 70+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
3. CONTENTS
1) Introduction to Cassandra
2) Cassandra query methods
3) Stratio Lucene-based 2i implementation
4) Integrating Lucene 2i with Apache Spark
4. Apache Cassandra overview
• Tunable consistency: tradeoffs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.
• Incremental scalability: nodes added to a cluster increase throughput in a predictable, linear fashion.
• The best of Dynamo & Bigtable: combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.
• Decentralized: P2P architecture without a master node or single point of failure.
7. Primary key queries
• O(1) node lookup for partition key
• Range slices for clustering key
• Usually requires denormalization
[Diagram: the CLIENT queries by partition key and clustering key range. Partitions (apena, aagea, dhiguero) are spread over Node 1, Node 2 and Node 3; inside each partition the cells are ordered by clustering key, e.g. 2014-04-06:body, 2014-04-07:body, 2014-04-08:body.]
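A minimal Python sketch of the two access paths on this slide, not Cassandra's actual implementation. The class name ToyCluster, the MD5-modulo placement and the sample rows are assumptions made for illustration; Cassandra itself places partitions on a token ring.

```python
import bisect
import hashlib


class ToyCluster:
    """Toy model: partitions are hashed to nodes, rows stay sorted per partition."""

    def __init__(self, node_count):
        # One dict per node: partition key -> (sorted clustering keys, values).
        self.nodes = [{} for _ in range(node_count)]

    def node_for(self, partition_key):
        # Hashing the partition key picks the owning node without any search.
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def insert(self, partition_key, clustering_key, value):
        node = self.nodes[self.node_for(partition_key)]
        keys, values = node.setdefault(partition_key, ([], []))
        position = bisect.bisect_left(keys, clustering_key)
        keys.insert(position, clustering_key)
        values.insert(position, value)

    def range_slice(self, partition_key, start, end):
        # Only the owning node is consulted; binary search finds the slice bounds.
        node = self.nodes[self.node_for(partition_key)]
        keys, values = node.get(partition_key, ([], []))
        low = bisect.bisect_left(keys, start)
        high = bisect.bisect_right(keys, end)
        return list(zip(keys[low:high], values[low:high]))


cluster = ToyCluster(node_count=3)
cluster.insert("apena", "2014-04-07", "To think and...")
cluster.insert("apena", "2014-04-06", "To study and...")
cluster.insert("apena", "2014-04-10", "When you..")
print(cluster.range_slice("apena", "2014-04-06", "2014-04-08"))
```

Any query that does not start from the partition key has no such shortcut, which is why this access path usually forces denormalization.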
9. Secondary index queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
[Diagram: the CLIENT queries the C* nodes; each C* node keeps its own 2i local column family.]
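The idea of a per-node inverted index can be sketched in Python as follows. This is an illustrative model only: the Node class, query_by_value and the sample rows are hypothetical names, not Cassandra code.

```python
from collections import defaultdict


class Node:
    """A node that indexes only the rows it stores itself."""

    def __init__(self):
        self.rows = {}                 # primary key -> row
        self.index = defaultdict(set)  # indexed value -> primary keys (inverted index)

    def insert(self, key, row, indexed_column):
        self.rows[key] = row
        self.index[row[indexed_column]].add(key)

    def lookup(self, value):
        # Local lookup: one dictionary access, then fetch the matching rows.
        return [self.rows[key] for key in sorted(self.index.get(value, ()))]


def query_by_value(nodes, value):
    # The index is local to each node, so matches can live anywhere:
    # the query may have to visit every node.
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


nodes = [Node(), Node()]
nodes[0].insert(1, {"user": "apena", "body": "To study and..."}, "user")
nodes[1].insert(2, {"user": "apena", "body": "When you.."}, "user")
nodes[1].insert(3, {"user": "aagea", "body": "The cautious..."}, "user")
print(query_by_value(nodes, "apena"))
```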
11. Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
[Diagram: the CLIENT submits query = function(all data) to the Spark master, which reads from every C* node.]
12. Combining 2i with MapReduce
• Expressiveness avoiding full scans
• Still limited by one indexed column per query
[Diagram: CLIENT, Spark master and three C* nodes, each node with its own secondary index.]
13. MORE EXPRESSIVENESS
What do we miss from 2i indexes?
• Range queries
• Multivariable search
• Full text search
• Sorting by fields
• Top-k queries
14. ITS ARCHITECTURE
What do we like from the existing 2i?
• Each node indexes its own data
• The index implementations do not need to be distributed
• Can be created after design and ingestion
• Natural extension point
15. Thinking of a custom secondary index implementation…
WHY NOT USE LUCENE?
16. Why we like Lucene
• Proven stable and fast indexing solution
• Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
• Mature distributed search solutions built on top of it
- Solr, ElasticSearch
• Can be fully embedded in application code
• Published under the Apache License
18. Create index
• Can be built in the background at any moment
• Real-time updates
• Mapping eases ETL
• Language aware
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id)
);
-- Extra column that the custom index is attached to and that queries are written against
ALTER TABLE tweets ADD lucene text;
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message   : {type : "text", analyzer : "EnglishAnalyzer"},
            userid    : {type : "string"},
            username  : {type : "string"}
        }
    }'
};
19. Filtering query
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 10;
[Diagram: the CLIENT searches for 10 rows across the C* nodes, each with its own Lucene index. One node finds 6, another finds 4. We are done!]
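The flow in the diagram (search 10, found 6, found 4, done) can be sketched in Python as below. It assumes nodes are asked one after another until the LIMIT is met; filtered_search and the sample data are hypothetical, and the real coordinator logic is not shown on the slide.

```python
def filtered_search(nodes, predicate, limit):
    """Collect up to `limit` matching rows, visiting nodes one at a time."""
    results = []
    visited = 0
    for rows in nodes:
        if len(results) >= limit:
            break  # enough rows collected; remaining nodes are never asked
        visited += 1
        remaining = limit - len(results)
        matches = [row for row in rows if predicate(row)]
        results.extend(matches[:remaining])
    return results, visited


# Three nodes holding 6, 4 and 3 matching rows respectively.
nodes = [
    ["cassandra tweet %d" % i for i in range(6)] + ["other"],
    ["cassandra tweet %d" % i for i in range(6, 10)],
    ["cassandra tweet %d" % i for i in range(10, 13)],
]
rows, visited = filtered_search(nodes, lambda row: "cassandra" in row, limit=10)
print(len(rows), visited)
```

Because a filter has no relevance order, any 10 matching rows are a valid answer, so the search can stop as soon as the limit is reached.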
20. Top-k query
SELECT * FROM tweets WHERE lucene = '{
    query : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 5;
[Diagram: the CLIENT sends "search top-5" to every C* node. The nodes' Lucene indexes find 5, 4 and 5 rows, and the 14 candidates are merged to the best 5.]
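The "merge 14 to best 5" step can be sketched in Python with heapq. This is a simplified model assuming each row already carries a relevance score; local_top_k, merged_top_k and the scores are illustrative, not the index's actual code.

```python
import heapq


def local_top_k(scored_rows, k):
    # Each node returns only its own k best rows, as (score, row) pairs.
    return heapq.nlargest(k, scored_rows, key=lambda pair: pair[0])


def merged_top_k(nodes, k):
    # Every node must be asked: the global best rows could be on any of them.
    candidates = []
    for scored_rows in nodes:
        candidates.extend(local_top_k(scored_rows, k))
    # At most k rows per node arrive here; keep the k best overall.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


nodes = [
    [(0.91, "a"), (0.40, "b"), (0.35, "c"), (0.30, "d"), (0.20, "e"), (0.10, "f")],
    [(0.95, "g"), (0.85, "h"), (0.15, "i"), (0.05, "j")],
    [(0.70, "k"), (0.65, "l"), (0.60, "m"), (0.55, "n"), (0.50, "o")],
]
print(merged_top_k(nodes, k=5))
```

Unlike the filtering case, a top-k search cannot stop early, since relevance is only comparable once every node has contributed its best candidates.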
21. Queries can be as complex as you want
SELECT * FROM tweets WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should : [
                {type : "prefix", field : "user", value : "a"},
                {type : "wildcard", field : "user", value : "*b*"},
                {type : "match", field : "user", value : "fast"}
            ]}
        ]
    },
    sort : {
        fields : [ {field : "time", reverse : true},
                   {field : "user", reverse : false} ]
    }
}' LIMIT 10000;
22. NO MAINTENANCE REQUIRED
Some implementation details
• A Lucene document per CQL row, and a Lucene field per indexed column
• SortingMergePolicy keeps index sorted in the same way that C* does
• Index commits synchronized with column family flushes
• Segment merges synchronized with column family compactions
24. Integrating Lucene & Spark
Split friendly: it supports searches within a token range
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
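A rough Python sketch of how a batch framework might turn the token ring into one such query per split. It assumes the Murmur3Partitioner token range and that token() is applied to the partition key userid; token_splits and split_query are hypothetical helpers, not part of any Stratio or Spark API.

```python
MIN_TOKEN = -2**63      # assumed Murmur3Partitioner lower bound
MAX_TOKEN = 2**63 - 1   # assumed Murmur3Partitioner upper bound


def token_splits(split_count):
    """Cut the full token range into contiguous (start, end] intervals."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // split_count for i in range(split_count)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(lucene_json, start, end, limit=10000):
    """Build the CQL text for one split; each worker runs its own statement."""
    template = ("SELECT * FROM tweets WHERE lucene = '{}' "
                "AND TOKEN(userid) > {} AND TOKEN(userid) <= {} LIMIT {};")
    return template.format(lucene_json, start, end, limit)


search = '{filter : {type : "match", field : "text", value : "cassandra"}}'
for start, end in token_splits(4):
    print(split_query(search, start, end))
```

Each split touches only the rows that both fall in its token interval and match the Lucene filter, so workers never scan rows the search would discard.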
25. Integrating Lucene & Spark
Paging friendly: it supports starting queries at a certain point
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
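Client-side paging on top of such a query could look like the Python sketch below. It assumes the caller provides a fetch_page(after, limit) function that runs the statement with createdAt > after; fetch_all, fetch_page and the in-memory table are all hypothetical stand-ins for a real driver call.

```python
def fetch_all(fetch_page, page_size):
    """Read every matching row, restarting each page after the last row seen."""
    rows = []
    last_created_at = None  # None means "start from the beginning"
    while True:
        page = fetch_page(after=last_created_at, limit=page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows  # a short page means there is nothing left
        last_created_at = page[-1]["createdAt"]


# In-memory stand-in for one partition, ordered by clustering key.
table = [{"createdAt": day, "message": "cassandra %d" % day} for day in range(1, 8)]


def fetch_page(after, limit):
    matching = [row for row in table if after is None or row["createdAt"] > after]
    return matching[:limit]


print(len(fetch_all(fetch_page, page_size=3)))
```

The sketch relies on rows inside a partition coming back in clustering key order, so the last row of a page is a safe restart point.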
26. Integrating Lucene & Spark
• Compute large amounts of data
• Avoid systematic full scan
• Reduces the amount of data to be processed
• Filtering push-down
[Diagram: CLIENT, Spark master and three C* nodes, each node with its own Lucene index.]
30. Conclusions
• Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
• Top-k query support
- Lucene scoring formula
- Sort by field values
• Compatible with MapReduce frameworks
• Preserves Cassandra’s functionality
31. It's open source
github.com/stratio/stratio-cassandra
• Published as fork of Apache Cassandra
• Apache License Version 2.0
stratio.github.io/crossdata
• Apache License Version 2.0
github.com/stratio/deep-spark
• Apache License Version 2.0
32. Advanced search and
Top-K queries in Cassandra
Andrés de la Peña
andres@stratio.com
@a_de_la_pena