Advanced search and Top-K queries in Cassandra

Andrés de la Peña
andres@stratio.com
@a_de_la_pena
Apache Cassandra Meetup 2015

Who are we?
•  Stratio is a Big Data Company
•  Founded in 2013
•  Commercially launched in 2014
•  70+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution

CONTENTS
1  Introduction to Cassandra
2  Cassandra query methods
3  Stratio Lucene based 2i implementation
4  Integrating Lucene 2i with Apache Spark

Apache Cassandra overview

The best of Dynamo & Bigtable
Combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.

Tunable consistency
Tradeoffs between consistency and latency are tunable. C* favors availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.

Incremental scalability
Nodes added to a cluster increase throughput in a predictable, linear fashion.

Decentralized
P2P architecture with no master node and no single point of failure.
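
The tunable consistency tradeoff can be made concrete with the usual replica-overlap rule: with replication factor N, a read of R replicas is guaranteed to include the latest acknowledged write to W replicas when R + W > N. The sketch below is only an illustration of that arithmetic; the function names are made up for this example and are not part of any Cassandra API.

```python
def quorum(n):
    """Size of a majority of n replicas."""
    return n // 2 + 1


def overlaps(n, r, w):
    """True when any r read replicas must intersect any w written replicas."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


N = 3  # replication factor (example value)
print(overlaps(N, quorum(N), quorum(N)))  # QUORUM reads with QUORUM writes
print(overlaps(N, 1, 1))                  # ONE reads with ONE writes
```

Lower R and W reduce latency at the cost of possibly reading stale data, which is the tunable part of the tradeoff.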

Apache Cassandra operators

[Figure: primary key, secondary indexes and token ranges compared by throughput and expressiveness]

Primary key queries
•  O(1) node lookup for partition key
•  Range slices for clustering key
•  Usually requires denormalization
[Figure: a CLIENT query by partition key (apena) and clustering key range is routed to one of three nodes; sample partitions aagea, dhiguero and apena hold date-keyed columns such as 2014-04-10:body]
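
The two access paths on this slide can be sketched in a few lines: the partition key is hashed to a token that identifies the owning node without consulting any other node, and inside the partition the rows are sorted by clustering key, so a range is a contiguous slice. This is a toy model under stated assumptions: the MD5-based `token` function, the `Ring` class and the sample rows are hypothetical stand-ins, not Cassandra's partitioner or storage engine.

```python
import bisect
import hashlib


def token(partition_key):
    # Stand-in hash (assumption): first 8 bytes of MD5 as an unsigned integer.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, node_tokens):
        # Each node owns the tokens up to and including its own token.
        self._entries = sorted((t, name) for name, t in node_tokens.items())
        self._tokens = [t for t, _ in self._entries]

    def owner(self, partition_key):
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(partition_key))
        return self._entries[i % len(self._entries)][1]


def range_slice(partition, start, end):
    # partition is a list of (clustering_key, value) sorted by clustering key.
    keys = [k for k, _ in partition]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return partition[lo:hi]


ring = Ring({"Node 1": 2**64 // 3, "Node 2": 2 * 2**64 // 3, "Node 3": 2**64 - 1})
apena = [("2014-04-06", "a"), ("2014-04-07", "b"), ("2014-04-08", "c"),
         ("2014-04-10", "d"), ("2014-04-11", "e")]  # illustrative rows

print(ring.owner("apena"))
print(range_slice(apena, "2014-04-07", "2014-04-10"))
```

Anything that is not a partition key lookup or a clustering key slice has no direct path in this model, which is why tables are usually denormalized per query.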


Secondary index queries
•  Inverted index
•  Mitigates denormalization
•  Queries may involve all C* nodes
•  Queries limited to a single column
[Figure: a CLIENT query fans out to three C* nodes, each with its own local 2i column family]
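
A rough model of why these queries may touch every node: each node keeps an inverted index (value to row keys) over its own rows only, so the coordinator cannot tell in advance which nodes hold matches. The `Node` class and `query` function below are hypothetical names for this illustration and do not mirror Cassandra's internal classes.

```python
from collections import defaultdict


class Node:
    """One node: its rows plus a local inverted index on a single column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # primary key -> row dict
        self.index = defaultdict(set)  # column value -> primary keys

    def insert(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            # Drop the stale index entry when a row is overwritten.
            self.index[old[self.indexed_column]].discard(key)
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def lookup(self, value):
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query(nodes, value):
    # Matches can live anywhere, so every node is asked.
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


nodes = [Node("username") for _ in range(3)]
nodes[0].insert(1, {"username": "apena", "message": "first"})
nodes[1].insert(2, {"username": "aagea", "message": "second"})
nodes[2].insert(3, {"username": "apena", "message": "third"})
print(query(nodes, "apena"))
```

The index answers equality on one column only; a second condition would have to be checked against the rows after the lookup.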


Token range queries
•  Used by MapReduce frameworks such as Hadoop or Spark
•  All kinds of queries are possible
•  Low throughput
•  Ad-hoc queries
•  Batch processing
•  Materialized views
query = function(all data)
[Figure: CLIENT and Spark master reading from three C* nodes]
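
"query = function(all data)" can be read literally: every row of every token range is fetched and passed through arbitrary code, then the partial results are combined. The sketch below imitates that with plain Python lists standing in for token ranges; `full_scan` and the sample rows are hypothetical, and a real job would run one task per range in Hadoop or Spark.

```python
from collections import Counter


def full_scan(token_ranges, map_fn, reduce_fn, initial):
    result = initial
    for rows in token_ranges:   # in a real job, one task per token range
        for row in rows:        # every row is read, matching or not
            result = reduce_fn(result, map_fn(row))
    return result


# Three token ranges with illustrative rows.
ranges = [
    [{"username": "apena"}, {"username": "aagea"}],
    [{"username": "apena"}],
    [{"username": "dhiguero"}, {"username": "apena"}],
]

per_user = full_scan(
    ranges,
    map_fn=lambda row: Counter([row["username"]]),
    reduce_fn=lambda a, b: a + b,
    initial=Counter(),
)
print(per_user["apena"])
```

Any function can be plugged in, which is where the expressiveness comes from; the cost is that the work grows with the size of the table rather than with the size of the answer.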

Combining 2i with MapReduce
•  Expressiveness without full scans
•  Still limited to one indexed column per query
[Figure: CLIENT and Spark master query three C* nodes, each through its local secondary index]

MORE EXPRESSIVENESS
What do we miss from 2i indexes?
•  Range queries
•  Multivariable search
•  Full text search
•  Sorting by fields
•  Top-k queries

ITS ARCHITECTURE
What do we like from the existing 2i?
•  Each node indexes its own data
•  The index implementations do not need to be distributed
•  Can be created after design and ingestion
•  Natural extension point

Thinking about a custom secondary index implementation…
WHY NOT USE LUCENE?

Why we like Lucene
•  Proven stable and fast indexing solution
•  Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
- Solr, ElasticSearch
•  Can be fully embedded in application code
•  Published under the Apache License

HOW IT WORKS

Create index
•  Built in the background at any moment
•  Real time updates
•  Mapping eases ETL
•  Language aware

-- Base table
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id) );

-- Dummy column that the index is attached to and that queries are written against
ALTER TABLE tweets ADD lucene TEXT;

-- Lucene index with a per-column mapping
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message   : {type : "text", analyzer : "EnglishAnalyzer"},
            userid    : {type : "string"},
            username  : {type : "string"}
        }
    }'
};

Filtering query
SELECT * FROM tweets WHERE lucene = '{
    filter : {type  : "match",
              field : "text",
              value : "cassandra"}
}' LIMIT 10;
[Figure: the CLIENT searches for 10 rows across three C* nodes, each with a local Lucene index; two nodes find 6 and 4 rows, which satisfies the limit ("We are done!")]
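
The diagram for the filtering query suggests that the coordinator can stop as soon as the requested number of rows has been collected, because a filter has no ranking: any 10 matching rows are a valid answer. A minimal sketch of that early exit, assuming nodes are asked one after another (the real coordinator may contact several at once); `filter_query` and the sample data are hypothetical.

```python
def filter_query(nodes, predicate, limit):
    """Ask nodes one after another until `limit` rows are collected."""
    rows, contacted = [], 0
    for node_rows in nodes:
        if len(rows) >= limit:
            break  # enough rows already, remaining nodes are never asked
        contacted += 1
        matches = [r for r in node_rows if predicate(r)]
        rows.extend(matches[: limit - len(rows)])
    return rows, contacted


# Three nodes holding 6, 4 and 5 matching rows (plus one non-matching row).
nodes = [
    ["cassandra"] * 6 + ["spark"],
    ["cassandra"] * 4,
    ["cassandra"] * 5,
]
rows, contacted = filter_query(nodes, lambda r: "cassandra" in r, limit=10)
print(len(rows), contacted)
```

With these numbers the first two nodes already supply 6 + 4 = 10 rows, so the third is not contacted.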

Top-k query
SELECT * FROM tweets WHERE lucene = '{
    query : {type  : "match",
             field : "text",
             value : "cassandra"}
}' LIMIT 5;
[Figure: the CLIENT asks each of three C* nodes, each with a local Lucene index, for its top 5; they find 5, 4 and 5 rows, and the 14 candidates are merged down to the best 5]
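
Unlike a filter, a top-k query cannot stop early: the best-scored rows may sit on any node, so every node returns its local top k and the coordinator keeps the k best of the union. The sketch below reproduces the numbers in the diagram (5 + 4 + 5 = 14 candidates merged to 5) with made-up scores; `local_top_k` and `top_k` are hypothetical names, and scores stand in for Lucene relevance.

```python
import heapq


def local_top_k(scored_rows, k):
    # What a single node returns: its k best rows by score.
    return heapq.nlargest(k, scored_rows, key=lambda r: r[0])


def top_k(nodes, k):
    candidates = []
    for node in nodes:
        candidates.extend(local_top_k(node, k))
    # Coordinator-side merge of all local results.
    return heapq.nlargest(k, candidates, key=lambda r: r[0]), len(candidates)


# (score, row id) pairs per node; values are illustrative.
nodes = [
    [(0.9, "a"), (0.2, "b"), (0.5, "c"), (0.4, "d"), (0.1, "e"), (0.05, "f")],
    [(0.8, "g"), (0.7, "h"), (0.3, "i"), (0.15, "j")],
    [(0.95, "k"), (0.6, "l"), (0.35, "m"), (0.25, "n"), (0.12, "o")],
]
best, merged = top_k(nodes, 5)
print(merged)                       # 14
print([row for _, row in best])     # ['k', 'a', 'g', 'h', 'l']
```

Asking each node for only k rows is enough, because a row outside a node's local top k cannot be in the global top k.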

Queries can be as complex as you want
SELECT * FROM tweets WHERE lucene = '{
    filter :
    {
        type : "boolean", must :
        [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should :
                [
                    {type : "prefix", field : "user", value : "a"},
                    {type : "wildcard", field : "user", value : "*b*"},
                    {type : "match", field : "user", value : "fast"}
                ]
            }
        ]
    },
    sort :
    {
        fields : [ {field : "time", reverse : true},
                   {field : "user", reverse : false} ]
    }
}' LIMIT 10000;
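
To read the nested condition above: `must` clauses all have to hold, and a `boolean` made only of `should` clauses holds when at least one of them does. The toy evaluator below applies those rules to in-memory rows so the semantics can be tried out. It is a sketch under assumptions, not the index's implementation: the `matches` function is hypothetical, the range bound is treated as inclusive, `match` is reduced to plain equality, and the `should` behaviour follows Lucene's usual BooleanQuery rules.

```python
from fnmatch import fnmatchcase


def matches(cond, row):
    kind = cond["type"]
    if kind == "boolean":
        must = cond.get("must", [])
        should = cond.get("should", [])
        if not all(matches(c, row) for c in must):
            return False
        if should and not must:
            # Only optional clauses: at least one has to match.
            return any(matches(c, row) for c in should)
        return True
    value = row[cond["field"]]
    if kind == "range":
        return value >= cond["lower"]      # assumption: inclusive lower bound
    if kind == "prefix":
        return value.startswith(cond["value"])
    if kind == "wildcard":
        return fnmatchcase(value, cond["value"])
    if kind == "match":
        return value == cond["value"]      # simplification: no text analysis
    raise ValueError(f"unsupported condition type: {kind}")


condition = {
    "type": "boolean",
    "must": [
        {"type": "range", "field": "time", "lower": "2014/04/25"},
        {"type": "boolean", "should": [
            {"type": "prefix", "field": "user", "value": "a"},
            {"type": "wildcard", "field": "user", "value": "*b*"},
            {"type": "match", "field": "user", "value": "fast"},
        ]},
    ],
}

print(matches(condition, {"time": "2014/04/26", "user": "bob"}))
print(matches(condition, {"time": "2014/04/20", "user": "alice"}))
```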

NO MAINTENANCE REQUIRED
Some implementation details
•  A Lucene document per CQL row, and a Lucene field per indexed column
•  SortingMergePolicy keeps the index sorted the same way C* does
•  Index commits synchronized with column family flushes
•  Segment merges synchronized with column family compactions

LUCENE AND SPARK

Integrating Lucene & Spark
Split friendly: it supports searches within a token range
SELECT * FROM tweets WHERE lucene = '{
    filter : {type:"match", field:"text", value:"cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
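
A MapReduce-style job can exploit this by cutting the token ring into contiguous ranges and issuing one index query per range, so that each task only receives matching rows from its own slice. The sketch below generates such splits and the corresponding statements. It assumes the Murmur3Partitioner token bounds and applies `TOKEN` to the partition key `userid` only; `token_splits` and `split_query` are hypothetical helpers, not part of any connector.

```python
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1  # Murmur3Partitioner token bounds


def token_splits(n):
    """Cut the ring into n contiguous (start, end] ranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * width for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(condition, start, end, limit=10000):
    # One statement per split; the Lucene condition is pushed down unchanged.
    return (
        f"SELECT * FROM tweets WHERE lucene = '{condition}' "
        f"AND TOKEN(userid) > {start} AND TOKEN(userid) <= {end} "
        f"LIMIT {limit};"
    )


condition = '{filter : {type:"match", field:"text", value:"cassandra"}}'
for start, end in token_splits(4):
    print(split_query(condition, start, end))
```

The half-open ranges never include `MIN_TOKEN` itself; a production connector would also handle that edge and align splits with the ranges each node actually owns.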

Paging friendly: it supports starting queries at a certain point
SELECT * FROM tweets WHERE lucene = '{
    filter : {type:"match", field:"text", value:"cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
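
Paging then amounts to re-issuing the query with the restriction moved forward to the last clustering key already seen. The loop below shows the idea on an in-memory partition sorted by clustering key; `fetch_page` plays the role of one query with `createdAt > <last seen>` and a `LIMIT`, and both function names and the rows are hypothetical.

```python
import bisect


def fetch_page(partition, after, page_size):
    """Rows with clustering key > `after`, at most `page_size` of them."""
    keys = [k for k, _ in partition]
    start = 0 if after is None else bisect.bisect_right(keys, after)
    return partition[start:start + page_size]


def iterate(partition, page_size):
    after = None
    while True:
        page = fetch_page(partition, after, page_size)
        if not page:
            return
        yield page
        after = page[-1][0]  # resume strictly after the last key seen


# Illustrative partition: (createdAt, message) sorted by createdAt.
partition = [("2011-02-03", "m1"), ("2011-02-04", "m2"), ("2011-02-05", "m3"),
             ("2011-02-06", "m4"), ("2011-02-07", "m5")]
for page in iterate(partition, page_size=2):
    print([message for _, message in page])
```

This only works without gaps or repeats when the resume key is unique within the partition, which is why the full clustering key is normally used as the cursor.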

[Figure: CLIENT and Spark master over three C* nodes, each with a local Lucene index]
•  Computes over large amounts of data
•  Avoids systematic full scans
•  Reduces the amount of data to be processed
•  Filtering push-down

WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN

Index performance in Spark
[Figure: time vs. records returned, comparing Full scan with Lucene 2i]

DEMO: Lucene indexes in C*

Conclusions
•  Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
•  Top-k query support
- Lucene scoring formula
- Sort by field values
•  Compatible with MapReduce frameworks
•  Preserves Cassandra’s functionality

It's open source
github.com/stratio/stratio-cassandra
•  Published as fork of Apache Cassandra
•  Apache License Version 2.0
stratio.github.io/crossdata
github.com/stratio/deep-spark
