Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra. With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module. Another feature we have provided with our plugin is the possibility of indexing bitemporal data models, which distinguish between system time and business time. This way, it is possible to make queries over C* such as “give me what system thought in a certain instant about what happened in another instant”. The implementation has been performed combining range prefix trees with the 4R-Tree approach exposed by Bliujūtė et al. Both full-text, geospatial and bitemporal queries can be combined with Apache Spark to avoid systematic full-scan, dramatically reducing the amount of data to be processed.
2. • Stratio is a Big Data Company
• Certified Spark distribution
• Founded in 2013
• 120+ employees in Madrid
• Offices in Madrid and San Francisco
#CassandraSummit 2015
WHO WE ARE
6. primary key
secondary indexes
token ranges
Cassandra query methods by use case
#CassandraSummit 2015 6
primary key
secondary indexes
token ranges
Real time Analytics
7. Cassandra query methods trade offs
#CassandraSummit 2015 7
• Pure-range queries limited to partition
• No Boolean logic
• No Full text search
• Sorting limited to partition
• Full-table scan
• High load
• High latency
• Low concurrency
primary key
secondary indexes
token ranges
primary key
secondary indexes
token ranges
Real time Analytics
8. A third use case
#CassandraSummit 2015 8
AnalyticsReal-time Search
• Not as fast as primary key queries
• Not as expressive as map reduce
• Search can be used for both cases
10. A Lucene based secondary index implementation
• Proven stable and fast indexing solution
• Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
• Mature distributed search solutions built on top of it
- Solr, ElasticSearch
• Just a small embeddable library
• Easily extensible
• Published under the Apache License
#CassandraSummit 2015 10
12. A Lucene based secondary index implementation
CLIENT
C*
node
C*
node
C*
node
Lucene
index
Lucene
index
Lucene
index
#CassandraSummit 2015 12
• Each node indexes its own data
• Keep P2P architecture
• Distribution and replication managed by C*
• Just a single pluggable JAR file
- CASSANDRA-8717
JVM
JVM
JVM
13. CREATE TABLE tweets (
id bigint,
created timestamp,
message text,
userid bigint,
username text,
PRIMARY KEY (userid, created,
id) );
Create index
• Built in the background in any moment
• Real time updates
• Mapping eases ETL
• Language aware
#CassandraSummit 2015 13
ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '10',
'schema' : ' fields : {
created : {type : "date", pattern : "yyyy-MM-dd"},
message : {type : "text", analyzer : "english"},
userid : {type : "string"},
username : {type : "string"} } '};
14. SELECT * FROM tweets WHERE lucene = '{
filter : {
type : "boolean",
must : [
{type : "range", field : "created_at", lower : "2015/01/01"},
{type : "wildcard", field : "user", value : "a*"}
],
not : [
{type : "match", field : "user", value : "andres"}
]
},
sort : {
fields: [
{field : "time", reverse : true},
{field : "user", reverse : false}
]
}
}' LIMIT 10000;
Searching for rows
#CassandraSummit 2015 14
15. Integrating Lucene & Spark
CLIENT
Spark
master
C*
node
C*
node
C*
node
Lucene
Lucene
Lucene
• Compute large amounts of data
• Filtering push-down
• Avoid systematic full scan
• Reduces the amount of data to be processed
#CassandraSummit 2015 15
16. Index performance in Spark
#CassandraSummit 2015 16
0
500
1000
1500
2000
2500
0 10 20 30 40 50 60 70 80 90 100
seconds
millions of collected rows
index
full scan
18. Lucene spatial module
• Spatial4J shapes
- Points, rectangles, circles, etc.
• Spatial search strategies
- BBox, RecursivePrefixTree, PointVector, etc.
• Not only geographical data
- Numbers, dates
• It can be combined with other searches
#CassandraSummit 2015 18
19. Indexing geographical locations
#CassandraSummit 2015 19
CREATE CUSTOM INDEX restaurants_idx
ON restaurants (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
location : {
type : "geo_point",
latitude : "lat",
longitude : "lon"
},
stars: {type : "integer" }
}
}
'};
CREATE TABLE restaurants(
name text PRIMARY KEY,
stars bigint,
lat double,
lon double);
• No native shape data types in CQL
• Many-to-one column mapping
• Just points. For now.
20. Bounding box search
#CassandraSummit 2015 20
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_bbox",
field : "location",
min_latitude : 40.425978,
max_latitude : 40.445886,
min_longitude : -3.808252,
max_longitude : -3.770999
}
}';
21. Distance search
#CassandraSummit 2015 21
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_distance",
field : "location",
latitude : 40.443270,
longitude : -3.800498,
min_distance : "100m",
max_distance : "2km"
}
}';
22. Combining geospatial searches
#CassandraSummit 2015 22
SELECT * FROM restaurants WHERE lucene =
'{ filter : {
type : "boolean",
must : [
{
type : "geo_distance",
field : "location",
latitude : 40.443270,
longitude : -3.800498,
max_distance : "10km"
},
{
type : "range",
field : "stars",
lower : 2,
upper : 4
}
] } }';
23. Lucene spatial is not only geospatial…
#CassandraSummit 2015 23
• General geometry
• Numeric ranges
- NumberRangePrefixTree
• Date ranges/durations
- DateRangePrefixTree
24. Temporal/Date durations
#CassandraSummit 2015 24
• A pair composed by a start-date and a stop-date
- Can be indexed as points in a 2D space
• David Smiley's DateRangePrefixTree
- Levels for common date-ranges: years, months, days…
- Spatial operations: intersects, is_within, contains
27 Nov 2015 29 Dec 2015
intersects
is - within
contains
25. Indexing date ranges
#CassandraSummit 2015 25
CREATE CUSTOM INDEX breakdowns_idx
ON breakdowns (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
duration: {
type : "date_range",
from : "start_date",
to : "stop_date",
pattern : "yyyy-MM-dd"
},
cause: {type : "string" }
}
}
'};
CREATE TABLE breakdowns (
system text PRIMARY KEY,
cause text,
start_date timestamp,
stop_date timestamp);
• No native date range type in CQL
• Many-to-one column mapping
• Spatial operations
26. Searching for date ranges
#CassandraSummit 2015 26
SELECT * FROM breakdowns
WHERE lucene =
'{
filter :
{
type : "date_range",
field : "duration",
from : "2015-01-01",
to : "2015-01-05",
operation : "intersects"
}
}';
SELECT * FROM users
WHERE lucene =
'{ filter : {
type : "boolean",
must : [
{
type : "date_range",
field : "duration",
from : "2015-01-01",
to : "2015-01-05",
operation : "is_within"
},
{
type : "match",
field : "cause",
value : "human error"
}
] } }';
28. The bitemporal data model
#CassandraSummit 2015 28
• Stores WHAT and WHEN
• Support for corrections.
• Reproducible business perspective history at a point of time.
• Trace why a decision was made.
29. The bitemporal data model
#CassandraSummit 2015 29
• Valid Time
- The application period
- WHAT happened, the real time fact period
• Transaction Time
- The system period
- WHEN the system consider it true
30. The bitemporal data model: example
#CassandraSummit 2015 30
person city vt_from vt_to tt_from tt_to
John Smallville 3-Apr-1975 ∞ 4-Apr-1975 26-Dec-1994
John Smallville 3-Apr-1975 25-Aug-1994 27-Dec-1994 ∞
John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001
John Bigtown 26-Aug-1994 30-May-1995 2-Feb-2001 ∞
John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞
John Bigtown 3-Sep-2000 ∞ 2-Feb-2001 31-Mar-2001
John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞
Modified example from Wikipedia
https://en.wikipedia.org/wiki/Temporal_database
31. A naïve approach
#CassandraSummit 2015 31
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
vt_from : { type : "date", pattern : "yyyyMMdd" },
vt_to : { type : "date", pattern : "yyyyMMdd" },
tt_from : { type : "date", pattern : "yyyyMMdd" },
tt_to : { type : "date", pattern : "yyyyMMdd" }
}} '};
Using 4 dates
32. A naive approach
#CassandraSummit 2015 32
SELECT * FROM census WHERE lucene =
'{ filter : { type : "boolean",
must : [
should : [
{ type : "range", field : "vt_from", lower : "", upper : "",
include_lower=true, include_upper=true },
{ type : "range", field : "vt_to", lower : "", upper : "",
include_lower=true, include_upper=true },
must : [
{ type : "range", field : "vt_from", upper : "", include_upper=true},
{ type : "range", field : "vt_to", lower : "", include_lower=true}]
],
should : [
{ type : "range", field : “tt_from", lower : "", upper : "",
include_lower=true, include_upper=true },
{ type : "range", field : “tt_to", lower : "", upper : "",
include_lower=true, include_upper=true },
must : [
{ type : "range", field : “tt_from", upper : "", include_upper=true},
{ type : "range", field : “tt_to", lower : "", include_lower=true}
]
]
] } }' AND person = 'John Doe';
33. A naive approach: Issues
#CassandraSummit 2015 33
• Very difficult to understand/build the query.
• Now value (∞) using Long.MAX_VALUE is costly.
34. A spatial approach
#CassandraSummit 2015 34
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema' : '{ fields : {
vt: {
type : "date_range", pattern : "yyyyMMdd",
from : "vt_from", to : "vt_to"
},
tt: {
type : "date_range", pattern : "yyyyMMdd",
from : "tt_from", to : "tt_to"
},
} } '};
Using 2 date ranges
35. A spatial approach
#CassandraSummit 2015 35
SELECT * FROM users WHERE lucene =
'{ filter : {
type : "boolean",
must : [
{
type : "date_range", field : "vt",
from : "20150501", to : "99999999",
operation : "intersects"
},
{
type : "date_range", field : "tt",
from : "20150501", to : "9999999999",
operation : "intersects"
}
] } }';
36. A spatial approach: performance issues
#CassandraSummit 2015 36
• Very difficult to understand/build the query.
• Now value (∞) using Long.MAX_VALUE is costly.
37. 4R-Tree to the rescue
#CassandraSummit 2015 37
• Based on
Bliujute, R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of
general bitemporal data
• The Now Value is never stored.
• The data is stored in 4 R-Trees.
• Queries are transformed and distributed among the trees.
39. 4R-Tree to the rescue: searching data
#CassandraSummit 2015 39
IF (TT_FROM!=NOW) && (TT_TO >= VT_FROM):
searchR1(0, TT_TO, 0,VT_TO) U
searchR2(0, TT_TO, VT_FROM,VT_TO) U
searchR3(max(TT_FROM,VT_FROM),TT_TO,0,VT_TO)U
searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO)
IF (TT_FROM!=NOW) && (TT_TO < VT_FROM):
searchR2(0, TT_TO, VT_FROM,VT_TO) U
searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO)
IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO >= VT_FROM):
searchR1(0, TT_TO, 0,VT_TO) U searchR2(0, TT_TO, VT_FROM,VT_TO)
IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO < VT_FROM):
searchR2(0, TT_TO, VT_FROM,VT_TO)
IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]=[0,MAX]):
R1 U R2
40. 4R-Tree to the rescue:
#CassandraSummit 2015 40
• Problem!!! Lucene does not have support for R-Tree
• Our Solution:
- Use 2 DateRangePrefixTrees for each R-Tree
• Future Work: Experiment with other Lucene spatial trees and strategies.
41. The bitemporal data model: example
#CassandraSummit 2015 41
Modified example from Wikipedia
https://en.wikipedia.org/wiki/Temporal_database
person city vt_from vt_to tt_from tt_to
John Smallville 3-Apr-1975 ∞ 4-Apr-1975 26-Dec-1994
John Smallville 3-Apr-1975 25-Aug-1994 27-Dec-1994 ∞
John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001
John Bigtown 26-Aug-1994 30-May-1995 2-Feb-2001 ∞
John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞
John Bigtown 3-Sep-2000 ∞ 2-Feb-2001 31-Mar-2001
John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞
42. Indexing bitemporal data
#CassandraSummit 2015 42
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING
'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema' : '{
fields : {
bitemporal : {
type : "bitemporal",
vt_from : "vt_from",
vt_to : "vt_to",
tt_from : "vt_from",
tt_to : "tt_to",
pattern : "yyyyMMdd"
now_value : "99999999"
},
city : { type : "string" }
}
} '};
CREATE TABLE census (
person text,
city text,
vt_from text,
vt_to text,
tt_from text,
tt_to text,
lucene text,
PRIMARY KEY((person),vt_from,tt_from)
);
43. Searching for bitemporal data, several queries
#CassandraSummit 2015 43
SELECT * FROM users WHERE lucene =
'{
filter :
{
type : "bitemporal",
field : "bitemporal",
vt_from : "99999999",
vt_to : "99999999",
tt_from : "99999999",
tt_to : "99999999"
}
}' AND person = 'John Doe';
Where does the system currently
think that John lives right now?
person city vt_from vt_to tt_from tt_to
John Mediumtown 1-Apr-2001 ∞ 1-Apr-2001 ∞
44. Searching for bitemporal data
#CassandraSummit 2015 44
person city vt_from vt_to tt_from tt_to
John Beachy 1-Jun-1995 3-Sep-2000 2-Feb-2001 ∞
Where does the system currently
think that John lived in 1999?
SELECT * FROM users WHERE lucene =
'{
filter :
{
type : "bitemporal",
field : "bitemporal",
vt_from : "19990101",
vt_to : "19991231",
tt_from : "99999999",
tt_to : "99999999"
}
}' AND person = 'John Doe';
45. #CassandraSummit 2015 45
On 01-Jan-2000, where did the
system think John was living back in
1999?
SELECT * FROM users WHERE lucene =
'{
filter :
{
type : "bitemporal",
field : "bitemporal",
vt_from : "19990101",
vt_to : "19991231",
tt_from : “20000101",
tt_to : “20000101"
}
}' AND person = 'John Doe';
person city vt_from vt_to tt_from tt_to
John Bigtown 26-Aug-1994 ∞ 27-Dec-1994 1-Feb-2001
Searching for bitemporal data
46. #CassandraSummit 2015 46
SELECT * FROM users WHERE lucene =
'{
filter :
{
type : "boolean",
must : [
{ type : "bitemporal", field : "bitemporal",
vt_from : "99999999", vt_to : "99999999",
tt_from : "99999999", tt_to : "99999999"
},
{ type : "match",
field : "city",
value : "smallville"}
]}
}}';
Who currently lives at Smallville?
Searching for bitemporal data