Eduardo Alonso
eduardoalonso@stratio.com
Andrés de la Peña
andres@stratio.com
GEOSPATIAL
AND BITEMPORAL
SEARCH IN C* WITH PLUGGABLE LUCENE INDEX
@a_de_la_pena @eAlonsoDB
WHO WE ARE
•  Stratio is a Big Data Company
•  Certified Spark distribution
•  Founded in 2013
•  120+ employees in Madrid
•  Offices in Madrid and San Francisco
#CassandraSummit 2015
CONTENTS
1. Pluggable Lucene based 2i
2. Geospatial Search
3. Bitemporal Indexes
PLUGGABLE
LUCENE 2i
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges placed on throughput and expressiveness axes]

Cassandra query methods by use case
[Figure: primary key, secondary indexes and token ranges, with the use cases "Real time" and "Analytics"]

Cassandra query methods trade-offs
•  Pure-range queries limited to partition
•  No Boolean logic
•  No full-text search
•  Sorting limited to partition
•  Full-table scan
•  High load
•  High latency
•  Low concurrency

A third use case
[Figure: use cases "Real-time", "Search" and "Analytics"]
•  Not as fast as primary key queries
•  Not as expressive as map reduce
•  Search can be used for both cases
CQL + Lucene
A Lucene based secondary index implementation
•  Proven stable and fast indexing solution
•  Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
- Solr, ElasticSearch
•  Just a small embeddable library
•  Easily extensible
•  Published under the Apache License
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges mapped to the real-time, search and analytics use cases]
•  Real time: low expressiveness, low latency, low load
•  Search: mid expressiveness, mid latency, mid load
•  Analytics: high expressiveness, high latency, high load
A Lucene based secondary index implementation
[Figure: a client connected to three C* nodes; each node runs in its own JVM together with its Lucene index]
•  Each node indexes its own data
•  Keep P2P architecture
•  Distribution and replication managed by C*
•  Just a single pluggable JAR file
- CASSANDRA-8717
Create index
•  Can be built in the background at any moment
•  Real-time updates
•  Mapping eases ETL
•  Language aware

CREATE TABLE tweets (
    id bigint,
    created timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, created, id)
);

ALTER TABLE tweets ADD lucene TEXT;

CREATE CUSTOM INDEX tweets_idx ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '10',
    'schema' : '{
        fields : {
            created  : {type : "date", pattern : "yyyy-MM-dd"},
            message  : {type : "text", analyzer : "english"},
            userid   : {type : "string"},
            username : {type : "string"}
        }
    }'
};
Searching for rows

SELECT * FROM tweets WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "created", lower : "2015-01-01"},
            {type : "wildcard", field : "username", value : "a*"}
        ],
        not : [
            {type : "match", field : "username", value : "andres"}
        ]
    },
    sort : {
        fields : [
            {field : "created", reverse : true},
            {field : "username", reverse : false}
        ]
    }
}' LIMIT 10000;
Integrating Lucene & Spark
[Figure: a client and a Spark master connected to three C* nodes, each with its own Lucene index]
•  Compute large amounts of data
•  Filtering push-down
•  Avoid systematic full scan
•  Reduces the amount of data to be processed
Index performance in Spark
[Figure: time in seconds (axis 0 to 2500) against millions of collected rows (axis 0 to 100), comparing index and full scan]
SPATIAL SEARCH
Lucene spatial module
•  Spatial4J shapes
-  Points, rectangles, circles, etc.
•  Spatial search strategies
-  BBox, RecursivePrefixTree, PointVector, etc.
•  Not only geographical data
-  Numbers, dates
•  It can be combined with other searches
Indexing geographical locations
•  No native shape data types in CQL
•  Many-to-one column mapping
•  Just points. For now.

CREATE TABLE restaurants (
    name text PRIMARY KEY,
    stars bigint,
    lat double,
    lon double,
    lucene text
);

CREATE CUSTOM INDEX restaurants_idx
ON restaurants (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            location : {
                type : "geo_point",
                latitude : "lat",
                longitude : "lon"
            },
            stars : {type : "integer"}
        }
    }'
};
Bounding box search
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_bbox",
field : "location",
min_latitude : 40.425978,
max_latitude : 40.445886,
min_longitude : -3.808252,
max_longitude : -3.770999
}
}';
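The geo_bbox filter keeps the rows whose point falls inside the latitude/longitude rectangle. A minimal Python sketch of that containment test, assuming inclusive bounds and a box that does not cross the antimeridian; `in_bbox` is a hypothetical helper written for illustration, not part of the plugin:

```python
def in_bbox(lat, lon, min_lat, max_lat, min_lon, max_lon):
    """Return True when the point lies inside the box (bounds assumed inclusive)."""
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Box taken from the query on this slide: (min_lat, max_lat, min_lon, max_lon)
BOX = (40.425978, 40.445886, -3.808252, -3.770999)

inside = in_bbox(40.443270, -3.800498, *BOX)   # a point within the box
outside = in_bbox(40.416775, -3.703790, *BOX)  # a point south-east of the box
```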
Distance search
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_distance",
field : "location",
latitude : 40.443270,
longitude : -3.800498,
min_distance : "100m",
max_distance : "2km"
}
}';
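The geo_distance filter keeps the points whose distance to the given location lies between min_distance and max_distance. The slides do not say how the plugin measures distance, so the Python sketch below assumes a great-circle (haversine) distance on a spherical Earth; `haversine_m` and `within_distance` are hypothetical names:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_distance(lat, lon, center_lat, center_lon, min_m, max_m):
    """Mimic the min_distance / max_distance pair of the filter (bounds inclusive)."""
    return min_m <= haversine_m(lat, lon, center_lat, center_lon) <= max_m


# Centre taken from the query on this slide; the candidate lies 0.01 degrees north
CENTER = (40.443270, -3.800498)
match = within_distance(40.453270, -3.800498, *CENTER, min_m=100, max_m=2000)
```

A point 0.01 degrees of latitude away is a little over 1.1 km from the centre, so it falls inside the 100 m to 2 km ring.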
Combining geospatial searches
SELECT * FROM restaurants WHERE lucene =
'{ filter : {
type : "boolean",
must : [
{
type : "geo_distance",
field : "location",
latitude : 40.443270,
longitude : -3.800498,
max_distance : "10km"
},
{
type : "range",
field : "stars",
lower : 2,
upper : 4
}
] } }';
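Because the search is an ordinary string compared against the lucene column, combined searches can also be assembled programmatically. A minimal Python sketch, assuming the plugin accepts standard JSON with quoted keys as well as the unquoted form used on the slides; `lucene_select` is a hypothetical helper:

```python
import json


def lucene_select(table, search, column="lucene"):
    """Build a CQL statement whose Lucene search is serialized as JSON.

    Single quotes are doubled so the JSON stays a valid CQL string literal.
    """
    payload = json.dumps(search).replace("'", "''")
    return f"SELECT * FROM {table} WHERE {column} = '{payload}';"


# Same conditions as the query on this slide
search = {
    "filter": {
        "type": "boolean",
        "must": [
            {"type": "geo_distance", "field": "location",
             "latitude": 40.443270, "longitude": -3.800498,
             "max_distance": "10km"},
            {"type": "range", "field": "stars", "lower": 2, "upper": 4},
        ],
    }
}
statement = lucene_select("restaurants", search)
```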
Lucene spatial is not only geospatial…
•  General geometry
•  Numeric ranges
-  NumberRangePrefixTree
•  Date ranges/durations
-  DateRangePrefixTree
Temporal/Date durations
•  A pair composed of a start date and a stop date
-  Can be indexed as points in a 2D space
•  David Smiley's DateRangePrefixTree
-  Levels for common date ranges: years, months, days…
-  Spatial operations: intersects, is_within, contains
[Figure: a date range from 27 Nov 2015 to 29 Dec 2015 illustrating the intersects, is_within and contains operations]
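The three operations can be read as plain interval predicates. A minimal Python sketch, assuming closed intervals and that is_within and contains are evaluated from the point of view of the indexed range (the usual Lucene spatial convention); the function names mirror the operation names but are otherwise hypothetical, and the two dates shown on the slide are reused as a sample indexed range:

```python
from datetime import date


def intersects(idx_from, idx_to, q_from, q_to):
    """The indexed range and the query range share at least one day."""
    return idx_from <= q_to and q_from <= idx_to


def is_within(idx_from, idx_to, q_from, q_to):
    """The indexed range lies completely inside the query range."""
    return q_from <= idx_from and idx_to <= q_to


def contains(idx_from, idx_to, q_from, q_to):
    """The indexed range completely covers the query range."""
    return idx_from <= q_from and q_to <= idx_to


# Sample indexed range: 27 Nov 2015 to 29 Dec 2015
indexed = (date(2015, 11, 27), date(2015, 12, 29))
```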
Indexing date ranges
•  No native date range type in CQL
•  Many-to-one column mapping
•  Spatial operations

CREATE TABLE breakdowns (
    system text PRIMARY KEY,
    cause text,
    start_date timestamp,
    stop_date timestamp,
    lucene text
);

CREATE CUSTOM INDEX breakdowns_idx
ON breakdowns (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            duration : {
                type : "date_range",
                from : "start_date",
                to : "stop_date",
                pattern : "yyyy-MM-dd"
            },
            cause : {type : "string"}
        }
    }'
};
Searching for date ranges
SELECT * FROM breakdowns
WHERE lucene =
'{
filter :
{
type : "date_range",
field : "duration",
from : "2015-01-01",
to : "2015-01-05",
operation : "intersects"
}
}';
SELECT * FROM breakdowns
WHERE lucene =
'{ filter : {
type : "boolean",
must : [
{
type : "date_range",
field : "duration",
from : "2015-01-01",
to : "2015-01-05",
operation : "is_within"
},
{
type : "match",
field : "cause",
value : "human error"
}
] } }';
INDEXING
BITEMPORAL
DATA
The bitemporal data model
•  Stores WHAT and WHEN
•  Support for corrections.
•  Reproducible business perspective history at a point of time.
•  Trace why a decision was made.
The bitemporal data model
•  Valid Time
- The application period
- WHAT happened, the real time fact period
•  Transaction Time
- The system period
- WHEN the system considers it true
The bitemporal data model: example
person  city        vt_from      vt_to        tt_from      tt_to
John    Smallville  3-Apr-1975   ∞            4-Apr-1975   26-Dec-1994
John    Smallville  3-Apr-1975   25-Aug-1994  27-Dec-1994  ∞
John    Bigtown     26-Aug-1994  ∞            27-Dec-1994  1-Feb-2001
John    Bigtown     26-Aug-1994  30-May-1995  2-Feb-2001   ∞
John    Beachy      1-Jun-1995   3-Sep-2000   2-Feb-2001   ∞
John    Bigtown     3-Sep-2000   ∞            2-Feb-2001   31-Mar-2001
John    Mediumtown  1-Apr-2001   ∞            1-Apr-2001   ∞
Modified example from Wikipedia
https://en.wikipedia.org/wiki/Temporal_database
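A bitemporal question fixes one period on each axis: a valid-time period (what happened) and a transaction-time period (when the system believed it). A minimal Python sketch over the example rows, using a brute-force scan purely to illustrate the semantics; it says nothing about how the index works. `NOW`, `overlaps` and `cities` are hypothetical names, and `date.max` is an assumed stand-in for ∞:

```python
from datetime import date

NOW = date.max  # assumption: date.max stands in for the open-ended value (∞)

# (city, vt_from, vt_to, tt_from, tt_to), copied from the example table
ROWS = [
    ("Smallville", date(1975, 4, 3), NOW, date(1975, 4, 4), date(1994, 12, 26)),
    ("Smallville", date(1975, 4, 3), date(1994, 8, 25), date(1994, 12, 27), NOW),
    ("Bigtown", date(1994, 8, 26), NOW, date(1994, 12, 27), date(2001, 2, 1)),
    ("Bigtown", date(1994, 8, 26), date(1995, 5, 30), date(2001, 2, 2), NOW),
    ("Beachy", date(1995, 6, 1), date(2000, 9, 3), date(2001, 2, 2), NOW),
    ("Bigtown", date(2000, 9, 3), NOW, date(2001, 2, 2), date(2001, 3, 31)),
    ("Mediumtown", date(2001, 4, 1), NOW, date(2001, 4, 1), NOW),
]


def overlaps(a_from, a_to, b_from, b_to):
    """Closed-interval intersection test."""
    return a_from <= b_to and b_from <= a_to


def cities(vt_from, vt_to, tt_from, tt_to):
    """Cities whose valid time and transaction time both intersect the query."""
    return [city for city, rvf, rvt, rtf, rtt in ROWS
            if overlaps(rvf, rvt, vt_from, vt_to)
            and overlaps(rtf, rtt, tt_from, tt_to)]


# Where does the system currently think John lives right now?
current = cities(NOW, NOW, NOW, NOW)
# Where does the system currently think John lived in 1999?
lived_1999 = cities(date(1999, 1, 1), date(1999, 12, 31), NOW, NOW)
# On 1 Jan 2000, where did the system think John was living in 1999?
believed_2000 = cities(date(1999, 1, 1), date(1999, 12, 31),
                       date(2000, 1, 1), date(2000, 1, 1))
```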
A naïve approach
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
vt_from : { type : "date", pattern : "yyyyMMdd" },
vt_to : { type : "date", pattern : "yyyyMMdd" },
tt_from : { type : "date", pattern : "yyyyMMdd" },
tt_to : { type : "date", pattern : "yyyyMMdd" }
}} '};
Using 4 dates
A naive approach
SELECT * FROM census WHERE lucene =
'{ filter : { type : "boolean",
    must : [
        { type : "boolean", should : [
            { type : "range", field : "vt_from", lower : "", upper : "",
              include_lower : true, include_upper : true },
            { type : "range", field : "vt_to", lower : "", upper : "",
              include_lower : true, include_upper : true },
            { type : "boolean", must : [
                { type : "range", field : "vt_from", upper : "", include_upper : true },
                { type : "range", field : "vt_to", lower : "", include_lower : true } ] }
        ] },
        { type : "boolean", should : [
            { type : "range", field : "tt_from", lower : "", upper : "",
              include_lower : true, include_upper : true },
            { type : "range", field : "tt_to", lower : "", upper : "",
              include_lower : true, include_upper : true },
            { type : "boolean", must : [
                { type : "range", field : "tt_from", upper : "", include_upper : true },
                { type : "range", field : "tt_to", lower : "", include_lower : true } ] }
        ] }
    ] } }' AND person = 'John Doe';
A naive approach: Issues
•  Very difficult to understand/build the query.
•  Representing the now value (∞) as Long.MAX_VALUE is costly.
A spatial approach
CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema' : '{ fields : {
vt: {
type : "date_range", pattern : "yyyyMMdd",
from : "vt_from", to : "vt_to"
},
tt: {
type : "date_range", pattern : "yyyyMMdd",
from : "tt_from", to : "tt_to"
}
} } '};
Using 2 date ranges
A spatial approach
SELECT * FROM census WHERE lucene =
'{ filter : {
type : "boolean",
must : [
{
type : "date_range", field : "vt",
from : "20150501", to : "99999999",
operation : "intersects"
},
{
type : "date_range", field : "tt",
from : "20150501", to : "99999999",
operation : "intersects"
}
] } }';
A spatial approach: performance issues
•  Very difficult to understand/build the query.
•  Representing the now value (∞) as Long.MAX_VALUE is costly.
4R-Tree to the rescue
•  Based on Bliujute, R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of general bitemporal data.
•  The Now Value is never stored.
•  The data is stored in 4 R-Trees.
•  Queries are transformed and distributed among the trees.
4R-Tree to the rescue: storing data
•  R1: TT_TO==NOW && VT_TO==NOW
•  R2: TT_TO==NOW && VT_TO!=NOW
•  R3: TT_TO!=NOW && VT_TO==NOW
•  R4: TT_TO!=NOW && VT_TO!=NOW
[Figure: depending on the tree, records are stored as Point(vt_from, tt_from), Line(vt_from,vt_to,tt_to), Line(vt_from,vt_to,tt_to) or Rectangle(vt_from,vt_to,tt_from,tt_to)]
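The storing rule amounts to a two-flag classification on whether each end date is still open. A minimal Python sketch, assuming `None` marks the open-ended now value (the slides only say that it is never stored); `tree_for` is a hypothetical helper:

```python
NOW = None  # assumption: None marks an open-ended ("now") end date


def tree_for(vt_to, tt_to):
    """Pick the tree for a record from its two end dates, following the four cases."""
    if tt_to is NOW and vt_to is NOW:
        return "R1"
    if tt_to is NOW:
        return "R2"
    if vt_to is NOW:
        return "R3"
    return "R4"


# End dates (vt_to, tt_to) in yyyyMMdd form
mediumtown = tree_for(NOW, NOW)             # both ends open
beachy = tree_for("20000903", NOW)          # valid time closed, transaction time open
smallville = tree_for(NOW, "19941226")      # valid time open, transaction time closed
closed = tree_for("20000903", "20010331")   # hypothetical record with both ends closed
```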
4R-Tree to the rescue: searching data
IF (TT_FROM!=NOW) && (TT_TO >= VT_FROM):
searchR1(0, TT_TO, 0,VT_TO) U
searchR2(0, TT_TO, VT_FROM,VT_TO) U
searchR3(max(TT_FROM,VT_FROM),TT_TO,0,VT_TO)U
searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO)
IF (TT_FROM!=NOW) && (TT_TO < VT_FROM):
searchR2(0, TT_TO, VT_FROM,VT_TO) U
searchR4(TT_FROM,TT_TO, VT_FROM, VT_TO)
IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO >= VT_FROM):
searchR1(0, TT_TO, 0,VT_TO) U searchR2(0, TT_TO, VT_FROM,VT_TO)
IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]≠[0,MAX]) && (TT_TO < VT_FROM):
searchR2(0, TT_TO, VT_FROM,VT_TO)
IF (TT_FROM==NOW) && ([VT_FROM,VT_TO]=[0,MAX]):
R1 U R2
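The case analysis can be transcribed as a small planning function that returns which trees to search and with which bounds. This Python sketch only mirrors the branches listed on the slide; it is not the plugin's code. It assumes that NOW and MAX are both represented by Long.MAX_VALUE, and `plan` is a hypothetical name:

```python
MAX = 2**63 - 1  # assumption: Long.MAX_VALUE as the upper bound of the time axis
NOW = MAX        # assumption: the now value uses the same sentinel


def plan(vt_from, vt_to, tt_from, tt_to):
    """Return the sub-searches (tree, tt_from, tt_to, vt_from, vt_to) to union."""
    r1 = ("R1", 0, tt_to, 0, vt_to)
    r2 = ("R2", 0, tt_to, vt_from, vt_to)
    r3 = ("R3", max(tt_from, vt_from), tt_to, 0, vt_to)
    r4 = ("R4", tt_from, tt_to, vt_from, vt_to)

    if tt_from != NOW:
        return [r1, r2, r3, r4] if tt_to >= vt_from else [r2, r4]
    if (vt_from, vt_to) == (0, MAX):
        # The whole valid-time axis is requested: every record in R1 and R2
        return [("R1",), ("R2",)]
    return [r1, r2] if tt_to >= vt_from else [r2]


# Hypothetical numeric timestamps, only to exercise each branch
all_trees = plan(vt_from=10, vt_to=20, tt_from=5, tt_to=30)
two_trees = plan(vt_from=10, vt_to=20, tt_from=5, tt_to=8)
current_only = plan(vt_from=10, vt_to=20, tt_from=NOW, tt_to=NOW)
everything_current = plan(vt_from=0, vt_to=MAX, tt_from=NOW, tt_to=NOW)
```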
4R-Tree to the rescue
•  Problem: Lucene does not support R-Trees
•  Our solution:
- Use 2 DateRangePrefixTrees for each R-Tree
•  Future work: experiment with other Lucene spatial trees and strategies.
Indexing bitemporal data
CREATE TABLE census (
    person text,
    city text,
    vt_from text,
    vt_to text,
    tt_from text,
    tt_to text,
    lucene text,
    PRIMARY KEY ((person), vt_from, tt_from)
);

CREATE CUSTOM INDEX census_idx
ON census (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'schema' : '{
        fields : {
            bitemporal : {
                type : "bitemporal",
                vt_from : "vt_from",
                vt_to : "vt_to",
                tt_from : "tt_from",
                tt_to : "tt_to",
                pattern : "yyyyMMdd",
                now_value : "99999999"
            },
            city : { type : "string" }
        }
    }'
};
Searching for bitemporal data, several queries
Where does the system currently think that John lives right now?

SELECT * FROM census WHERE lucene =
'{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "99999999",
        vt_to : "99999999",
        tt_from : "99999999",
        tt_to : "99999999"
    }
}' AND person = 'John Doe';

person  city        vt_from     vt_to  tt_from     tt_to
John    Mediumtown  1-Apr-2001  ∞      1-Apr-2001  ∞
Searching for bitemporal data
Where does the system currently think that John lived in 1999?

SELECT * FROM census WHERE lucene =
'{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "19990101",
        vt_to : "19991231",
        tt_from : "99999999",
        tt_to : "99999999"
    }
}' AND person = 'John Doe';

person  city    vt_from     vt_to       tt_from     tt_to
John    Beachy  1-Jun-1995  3-Sep-2000  2-Feb-2001  ∞
Searching for bitemporal data
On 01-Jan-2000, where did the system think John was living back in 1999?

SELECT * FROM census WHERE lucene =
'{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "19990101",
        vt_to : "19991231",
        tt_from : "20000101",
        tt_to : "20000101"
    }
}' AND person = 'John Doe';

person  city     vt_from      vt_to  tt_from      tt_to
John    Bigtown  26-Aug-1994  ∞      27-Dec-1994  1-Feb-2001
Searching for bitemporal data
Who currently lives in Smallville?

SELECT * FROM census WHERE lucene =
'{
    filter : {
        type : "boolean",
        must : [
            { type : "bitemporal", field : "bitemporal",
              vt_from : "99999999", vt_to : "99999999",
              tt_from : "99999999", tt_to : "99999999" },
            { type : "match", field : "city", value : "Smallville" }
        ]
    }
}';
CONCLUSIONS
Conclusions
•  Pluggable Lucene features in Cassandra
•  Basic geospatial features
•  Date/Time durations
•  Bitemporal data model indexing
•  Compatible with MapReduce frameworks
•  Preserves Cassandra's functionality
It's open source
github.com/stratio/cassandra-lucene-index
•  Published as plugin for Apache Cassandra
•  Apache License Version 2.0
BIG DATA
CHILD'S PLAY
Andrés de la Peña
andres@stratio.com
@a_de_la_pena
Eduardo Alonso
eduardoalonso@stratio.com
@eAlonsoDB

Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Geospatial and bitemporal search in Cassandra with pluggable Lucene index

  • 1. GEOSPATIAL AND BITEMPORAL SEARCH IN C* WITH PLUGGABLE LUCENE INDEX. Eduardo Alonso (eduardoalonso@stratio.com, @eAlonsoDB) and Andrés de la Peña (andres@stratio.com, @a_de_la_pena)
  • 2. WHO WE ARE • Stratio is a Big Data Company • Certified Spark distribution • Founded in 2013 • 120+ employees in Madrid • Offices in Madrid and San Francisco #CassandraSummit 2015
  • 3. CONTENTS: (1) Pluggable Lucene based 2i (2) Geospatial Search (3) Bitemporal Indexes
  • 5. Cassandra query methods [diagram: primary key, secondary indexes and token ranges plotted by throughput and expressiveness]
  • 6. Cassandra query methods by use case [diagram: primary key, secondary indexes and token ranges placed against two use cases, Real time and Analytics]
  • 7. Cassandra query methods trade offs. Real time: • Pure-range queries limited to partition • No Boolean logic • No full text search • Sorting limited to partition. Analytics: • Full-table scan • High load • High latency • Low concurrency
  • 8. A third use case: Search, between Real-time and Analytics • Not as fast as primary key queries • Not as expressive as map reduce • Search can be used for both cases
  • 9. CQL + Lucene: a Lucene based secondary index implementation
  • 10. A Lucene based secondary index implementation • Proven stable and fast indexing solution • Expressive queries - Multivariable, ranges, full text, sorting, top-k, etc. • Mature distributed search solutions built on top of it - Solr, ElasticSearch • Just a small embeddable library • Easily extensible • Published under the Apache License
  • 11. Cassandra query methods. Real time (primary key): • Low expressiveness • Low latency • Low load. Search (secondary indexes): • Mid expressiveness • Mid latency • Mid load. Analytics (token ranges): • High expressiveness • High latency • High load
  • 12. A Lucene based secondary index implementation • Each node indexes its own data • Keeps the P2P architecture • Distribution and replication managed by C* • Just a single pluggable JAR file - CASSANDRA-8717 [diagram: a client and three C* nodes, each JVM holding its own Lucene index]
  • 13. Create index • Built in the background at any moment • Real time updates • Mapping eases ETL • Language aware CREATE TABLE tweets ( id bigint, created timestamp, message text, userid bigint, username text, PRIMARY KEY (userid, created, id) ); ALTER TABLE tweets ADD lucene TEXT; CREATE CUSTOM INDEX tweets_idx ON tweets (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '10', 'schema' : '{ fields : { created : {type : "date", pattern : "yyyy-MM-dd"}, message : {type : "text", analyzer : "english"}, userid : {type : "string"}, username : {type : "string"} } }' };
  • 14. Searching for rows SELECT * FROM tweets WHERE lucene = '{ filter : { type : "boolean", must : [ {type : "range", field : "created", lower : "2015-01-01"}, {type : "wildcard", field : "username", value : "a*"} ], not : [ {type : "match", field : "username", value : "andres"} ] }, sort : { fields: [ {field : "created", reverse : true}, {field : "username", reverse : false} ] } }' LIMIT 10000;
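Searches like the one on slide 14 are easier to assemble programmatically than to write by hand. The following Python sketch builds the search string with the standard `json` module. The helper name `lucene_search` is hypothetical, the field names follow the schema on slide 13, and the sketch assumes the index accepts standard JSON with quoted keys in addition to the relaxed syntax used on the slides.

```python
import json


def lucene_search(table, filter_, sort_fields=None, limit=None):
    """Build a CQL SELECT whose lucene column carries a JSON search (hypothetical helper)."""
    search = {"filter": filter_}
    if sort_fields:
        search["sort"] = {"fields": sort_fields}
    # CQL string literals escape a single quote by doubling it.
    payload = json.dumps(search).replace("'", "''")
    cql = f"SELECT * FROM {table} WHERE lucene = '{payload}'"
    if limit is not None:
        cql += f" LIMIT {int(limit)}"
    return cql + ";"


query = lucene_search(
    "tweets",
    {
        "type": "boolean",
        "must": [
            {"type": "range", "field": "created", "lower": "2015-01-01"},
            {"type": "wildcard", "field": "username", "value": "a*"},
        ],
        "not": [{"type": "match", "field": "username", "value": "andres"}],
    },
    sort_fields=[{"field": "created", "reverse": True}],
    limit=10000,
)
print(query)
```

The table name is interpolated as-is, so this helper should only be called with trusted identifiers.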
  • 15. Integrating Lucene & Spark • Compute large amounts of data • Filtering push-down • Avoid systematic full scan • Reduces the amount of data to be processed [diagram: a client and a Spark master in front of three C* nodes, each with its own Lucene index]
  • 16. Index performance in Spark [chart: seconds (0 to 2500) against millions of collected rows (0 to 100), comparing index and full scan]
  • 18. Lucene spatial module • Spatial4J shapes - Points, rectangles, circles, etc. • Spatial search strategies - BBox, RecursivePrefixTree, PointVector, etc. • Not only geographical data - Numbers, dates • It can be combined with other searches
  • 19. Indexing geographical locations • No native shape data types in CQL • Many-to-one column mapping • Just points. For now. CREATE TABLE restaurants ( name text PRIMARY KEY, stars bigint, lat double, lon double, lucene text ); CREATE CUSTOM INDEX restaurants_idx ON restaurants (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { location : { type : "geo_point", latitude : "lat", longitude : "lon" }, stars : { type : "integer" } } }' };
  • 20. Bounding box search SELECT * FROM restaurants WHERE lucene = '{ filter : { type : "geo_bbox", field : "location", min_latitude : 40.425978, max_latitude : 40.445886, min_longitude : -3.808252, max_longitude : -3.770999 } }';
  • 21. Distance search SELECT * FROM restaurants WHERE lucene = '{ filter : { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, min_distance : "100m", max_distance : "2km" } }';
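To make the meaning of `min_distance` and `max_distance` on slide 21 concrete, here is a minimal Python sketch of a ring test around the query point. It uses the haversine great-circle formula and treats both bounds as inclusive; both choices are assumptions made for illustration, since the slides do not say how the plugin computes distances. The function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def in_distance_ring(lat, lon, center_lat, center_lon, min_m, max_m):
    """True when the point lies between min_m and max_m metres from the centre."""
    return min_m <= haversine_m(lat, lon, center_lat, center_lon) <= max_m


center = (40.443270, -3.800498)        # query point from slide 21
nearby = (40.443270 + 0.01, -3.800498)  # hypothetical restaurant, 0.01 degrees further north

print(round(haversine_m(*nearby, *center)))            # distance in metres
print(in_distance_ring(*nearby, *center, 100, 2000))   # inside the 100 m to 2 km ring
print(in_distance_ring(*center, *center, 100, 2000))   # the centre itself is closer than 100 m
```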
  • 22. Combining geospatial searches SELECT * FROM restaurants WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "geo_distance", field : "location", latitude : 40.443270, longitude : -3.800498, max_distance : "10km" }, { type : "range", field : "stars", lower : 2, upper : 4 } ] } }';
  • 23. Lucene spatial is not only geospatial… • General geometry • Numeric ranges - NumberRangePrefixTree • Date ranges/durations - DateRangePrefixTree
  • 24. Temporal/Date durations • A pair composed of a start date and a stop date - Can be indexed as points in a 2D space • David Smiley's DateRangePrefixTree - Levels for common date ranges: years, months, days… - Spatial operations: intersects, is_within, contains [diagram: the three operations illustrated on a range from 27 Nov 2015 to 29 Dec 2015]
  • 25. Indexing date ranges • No native date range type in CQL • Many-to-one column mapping • Spatial operations CREATE TABLE breakdowns ( system text PRIMARY KEY, cause text, start_date timestamp, stop_date timestamp, lucene text ); CREATE CUSTOM INDEX breakdowns_idx ON breakdowns (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { duration : { type : "date_range", from : "start_date", to : "stop_date", pattern : "yyyy-MM-dd" }, cause : { type : "string" } } }' };
  • 26. Searching for date ranges SELECT * FROM breakdowns WHERE lucene = '{ filter : { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "intersects" } }'; SELECT * FROM breakdowns WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "date_range", field : "duration", from : "2015-01-01", to : "2015-01-05", operation : "is_within" }, { type : "match", field : "cause", value : "human error" } ] } }';
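The three operations named on slide 24 can be read as plain interval predicates. The Python sketch below spells them out for closed date intervals. It assumes that `is_within` means the indexed range lies inside the query range and that `contains` means the indexed range covers the query range; the breakdown dates are made up for illustration.

```python
from datetime import date


def intersects(indexed, query):
    """The two closed ranges share at least one day."""
    return indexed[0] <= query[1] and query[0] <= indexed[1]


def is_within(indexed, query):
    """The indexed range lies entirely inside the query range (assumed direction)."""
    return query[0] <= indexed[0] and indexed[1] <= query[1]


def contains(indexed, query):
    """The indexed range entirely covers the query range (assumed direction)."""
    return indexed[0] <= query[0] and query[1] <= indexed[1]


# Query range used on slide 26.
query = (date(2015, 1, 1), date(2015, 1, 5))

# Hypothetical breakdowns as (start_date, stop_date).
overlapping = (date(2015, 1, 3), date(2015, 1, 10))
inside = (date(2015, 1, 2), date(2015, 1, 4))
covering = (date(2014, 12, 1), date(2015, 2, 1))

for name, rng in [("overlapping", overlapping), ("inside", inside), ("covering", covering)]:
    print(name, intersects(rng, query), is_within(rng, query), contains(rng, query))
```

Every range that satisfies `is_within` or `contains` also satisfies `intersects`, which is why `intersects` is the broadest of the three filters.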
  • 28. The bitemporal data model • Stores WHAT and WHEN • Supports corrections • Reproduces the business view of history as of a given point in time • Traces why a decision was made
  • 29. The bitemporal data model • Valid Time - The application period - WHAT happened, the real-world period of the fact • Transaction Time - The system period - WHEN the system considers it true
  • 30. The bitemporal data model: example (modified from Wikipedia, https://en.wikipedia.org/wiki/Temporal_database)
      person  city        vt_from      vt_to        tt_from      tt_to
      John    Smallville  3-Apr-1975   ∞            4-Apr-1975   26-Dec-1994
      John    Smallville  3-Apr-1975   25-Aug-1994  27-Dec-1994  ∞
      John    Bigtown     26-Aug-1994  ∞            27-Dec-1994  1-Feb-2001
      John    Bigtown     26-Aug-1994  30-May-1995  2-Feb-2001   ∞
      John    Beachy      1-Jun-1995   3-Sep-2000   2-Feb-2001   ∞
      John    Bigtown     3-Sep-2000   ∞            2-Feb-2001   31-Mar-2001
      John    Mediumtown  1-Apr-2001   ∞            1-Apr-2001   ∞
  • 31. A naive approach: using 4 dates CREATE CUSTOM INDEX census_idx ON census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { vt_from : { type : "date", pattern : "yyyyMMdd" }, vt_to : { type : "date", pattern : "yyyyMMdd" }, tt_from : { type : "date", pattern : "yyyyMMdd" }, tt_to : { type : "date", pattern : "yyyyMMdd" } } }' };
  • 32. A naive approach (bounds left empty as on the slide) SELECT * FROM census WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "boolean", should : [ { type : "range", field : "vt_from", lower : "", upper : "", include_lower : true, include_upper : true }, { type : "range", field : "vt_to", lower : "", upper : "", include_lower : true, include_upper : true }, { type : "boolean", must : [ { type : "range", field : "vt_from", upper : "", include_upper : true }, { type : "range", field : "vt_to", lower : "", include_lower : true } ] } ] }, { type : "boolean", should : [ { type : "range", field : "tt_from", lower : "", upper : "", include_lower : true, include_upper : true }, { type : "range", field : "tt_to", lower : "", upper : "", include_lower : true, include_upper : true }, { type : "boolean", must : [ { type : "range", field : "tt_from", upper : "", include_upper : true }, { type : "range", field : "tt_to", lower : "", include_lower : true } ] } ] } ] } }' AND person = 'John';
  • 33. A naive approach: issues • The query is very difficult to understand and build. • Representing the now value (∞) as Long.MAX_VALUE is costly.
  • 34. A spatial approach: using 2 date ranges CREATE CUSTOM INDEX census_idx ON census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'schema' : '{ fields : { vt : { type : "date_range", pattern : "yyyyMMdd", from : "vt_from", to : "vt_to" }, tt : { type : "date_range", pattern : "yyyyMMdd", from : "tt_from", to : "tt_to" } } }' };
  • 35. A spatial approach SELECT * FROM census WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "date_range", field : "vt", from : "20150501", to : "99999999", operation : "intersects" }, { type : "date_range", field : "tt", from : "20150501", to : "99999999", operation : "intersects" } ] } }';
  • 36. A spatial approach: performance issues • The query is very difficult to understand and build. • Representing the now value (∞) as Long.MAX_VALUE is costly.
  • 37. 4R-Tree to the rescue • Based on Bliujute, R., Jensen, C. S., & Slivinskas, G. (2000). Light-weight indexing of general bitemporal data • The now value is never stored. • The data is stored in 4 R-Trees. • Queries are transformed and distributed among the trees.
  • 38. 4R-Tree to the rescue: storing data • R1 (TT_TO == NOW && VT_TO == NOW): Point(vt_from, tt_from) • R2 (TT_TO == NOW && VT_TO != NOW): Line(vt_from, vt_to, tt_from) • R3 (TT_TO != NOW && VT_TO == NOW): Line(vt_from, tt_from, tt_to) • R4 (TT_TO != NOW && VT_TO != NOW): Rectangle(vt_from, vt_to, tt_from, tt_to)
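The four conditions on slide 38 amount to a simple routing rule: a row goes to one of four trees depending on which of its end timestamps are still open. A minimal Python sketch of that rule follows. The function name `choose_tree` is hypothetical, and the now marker is assumed to be the `now_value` string from the index definition on slide 42.

```python
NOW = "99999999"  # assumed now marker, taken from now_value on slide 42


def choose_tree(vt_to, tt_to, now=NOW):
    """Return the name of the tree (R1..R4) that should store a row."""
    tt_open = tt_to == now  # transaction time still open
    vt_open = vt_to == now  # valid time still open
    if tt_open and vt_open:
        return "R1"
    if tt_open:
        return "R2"
    if vt_open:
        return "R3"
    return "R4"


# (vt_to, tt_to) pairs in yyyyMMdd form; the first three come from the example table.
print(choose_tree("99999999", "99999999"))  # Mediumtown row: both ends open
print(choose_tree("19940825", "99999999"))  # second Smallville row: only tt open
print(choose_tree("99999999", "19941226"))  # first Smallville row: only vt open
print(choose_tree("19950530", "20010201"))  # hypothetical fully closed row
```

Because an open end is encoded by the choice of tree, no tree ever has to store the now marker itself.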
  • 39. 4R-Tree to the rescue: searching data. IF (TT_FROM != NOW) && (TT_TO >= VT_FROM): searchR1(0, TT_TO, 0, VT_TO) U searchR2(0, TT_TO, VT_FROM, VT_TO) U searchR3(max(TT_FROM, VT_FROM), TT_TO, 0, VT_TO) U searchR4(TT_FROM, TT_TO, VT_FROM, VT_TO); IF (TT_FROM != NOW) && (TT_TO < VT_FROM): searchR2(0, TT_TO, VT_FROM, VT_TO) U searchR4(TT_FROM, TT_TO, VT_FROM, VT_TO); IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] ≠ [0, MAX]) && (TT_TO >= VT_FROM): searchR1(0, TT_TO, 0, VT_TO) U searchR2(0, TT_TO, VT_FROM, VT_TO); IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] ≠ [0, MAX]) && (TT_TO < VT_FROM): searchR2(0, TT_TO, VT_FROM, VT_TO); IF (TT_FROM == NOW) && ([VT_FROM, VT_TO] = [0, MAX]): R1 U R2
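The five cases on slide 39 can be transcribed into a small planning function that returns which trees to search and with which arguments. The Python sketch below is a literal transcription of the slide's pseudocode, not the plugin's code: the name `plan_search`, the integer yyyyMMdd encoding, and the separate `now` and `max_value` parameters are assumptions, and argument tuples are kept in the order the slide lists them.

```python
def plan_search(vt_from, vt_to, tt_from, tt_to, now, max_value):
    """Return a list of (tree, args) pairs; args is None when the whole tree is taken."""
    r1 = ("R1", (0, tt_to, 0, vt_to))
    r2 = ("R2", (0, tt_to, vt_from, vt_to))
    r3 = ("R3", (max(tt_from, vt_from), tt_to, 0, vt_to))
    r4 = ("R4", (tt_from, tt_to, vt_from, vt_to))

    if tt_from != now:
        # Cases 1 and 2: the query asks about a past transaction time.
        return [r1, r2, r3, r4] if tt_to >= vt_from else [r2, r4]
    if (vt_from, vt_to) != (0, max_value):
        # Cases 3 and 4: current transaction time, bounded valid time.
        return [r1, r2] if tt_to >= vt_from else [r2]
    # Case 5: current transaction time over the whole valid-time axis.
    return [("R1", None), ("R2", None)]


NOW = 99999999        # assumed now marker
MAX_VALUE = 99991231  # assumed upper bound of the valid-time axis

# "On 1 Jan 2000, where did the system think John lived during 1999?"
for tree, args in plan_search(19990101, 19991231, 20000101, 20000101, NOW, MAX_VALUE):
    print(tree, args)
```

Only the trees returned by the plan need to be queried, and the union of their results is the answer.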
  • 40. 4R-Tree to the rescue: implementation • Problem: Lucene has no R-Tree support • Our solution: - Use 2 DateRangePrefixTrees for each R-Tree • Future work: experiment with other Lucene spatial trees and strategies.
  • 41. The bitemporal data model: example (modified from Wikipedia, https://en.wikipedia.org/wiki/Temporal_database)
      person  city        vt_from      vt_to        tt_from      tt_to
      John    Smallville  3-Apr-1975   ∞            4-Apr-1975   26-Dec-1994
      John    Smallville  3-Apr-1975   25-Aug-1994  27-Dec-1994  ∞
      John    Bigtown     26-Aug-1994  ∞            27-Dec-1994  1-Feb-2001
      John    Bigtown     26-Aug-1994  30-May-1995  2-Feb-2001   ∞
      John    Beachy      1-Jun-1995   3-Sep-2000   2-Feb-2001   ∞
      John    Bigtown     3-Sep-2000   ∞            2-Feb-2001   31-Mar-2001
      John    Mediumtown  1-Apr-2001   ∞            1-Apr-2001   ∞
  • 42. Indexing bitemporal data CREATE TABLE census ( person text, city text, vt_from text, vt_to text, tt_from text, tt_to text, lucene text, PRIMARY KEY ((person), vt_from, tt_from) ); CREATE CUSTOM INDEX census_idx ON census (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'schema' : '{ fields : { bitemporal : { type : "bitemporal", vt_from : "vt_from", vt_to : "vt_to", tt_from : "tt_from", tt_to : "tt_to", pattern : "yyyyMMdd", now_value : "99999999" }, city : { type : "string" } } }' };
  • 43. Searching for bitemporal data, several queries. Where does the system currently think that John lives right now? SELECT * FROM census WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" } }' AND person = 'John'; Result: person = John, city = Mediumtown, vt_from = 1-Apr-2001, vt_to = ∞, tt_from = 1-Apr-2001, tt_to = ∞
  • 44. Searching for bitemporal data. Where does the system currently think that John lived in 1999? SELECT * FROM census WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "19990101", vt_to : "19991231", tt_from : "99999999", tt_to : "99999999" } }' AND person = 'John'; Result: person = John, city = Beachy, vt_from = 1-Jun-1995, vt_to = 3-Sep-2000, tt_from = 2-Feb-2001, tt_to = ∞
  • 45. Searching for bitemporal data. On 01-Jan-2000, where did the system think John was living back in 1999? SELECT * FROM census WHERE lucene = '{ filter : { type : "bitemporal", field : "bitemporal", vt_from : "19990101", vt_to : "19991231", tt_from : "20000101", tt_to : "20000101" } }' AND person = 'John'; Result: person = John, city = Bigtown, vt_from = 26-Aug-1994, vt_to = ∞, tt_from = 27-Dec-1994, tt_to = 1-Feb-2001
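The answers to the three questions on slides 43 to 45 can be checked by hand against the example table. The Python sketch below does so with a brute-force scan: a row matches when both its valid-time and its transaction-time ranges intersect the queried ranges. This is only a reference check and not how the index evaluates the search; `date.max` stands in for the now value (∞), and the function names are hypothetical.

```python
from datetime import date

NOW = date.max  # stand-in for the now value (∞ in the table)

# (city, vt_from, vt_to, tt_from, tt_to), copied from the example table.
ROWS = [
    ("Smallville", date(1975, 4, 3), NOW, date(1975, 4, 4), date(1994, 12, 26)),
    ("Smallville", date(1975, 4, 3), date(1994, 8, 25), date(1994, 12, 27), NOW),
    ("Bigtown", date(1994, 8, 26), NOW, date(1994, 12, 27), date(2001, 2, 1)),
    ("Bigtown", date(1994, 8, 26), date(1995, 5, 30), date(2001, 2, 2), NOW),
    ("Beachy", date(1995, 6, 1), date(2000, 9, 3), date(2001, 2, 2), NOW),
    ("Bigtown", date(2000, 9, 3), NOW, date(2001, 2, 2), date(2001, 3, 31)),
    ("Mediumtown", date(2001, 4, 1), NOW, date(2001, 4, 1), NOW),
]


def intersects(a_from, a_to, b_from, b_to):
    """Closed ranges [a_from, a_to] and [b_from, b_to] share at least one day."""
    return a_from <= b_to and b_from <= a_to


def bitemporal_search(vt_from, vt_to, tt_from, tt_to):
    """Cities whose valid-time and transaction-time ranges both intersect the query."""
    return [
        city
        for city, row_vt_from, row_vt_to, row_tt_from, row_tt_to in ROWS
        if intersects(row_vt_from, row_vt_to, vt_from, vt_to)
        and intersects(row_tt_from, row_tt_to, tt_from, tt_to)
    ]


print(bitemporal_search(NOW, NOW, NOW, NOW))                                # slide 43
print(bitemporal_search(date(1999, 1, 1), date(1999, 12, 31), NOW, NOW))     # slide 44
print(bitemporal_search(date(1999, 1, 1), date(1999, 12, 31),
                        date(2000, 1, 1), date(2000, 1, 1)))                # slide 45
```

The scan returns Mediumtown, Beachy and Bigtown respectively, matching the result rows shown on the slides.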
  • 46. Searching for bitemporal data. Who currently lives in Smallville? SELECT * FROM census WHERE lucene = '{ filter : { type : "boolean", must : [ { type : "bitemporal", field : "bitemporal", vt_from : "99999999", vt_to : "99999999", tt_from : "99999999", tt_to : "99999999" }, { type : "match", field : "city", value : "Smallville" } ] } }';
  • 48. Conclusions • Pluggable Lucene features in Cassandra • Basic geospatial features • Date/Time durations • Bitemporal data model indexing • Compatible with MapReduce frameworks • Preserves Cassandra's functionality
  • 49. It's open source: github.com/stratio/cassandra-lucene-index • Published as a plugin for Apache Cassandra • Apache License Version 2.0
  • 50. BIG DATA CHILD'S PLAY. Andrés de la Peña (andres@stratio.com, @a_de_la_pena) and Eduardo Alonso (eduardoalonso@stratio.com, @eAlonsoDB)