Duy Hai Doan presented information on SASI (SSTable Attached Secondary Index), a new secondary indexing technique for Apache Cassandra. SASI indexes follow the life cycle of SSTables and use new data structures such as radix trees. They allow full-text search without a dependency on Lucene. The presentation covered how SASI indexes are distributed, their query planning and optimizations, and benchmarks showing their performance. While not intended for complex search or analytics, SASI can address 80% of typical search use cases within Cassandra.
Cassandra SASI Index Provides Full-Text Search
1. @doanduyhai
SASI, Cassandra on the full text search ride
DuyHai DOAN
Apache Cassandra Evangelist
AMSTERDAM 11-12 MAY 2016
2. @doanduyhai
Who Am I ?
Duy Hai DOAN
Apache Cassandra Evangelist
• talks, meetups, confs
• open-source devs (Achilles, Apache Zeppelin…)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
3. @doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 450+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
4. SASI Index
• What is SASI ?
• Distributed Index
• Life-cycle
• Query Planner
7. @doanduyhai
How ?
New secondary index re-designed from scratch
• follows the SSTable life-cycle (flush, compaction)
• new data structures
• full text search options
• no dependency on Apache Lucene
SASI = SSTable-Attached Secondary Index
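As a rough illustration of the points above, a SASI index is declared in CQL as a custom index backed by the SASIIndex class. The sketch below assumes a hypothetical users table with a country column; the table, column, and index names are chosen to match the later slides and are not taken from the speaker's schema.

```cql
-- Hypothetical table, for illustration only
CREATE TABLE users (
    user_id    uuid PRIMARY KEY,
    user_email text,
    country    text
);

-- SASI indexes are created as custom indexes backed by the SASIIndex class
CREATE CUSTOM INDEX users_country_idx ON users (country)
USING 'org.apache.cassandra.index.sasi.SASIIndex';

-- Equality lookup served by the index
SELECT * FROM users WHERE country = 'FR';
```

No Lucene library or external search service is involved: the index files are written alongside the SSTables by Cassandra itself.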
11. @doanduyhai
Index on user country
[Diagram: cluster ring of nodes A to H, with sample local index entries]
FR user1 user102 … user493
US user54 user483 … user938
FR user87 user176 … user987
FR user17 user409 … user787
19. @doanduyhai
Caveat 2: 1-to-1 index (user_email)
[Diagram: cluster ring of nodes A to H]
coordinator
Not found WHERE user_email
LIKE '%xxx%'
20. @doanduyhai
Caveat 2: 1-to-1 index (user_email)
[Diagram: cluster ring of nodes A to H]
coordinator
Still no result
WHERE user_email
LIKE '%xxx%'
21. @doanduyhai
Caveat 2: 1-to-1 index (user_email)
[Diagram: cluster ring of nodes A to H]
coordinator
At best 1 user found
At worst 0 users found
WHERE user_email
LIKE '%xxx%'
22. @doanduyhai
Caveat 2 solution: use materialized views
For 1-to-1 index/relationship, use materialized views instead
CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE user_id IS NOT NULL and user_email IS NOT NULL
PRIMARY KEY (user_email, user_id)
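With such a view in place, an exact-match lookup by email becomes a read against the view's partition key instead of an index query fanned out across the cluster. A minimal sketch, assuming the users table has user_id as its primary key and using a made-up email value:

```cql
-- Exact-match lookup: user_email is the partition key of the view
SELECT * FROM user_by_email WHERE user_email = 'jdoe@example.com';
```

This only covers exact matches on the email; a LIKE '%xxx%' search on user_email would still need an index.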
24. @doanduyhai
Caveat 3 solution: use co-located Apache Spark
[Diagram: cluster ring of nodes A to H]
Local index filtering in Cassandra
Aggregation in Spark
Local index query
27. @doanduyhai
SASI Life-cycle: in-memory
[Diagram: write path]
1. Commit log (Commit log1 … Commit logn)
2. Memory: MemTable Table1, MemTable Table2 … MemTable TableN
3. Index MemTable1, Index MemTable2 … Index MemTableN
ACK the client
28. @doanduyhai
IndexMemtable
Index mode, data type   Data structure               Usage
PREFIX, text            Guava ConcurrentRadixTree    name LIKE 'John%'
CONTAINS, text          Guava ConcurrentSuffixTree   name LIKE '%John%', name LIKE '%ny'
PREFIX, other           JDK ConcurrentSkipListSet    age = 20, age >= 20 AND age <= 30
SPARSE, other           JDK ConcurrentSkipListSet    age = 20, age >= 20 AND age <= 30
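The index mode listed above is chosen when the index is created, through the 'mode' option. The following is a sketch only: the first_name, last_name and age columns on a users table are hypothetical and used purely to show one index per mode and the kind of query each is meant to serve.

```cql
-- Hypothetical columns on a users table: first_name text, last_name text, age int

-- PREFIX mode on a text column: serves first_name LIKE 'John%'
CREATE CUSTOM INDEX users_first_name_idx ON users (first_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

-- CONTAINS mode on a text column: serves last_name LIKE '%John%' and LIKE '%ny'
CREATE CUSTOM INDEX users_last_name_idx ON users (last_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'};

-- PREFIX mode on a non-text column: serves equality and range predicates
CREATE CUSTOM INDEX users_age_idx ON users (age)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

SELECT * FROM users WHERE first_name LIKE 'John%';
SELECT * FROM users WHERE last_name LIKE '%ny';
SELECT * FROM users WHERE age >= 20 AND age <= 30;
```

Depending on the Cassandra version and on which other predicates are combined in the WHERE clause, some queries may additionally require ALLOW FILTERING.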
42. @doanduyhai
Hardware specs
13 bare-metal machines
• 6 CPU HT (12 vcores)
• 64 GB RAM
• 4 SSDs in RAID0 for a total of 1.5 TB
Data set
• 13 billion rows
• 1 numerical index with 36 distinct values
• 2 text indexes with 7 distinct values
• 1 text index with 3 distinct values
49. @doanduyhai
Conclusion
Is it available ?
• yes in Cassandra 3.5
Future enhancement ?
• index on collections (List, Set & Map) !
• OR clause (WHERE (xxx OR yyy) AND zzz )
• != operator
50. @doanduyhai
Conclusion
SASI vs Solr/ElasticSearch ?
• Cassandra is not a search engine !!! (database = durability)
• always slower because 2 passes (SASI index read + original Cassandra data)
• no scoring
• no ordering (ORDER BY)
• no grouping (GROUP BY) → Apache Spark for analytics
Still, SASI covers 80% of search use-cases and people are happy !