
SASI, Cassandra on full text search ride



New full text search index for Apache Cassandra

Published in: Technology


  1. SASI, Cassandra on full text search ride
     DuyHai DOAN, Apache Cassandra Evangelist
  2. Who Am I?
     Duy Hai DOAN, Apache Cassandra Evangelist
     •  talks, meetups, confs
     •  open-source devs (Achilles, Apache Zeppelin…)
     •  OSS Cassandra point of contact
     ☞ duy_hai.doan@datastax.com
     ☞ @doanduyhai
  3. DataStax
     •  Founded in April 2010
     •  We contribute a lot to Apache Cassandra™
     •  400+ customers (25 of the Fortune 100), 450+ employees
     •  Headquartered in the San Francisco Bay Area
     •  EU headquarters in London, offices in France and Germany
     •  DataStax Enterprise = OSS Cassandra + extra features
  4. SASI Index
     •  What is SASI?
     •  Distributed Index
     •  Life-cycle
     •  Query Planner
  5. What is SASI?
  6. Who?
     •  Open source contribution by a team of engineers
  7. How?
     New secondary index re-designed from scratch
     •  follows the SSTable life-cycle (flush, compaction)
     •  new data structures
     •  full text search options
     •  no dependency on Apache Lucene
     SASI = SSTable-Attached Secondary Index
  8. SASI Demo
  9. SASI Demo
  10. Distributed Index
  11. Index on user country
      (ring diagram of nodes A to H, with local index entries)
      FR → user1, user102, …, user493
      US → user54, user483, …, user938
      FR → user87, user176, …, user987
      FR → user17, user409, …, user787
  12. Distributed search query handling
      (ring diagram of nodes A to H, with coordinator)
      1st round: concurrency factor = 1
  13. Distributed search query handling
      Not enough results?
  14. Distributed search query handling
      2nd round: concurrency factor = 2
  15. Distributed search query handling
      Still not enough results?
  16. Distributed search query handling
      3rd round: concurrency factor = 4
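
The three rounds above can be sketched as a small simulation. This is only an illustration of the behaviour the slides depict (concurrency factor 1, then 2, then 4), not Cassandra's actual coordinator code: the function name `coordinator_search`, the `query_range` callback and the sample data are all assumptions made up for the example, and a real coordinator derives the factor from result estimates rather than blindly doubling it.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def coordinator_search(
    token_ranges: Sequence[str],
    query_range: Callable[[str], List[str]],
    limit: int,
) -> Tuple[List[str], int]:
    """Query token ranges round by round until `limit` rows are found.

    Hypothetical helper: each round contacts `concurrency_factor` ranges,
    and the factor doubles after every round (1, 2, 4, ...) as on the slides.
    Returns the rows found (capped at `limit`) and the number of rounds used.
    """
    results: List[str] = []
    concurrency_factor = 1
    position = 0
    rounds = 0
    while position < len(token_ranges) and len(results) < limit:
        batch = token_ranges[position:position + concurrency_factor]
        for token_range in batch:  # a real cluster would query these in parallel
            results.extend(query_range(token_range))
        position += len(batch)
        rounds += 1
        concurrency_factor *= 2
    return results[:limit], rounds


# Made-up sample data: which users each range returns for the query.
matches: Dict[str, List[str]] = {"B": ["user1"], "D": ["user54", "user87"]}
ring = list("ABCDEFGH")
rows, rounds = coordinator_search(ring, lambda r: matches.get(r, []), limit=2)
print(rows, rounds)
```

With a non-restrictive filter or a very selective one, the loop keeps widening until every range has been contacted, which is the situation the caveats on the next slides describe.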
  17. Caveat 1: query with non-restrictive filters
      (ring diagram of nodes A to H, with coordinator)
      Hit all nodes
  18. Caveat 1 solution: always use LIMIT
      SELECT * FROM … WHERE ... LIMIT 1000
  19. Caveat 2: 1-to-1 index (user_email)
      WHERE user_email LIKE '%xxx%'
      Not found
  20. Caveat 2: 1-to-1 index (user_email)
      WHERE user_email LIKE '%xxx%'
      Still no result
  21. Caveat 2: 1-to-1 index (user_email)
      WHERE user_email LIKE '%xxx%'
      At best 1 user found, at worst 0 users found
  22. Caveat 2 solution: use materialized views
      For a 1-to-1 index/relationship, use materialized views instead
      CREATE MATERIALIZED VIEW user_by_email AS
      SELECT * FROM users
      WHERE user_id IS NOT NULL AND user_email IS NOT NULL
      PRIMARY KEY (user_email, user_id)
  23. Caveat 3: fetch all rows for analytics use-case
      (ring diagram of nodes A to H, with coordinator)
      Hit all nodes
  24. Caveat 3 solution: use co-located Apache Spark
      (ring diagram of nodes A to H)
      •  Local index query
      •  Local index filtering in Cassandra
      •  Aggregation in Spark
  25. Q & A
  26. SASI Life-cycle
  27. SASI Life-cycle: in-memory
      (diagram)
      1. Commit log (Commit log1 … Commit logn)
      2. Memory: MemTable (Table1 … TableN) and Index MemTable (Index MemTable1 … Index MemTableN)
      3. ACK the client
  28. IndexMemtable
      Index mode, data type | Data structure | Usage
      PREFIX, text | Guava ConcurrentRadixTree | name LIKE 'John%'
      CONTAINS, text | Guava ConcurrentSuffixTree | name LIKE '%John%', name LIKE '%ny'
      PREFIX, other | JDK ConcurrentSkipListSet | age = 20, age >= 20 AND age <= 30
      SPARSE, other | JDK ConcurrentSkipListSet | age = 20, age >= 20 AND age <= 30
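
To make the table above more concrete, here is a rough Python sketch of why an ordered structure serves prefix and range queries, and why indexing suffixes serves `LIKE '%…%'`. It is not how SASI is implemented: a sorted list with `bisect` stands in for the radix tree and skip list, a list of all suffixes stands in for the suffix tree, and the class names `SortedTermIndex` and `ContainsIndex` are invented for this example.

```python
import bisect
from typing import List, Set


class SortedTermIndex:
    """Sorted (term, row_id) pairs; a stand-in for a radix tree or skip list."""

    def __init__(self) -> None:
        self._entries: List[tuple] = []

    def add(self, term, row_id: str) -> None:
        bisect.insort(self._entries, (term, row_id))

    def range(self, low, high) -> List[str]:
        """Row ids with low <= term <= high (covers `=` when low == high)."""
        start = bisect.bisect_left(self._entries, (low,))
        rows = []
        for term, row_id in self._entries[start:]:
            if term > high:
                break
            rows.append(row_id)
        return rows

    def prefix(self, prefix: str) -> List[str]:
        """Row ids whose term starts with `prefix` (LIKE 'prefix%')."""
        start = bisect.bisect_left(self._entries, (prefix,))
        rows = []
        for term, row_id in self._entries[start:]:
            if not term.startswith(prefix):
                break
            rows.append(row_id)
        return rows


class ContainsIndex:
    """Indexes every suffix of a term, so a substring search becomes a prefix search."""

    def __init__(self) -> None:
        self._suffixes = SortedTermIndex()

    def add(self, term: str, row_id: str) -> None:
        for i in range(len(term)):
            self._suffixes.add(term[i:], row_id)

    def contains(self, fragment: str) -> Set[str]:
        """Row ids whose term contains `fragment` (LIKE '%fragment%')."""
        return set(self._suffixes.prefix(fragment))


# Made-up rows for illustration.
names, name_fragments, ages = SortedTermIndex(), ContainsIndex(), SortedTermIndex()
for row_id, name, age in [("u1", "John", 20), ("u2", "Johnny", 25), ("u3", "Danny", 31)]:
    names.add(name, row_id)
    name_fragments.add(name, row_id)
    ages.add(age, row_id)

print(names.prefix("John"))           # name LIKE 'John%'
print(name_fragments.contains("ny"))  # name LIKE '%ny'
print(ages.range(20, 30))             # age >= 20 AND age <= 30
```

The suffix approach trades memory for flexibility: every term is stored once per suffix, which hints at why CONTAINS mode is the more expensive option.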
  29. SASI Life-cycle: flush to SSTable
      (diagram)
      4. Each MemTable (Table1 … Table3) is flushed to an SSTable (SSTable1 … SSTable3), each with its own OnDiskIndex (OnDiskIndex1 … OnDiskIndex3)
  30. SASI Life-cycle: compaction
      SSTable1 + SSTable2 + SSTable3 → New SSTable
      OnDiskIndex1 + OnDiskIndex2 + OnDiskIndex3 → New OnDiskIndex
  31. OnDiskIndex Files
      SSTable1: user_id4 FR, user_id1 US, user_id5 FR → OnDiskIndex1: FR, US
      SSTable2: user_id3 UK, user_id2 DE → OnDiskIndex2: UK, DE
  32. OnDiskIndex Files
      SSTable1: user_id4 FR, user_id1 US, user_id5 FR → OnDiskIndex1: FR, US
      SSTable2: user_id3 UK, user_id2 DE → OnDiskIndex2: UK, DE
      Suffix Tree data structures
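
The flush and compaction slides can be mimicked with plain dictionaries. The sketch below reuses the sample rows from the OnDiskIndex slides; the function names `build_on_disk_index` and `compact` are hypothetical, and letting later SSTables simply overwrite earlier ones is a simplification made for the example (Cassandra reconciles cells by timestamp).

```python
from typing import Dict, List, Tuple

SSTable = Dict[str, str]          # row key -> indexed value (country)
OnDiskIndex = Dict[str, List[str]]  # indexed value -> sorted row keys


def build_on_disk_index(sstable: SSTable) -> OnDiskIndex:
    """Build one index per SSTable, mapping each term to the rows that hold it."""
    index: OnDiskIndex = {}
    for row_key, country in sstable.items():
        index.setdefault(country, []).append(row_key)
    # Keep terms and row keys sorted, as an on-disk index would be.
    return {term: sorted(keys) for term, keys in sorted(index.items())}


def compact(*sstables: SSTable) -> Tuple[SSTable, OnDiskIndex]:
    """Merge SSTables into a new one and rebuild a single index for it."""
    merged: SSTable = {}
    for sstable in sstables:  # simplification: later SSTables overwrite earlier ones
        merged.update(sstable)
    return merged, build_on_disk_index(merged)


sstable1 = {"user_id4": "FR", "user_id1": "US", "user_id5": "FR"}
sstable2 = {"user_id3": "UK", "user_id2": "DE"}

index1 = build_on_disk_index(sstable1)  # OnDiskIndex1: FR, US
index2 = build_on_disk_index(sstable2)  # OnDiskIndex2: UK, DE
new_sstable, new_index = compact(sstable1, sstable2)
print(index1, index2, new_index)
```

Because each index is attached to exactly one SSTable, a query has to consult every live OnDiskIndex plus the in-memory index, and compaction reduces that number.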
  33. Q & A
  34. Query Planner
  35. Integrated query planner
      Perform optimizations on predicates
      1.  build predicates tree
      2.  predicates push-down & re-ordering
      3.  predicate fusions for != operator
  36. Query optimization example
      WHERE age < 100 AND first_name = 'p*' AND first_name != 'pa*' AND age > 21
  37. Query optimization example
      AND is associative and commutative
  38. Query optimization example
      != transformed to exclusion on range scan
  39. Query optimization example
      AND is associative and commutative
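
As a rough illustration of the optimization walked through above, the sketch below groups the predicates of the example query by column, merges the two `age` bounds into one range, and turns the `!=` into an exclusion attached to the `first_name` prefix scan. It is a toy model, not SASI's planner: the `FusedExpression` class, the `fuse` and `matches` functions, and the `"prefix"` / `"not_prefix"` operator names are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Predicate = Tuple[str, str, object]  # (column, operator, value)


@dataclass
class FusedExpression:
    """One scan per column: an open range, an optional prefix, and exclusions."""
    lower: Optional[object] = None   # exclusive lower bound
    upper: Optional[object] = None   # exclusive upper bound
    prefix: Optional[str] = None
    exclusions: List[str] = field(default_factory=list)


def fuse(predicates: List[Predicate]) -> Dict[str, FusedExpression]:
    """Group AND-ed predicates by column; order does not matter since AND commutes."""
    fused: Dict[str, FusedExpression] = {}
    for column, op, value in predicates:
        expr = fused.setdefault(column, FusedExpression())
        if op == ">":
            expr.lower = value if expr.lower is None else max(expr.lower, value)
        elif op == "<":
            expr.upper = value if expr.upper is None else min(expr.upper, value)
        elif op == "prefix":
            expr.prefix = value
        elif op == "not_prefix":
            expr.exclusions.append(value)  # != becomes an exclusion on the scan
        else:
            raise ValueError(f"unsupported operator: {op}")
    return fused


def matches(row: Dict[str, object], fused: Dict[str, FusedExpression]) -> bool:
    """Check a row against every fused per-column expression."""
    for column, expr in fused.items():
        value = row[column]
        if expr.lower is not None and not value > expr.lower:
            return False
        if expr.upper is not None and not value < expr.upper:
            return False
        if expr.prefix is not None and not value.startswith(expr.prefix):
            return False
        if any(value.startswith(excluded) for excluded in expr.exclusions):
            return False
    return True


# The example query from the slides, written as (column, operator, value).
plan = fuse([
    ("age", "<", 100),
    ("first_name", "prefix", "p"),
    ("first_name", "not_prefix", "pa"),
    ("age", ">", 21),
])
print(plan)
print(matches({"age": 30, "first_name": "peter"}, plan))
```

Four predicates collapse into two scans, one per indexed column, which is the point of re-ordering and fusing before touching the index.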
  40. Q & A
  41. Some Benchmarks
  42. Hardware specs
      13 bare-metal machines
      •  6 CPU HT (12 vcores)
      •  64 GB RAM
      •  4 SSDs in RAID0 for a total of 1.5 TB
      Data set
      •  13 billion rows
      •  1 numerical index with 36 distinct values
      •  2 text indexes with 7 distinct values
      •  1 text index with 3 distinct values
  43. Benchmark results
  44. Benchmark results
  45. Benchmark results
  46. Benchmark results
  47. Benchmark results
      Full scan using server-side paging
      Predicate count | Fetched rows | Query time in sec
      1 | 36 109 986 | 609
      2 | 2 781 492 | 330
      3 | 1 044 547 | 372
      4 | 360 334 | 116
  48. Take Away
  49. Conclusion
      Is it available?
      •  yes in Cassandra 3.5
      Future enhancement?
      •  index on collections (List, Set & Map)!
      •  OR clause (WHERE (xxx OR yyy) AND zzz)
      •  != operator
  50. Conclusion
      SASI vs Solr/ElasticSearch?
      •  Cassandra is not a search engine!!! (database = durability)
      •  always slower because 2 passes (SASI index read + original Cassandra data)
      •  no scoring
      •  no ordering (ORDER BY)
      •  no grouping (GROUP BY) → Apache Spark for analytics
      Still, SASI covers 80% of search use-cases and people are happy!
  51. Thank You
      @doanduyhai
      duy_hai.doan@datastax.com
      https://academy.datastax.com/
