C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert

Speakers: Jonathan Halliday, Core Developer at JBoss & Rui Vieira, Postgrad Student at Newcastle University
Video: http://www.youtube.com/watch?v=SRejy08zM7Y&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=15
Performing ranking queries to find the most relevant documents, most popular URLs, etc. on huge datasets is trivial, if you're willing to wait a while for the answers. For those with less time to waste, this session describes techniques for performing such queries efficiently. We'll describe the ranking queries problem, outline the Cassandra CQL3 data structures and code that can be used to solve it and describe the trade-offs available. We describe Intravert, an innovative server-side programming solution for Cassandra, and show how it can be used to reduce network usage and improve performance by filtering data closer to the source.

  1. Top-k queries in real-time with Cassandra and Intravert
     Jonathan Halliday, JBoss, jonathan.halliday@redhat.com
     Rui Vieira, Newcastle University, r.vieira2@newcastle.ac.uk
     #CassandraEU
  2. What is Top-k?
  3. What is Top-k?
  4. Top-k queries
     • Rank matching results for the term(s)
       – We don't really care about the scoring algorithm
     • Application: text search
       – Documents containing the search words
     • Application: log analysis
       – Popular URLs in the time period
  5. yawn?
     • SELECT document_id, score
       FROM data
       WHERE term='top-k'
       ORDER BY score DESC, document_id
       LIMIT 100
     • Lunch time!
  6. Not so fast...
     • SELECT document_id, score
       FROM data
       WHERE term IN('top-k', 'algorithm')
       GROUP BY document_id
       ORDER BY score DESC, document_id
       LIMIT 100
  7. Distributed Top-k
     • We have a lot of data
     • It's spread out
     • We need to combine a subset efficiently
     • Map/Reduce to the rescue!
       – HiveQL, Stinger, Impala, Hawq
     • Easy! But not fast
  8. 'real-time'
     • Web pages, not control systems
     • Performance, not Timeliness
     • Pre-compute as much as possible
       – scores for each term
     • Assemble pre-computed fragments at query time
       – 'group by'
  9. Naive method
     foreach(term in searchTerms) {
       SELECT ... FROM ... WHERE ...
     }
     • Handle group by in the application code
     • Inefficient – transfers ALL the data for each term, even low scores
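The naive method on slide 9 can be sketched in a few lines. This is an illustration only: `POSTINGS` and `fetch_all` are hypothetical stand-ins for the per-term SELECT, the scores are made up, and summing scores per document is an assumption (the slides do not fix the scoring or combining function).

```python
import heapq
from collections import defaultdict

# Hypothetical per-term rows; a stand-in for what one SELECT per term returns.
POSTINGS = {
    "top-k": [("doc1", 90), ("doc2", 50), ("doc3", 10)],
    "algorithm": [("doc2", 80), ("doc3", 70), ("doc4", 20)],
}


def fetch_all(term):
    """Stand-in for one unbounded SELECT: returns every row for the term."""
    return POSTINGS.get(term, [])


def naive_top_k(search_terms, k):
    """Group by document in application code, then keep the k best totals."""
    totals = defaultdict(int)
    rows_transferred = 0
    for term in search_terms:
        for doc_id, score in fetch_all(term):
            totals[doc_id] += score  # assumed aggregate: sum of per-term scores
            rows_transferred += 1
    # Order by score DESC, then document_id, as in the slide's ORDER BY.
    top = heapq.nsmallest(k, totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return top, rows_transferred


result, transferred = naive_top_k(["top-k", "algorithm"], 2)
```

Every row of every term crosses the network, including the low scores in the long tail, even though only two documents are returned. That cost is what the later slides try to cut.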
  10. How much data is enough?
      • Data is stored keyed (i.e. sorted) by
        { term, score DESC, doc_id }
        or { time_period, score DESC, Url }
      • Partition keys IN the query params
        – We can filter efficiently
      • Can we range limit on score?
        – Avoid going into the long tail
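Because rows within a partition are kept in descending score order, a range limit on score amounts to reading from the front and stopping at the first row below the limit. A minimal sketch, assuming a hypothetical in-memory `PARTITIONS` map in place of the real table:

```python
from itertools import takewhile

# Hypothetical partition contents, already sorted by (score DESC, doc_id).
PARTITIONS = {
    "top-k": [(90, "doc1"), (50, "doc2"), (10, "doc3")],
}


def read_at_least(term, min_score):
    """Return the leading rows with score >= min_score for one partition.

    The scan stops at the first row below the limit, so the long tail of
    low scores is never touched.
    """
    rows = PARTITIONS.get(term, [])
    return list(takewhile(lambda row: row[0] >= min_score, rows))


head = read_at_least("top-k", 50)
```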
  11. Bring on the clever algorithms
      • Smart People thought about this problem already...
      • ...but not in quite the same context
        – WAN distributed logs from CDNs
      • Identify, adapt and reuse existing solutions
        – faster and less risky than starting over
  12. Inside a clever algorithm
      • Fetch a little bit of data
      • Look at it, decide how much more we need
      • Fetch some more
      • Rinse and repeat – but not too many times.
  13. Desirable Characteristics
      • Fixed number of communication rounds is key
      • Generality is good
        – Cope with any distribution of data
      • So is flexibility
        – Tune for different use cases
  14. Meet the candidates
      Three-Phase Uniform Threshold (TPUT)
        'Efficient Top-K Query Calculation in Distributed Networks', Stanford/Princeton, 2004
      Hybrid Threshold
        'Efficient Processing of Distributed Top-k Queries', UCSB, 2005
      KLEE
        'KLEE: a framework for distributed top-k query algorithms', Max-Planck Institute, 2005
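To make "fetch a little, decide, fetch some more" concrete, here is a rough sketch of the three-phase idea behind TPUT, run against in-memory data. It is not the speakers' implementation. It assumes non-negative scores that are summed across peers, and `PEERS`, `top_k_rows`, `rows_at_least` and `lookup` are hypothetical stand-ins for the three kinds of request sent to each peer.

```python
import heapq
from collections import defaultdict

# Hypothetical peers: one dict per term (or time period), doc_id -> score.
PEERS = [
    {"d1": 90, "d2": 50, "d3": 10, "d5": 5},
    {"d2": 80, "d3": 70, "d4": 20, "d1": 1},
    {"d3": 60, "d4": 55, "d1": 30, "d6": 2},
]


def top_k_rows(peer, k):
    """Round 1 request: the k highest-scoring rows of one peer."""
    return heapq.nlargest(k, peer.items(), key=lambda kv: kv[1])


def rows_at_least(peer, threshold):
    """Round 2 request: every row of one peer scoring >= threshold."""
    return [(d, s) for d, s in peer.items() if s >= threshold]


def lookup(peer, doc_ids):
    """Round 3 request: exact scores for named documents (0 if absent)."""
    return {d: peer.get(d, 0) for d in doc_ids}


def partial_sums(seen):
    sums = defaultdict(int)
    for rows in seen:
        for doc_id, score in rows.items():
            sums[doc_id] += score
    return sums


def kth_largest(values, k):
    ranked = sorted(values, reverse=True)
    return ranked[k - 1] if len(ranked) >= k else 0


def tput(peers, k):
    m = len(peers)
    # Phase 1: local top-k from each peer gives a lower bound tau1.
    seen = [dict(top_k_rows(p, k)) for p in peers]
    tau1 = kth_largest(partial_sums(seen).values(), k)
    threshold = tau1 / m  # the uniform threshold

    # Phase 2: fetch everything at or above the threshold, then prune.
    for rows, peer in zip(seen, peers):
        rows.update(rows_at_least(peer, threshold))
    sums = partial_sums(seen)
    tau2 = kth_largest(sums.values(), k)
    candidates = []
    for doc_id, total in sums.items():
        # Any score not yet seen at a peer must be below the threshold.
        unseen = sum(1 for rows in seen if doc_id not in rows)
        if total + unseen * threshold >= tau2:
            candidates.append(doc_id)

    # Phase 3: exact scores for the surviving candidates only.
    exact = defaultdict(int)
    for peer in peers:
        for doc_id, score in lookup(peer, candidates).items():
            exact[doc_id] += score
    return heapq.nsmallest(k, exact.items(), key=lambda kv: (-kv[1], kv[0]))


top2 = tput(PEERS, 2)
```

The number of rounds is fixed at three regardless of how the data is distributed, which is the characteristic slide 13 calls key. With plain CQL3 each round may need more than one query per peer; the sketch ignores that.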
  15. Implementation Issues
      • Algorithms assume server side code execution
      • Limitations of CQL3 add some round trips, increase network I/O
      • Previous performance comparisons of algorithms may no longer be valid
  16. Data Transfer vs. k
  17. Execution Time vs. k
  18. Execution Time vs. peers
  19. (no slide text)
  20. YMMV
      • Test with your own data
      • Test with your own hardware
      • Hybrid Threshold for exact top-k
        – Intravert optional
      • KLEE for tunable approximate top-k
        – Inefficient without Intravert
        – Requires metadata
  21. Intravert
      • Cassandra++
        – Embed and extend the existing server
        – Based on Vert.x
      • JSON over HTTP, REST API
        – yup, Virgil did that already
      • Multiple commands per call, chain operations with REFs
  22. Intravert
      • Server side code execution
        – Groovy (for now – Vert.x is polyglot)
      • Filter result sets
      • Write path triggers
        – C* 2.0 has CASSANDRA-1311
      • Run Groovy scripts on the server
        – Easier than extending the Thrift API
  23. Intravert
      • Good trade-off between power and operational complexity
      • More complex development cycle
        – Not easy to move code between client and server
      • Client not topology aware
        – 'run x on each node' not possible
  24. Back to the clever algorithms
      • Intravert server side execution enables cleaner, more efficient implementation
      • Reduces network round trips
      • Some dev and ops complexity increase
      • Less complexity than custom server deployment
        – Reuse existing tools
  25. Pre-aggregation
      • For text search, can't predict common term sets
      • For time periods, can predict contiguous periods
      • Pre-calculate the rollups
        – Hours, days, weeks, months
        – Reduces number of terms (peers) to group at query time
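A rollup for the log-analysis case might look like the following sketch. The `HOURLY` data, the bucket layout and the `roll_up_to_days` helper are all hypothetical, and hit counts are assumed to be additive across hours.

```python
from collections import Counter, defaultdict

# Hypothetical hourly buckets: (day, hour) -> {url: hits}.
HOURLY = {
    ("2013-10-17", 9): {"/a": 5, "/b": 2},
    ("2013-10-17", 10): {"/a": 1, "/c": 7},
    ("2013-10-18", 9): {"/b": 4},
}


def roll_up_to_days(hourly):
    """Pre-calculate one bucket per day by summing that day's hourly buckets."""
    daily = defaultdict(Counter)
    for (day, _hour), counts in hourly.items():
        daily[day].update(counts)  # Counter.update adds counts per key
    return daily


daily = roll_up_to_days(HOURLY)
busiest = daily["2013-10-17"].most_common(1)
```

A query for a full day then reads one pre-computed bucket instead of grouping up to 24 hourly peers at query time. The same step applies again from days to weeks or months.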
  26. Really clever algorithms
      • Hierarchical node topology
        – Map to Cassandra ring: same node may own multiple keys (peers != nodes)
      • Budget constrained approximate top-k
        – Get as close as possible with the allowable time and I/O constraints
      • Fault tolerance
        – Approximation given available nodes
  27. Questions?
      Or email us:
      Jonathan Halliday, JBoss, jonathan.halliday@redhat.com
      Rui Vieira, Newcastle University, r.vieira2@newcastle.ac.uk
