Efficient top-k query processing in column-family distributed databases


  1. Efficient top-k query processing on distributed column family databases
     Rui Vieira, MSc ITEC, 14/08/13
  2. Ranking (top-k) queries
     We use top-k queries every day:
     ● Search engines (top 100 pages for certain words)
     ● Analytics applications (most visited pages per day)
     (Slide shows examples of text search and time-period queries.)
  3. Ranking (top-k) queries: Definition
     Find the k objects with the highest aggregated score over a function f
     (f is usually a summation function over attributes).
     Example: find the top 10 students with the highest grades over all modules.

       Student    Module 1   Module 2   ...   Module n
       John       39%        89%              82%
       Emma       48%        88%              78%
       Brian      50%        70%              90%
       Steve      75%        65%              85%
       Anna       50%        60%              83%
       Peter      59%        59%              81%
       Paul       80%        50%              70%
       Mary       89%        49%              59%
       Richard    91%        31%              51%
  4. Motivation: real-time distributed top-k queries
     Why real-time top-k queries?
     ● To be integrated into a larger real-time analytics platform
     ● "User" real-time = a few hundred milliseconds to one second
     ● Implement solutions that make efficient use of memory, bandwidth and computation
     ● Can handle massive amounts of data
     Use case: we are logging page views for a website. Can we find the top 10 most
     visited pages in the last 7 days? What about the last 10 months? All under 1 second?
  5. Top-k queries: simplistic solution
     The "naive" method provides ranking-query answers, but not in real time:
     ● Fetch all objects and scores from all sources
     ● Aggregate them in memory
     ● Sort all aggregations
     ● Select the top-k highest scoring
     (Diagram: the Query Coordinator pulls every <object, score> pair, e.g.
     <O1, 1000>, <O89, 900>, ..., <O99, 1>, from peers 1..n, then merges all data,
     aggregates scores, sorts the aggregations and selects the k highest.)
     Not feasible:
     ● For large amounts of data (it possibly doesn't fit in RAM)
     ● Execution time is most likely not real-time
     ● Not efficient: low-scoring objects are processed
     ● Due to all of the above: not scalable
     (A minimal sketch of this method follows.)
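     As a point of reference, a minimal sketch of the naive method in plain Java,
     with peers modelled as in-memory maps (all names here are illustrative, not the
     thesis API):

       import java.util.*;
       import java.util.stream.Collectors;

       // Naive top-k: pull everything from every peer, aggregate, sort, take k.
       // Fine for small data; exactly the approach the slide rules out at scale.
       public class NaiveTopK {
           static List<String> topK(List<Map<String, Long>> peers, int k) {
               Map<String, Long> sums = new HashMap<>();
               for (Map<String, Long> peer : peers) {            // fetch all objects/scores
                   peer.forEach((id, score) -> sums.merge(id, score, Long::sum)); // aggregate
               }
               return sums.entrySet().stream()                   // sort all aggregations
                          .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                          .limit(k)                              // select the k highest
                          .map(Map.Entry::getKey)
                          .collect(Collectors.toList());
           }
       }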
  6. Top-k queries: batch solutions
     Batch operations (Hadoop / MapReduce)
     Pros:
     ● Proven solution for (some) top-k scenarios
     ● Excellent for "report" style use cases
     Cons:
     ● Still has to process all the information
     ● Not real-time
  7. Our requirements
     ● Work with "peers" which are distributed logically (rows) as well as physically (nodes)
     ● Nodes in the cluster can execute only a (very) limited set of instructions
     ● Low latency (a fixed number of round-trips)
     ● Offer considerable savings in bandwidth and execution time
     ● Possible to adapt to Cassandra's data access patterns and models
  8. Algorithms
  9. Algorithms: related work
     The threshold family of algorithms was pioneered by Fagin et al.
     Objective: determine a threshold below which an object cannot be a top-k object.
     The initial Threshold Algorithm (TA), however:
     ● Was not designed with distributed data sources in mind
     ● Has performance highly dependent on data shape (skewness, correlation, ...)
     ● Makes unbounded round-trips to the data source → unbounded latency
     ● Keeps performing random accesses until it reaches a stopping point
     (A compact sketch of TA's stopping rule follows.)
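     To make the stopping rule concrete, a compact, illustrative sketch of TA,
     assuming f = summation, equal-length in-memory score lists sorted descending,
     and an index allowing random access (none of this is the thesis code):

       import java.util.*;
       import java.util.stream.Collectors;

       // TA sketch: sorted access in lock step on every list, random access for each
       // newly seen object, stop once k aggregated scores reach the threshold
       // (f applied to the last scores seen per list). Depth is unbounded a priori.
       public class TASketch {
           static List<String> topK(List<List<Map.Entry<String, Double>>> sortedLists,
                                    Map<String, double[]> index, int k) {
               Map<String, Double> seen = new HashMap<>();
               for (int depth = 0; depth < sortedLists.get(0).size(); depth++) {
                   double threshold = 0.0;
                   for (List<Map.Entry<String, Double>> list : sortedLists) {
                       Map.Entry<String, Double> e = list.get(depth);
                       threshold += e.getValue();               // f over last seen scores
                       seen.computeIfAbsent(e.getKey(),         // random access: full sum
                               o -> Arrays.stream(index.get(o)).sum());
                   }
                   List<String> best = seen.entrySet().stream()
                           .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                           .limit(k).map(Map.Entry::getKey).collect(Collectors.toList());
                   // Stopping point: the kth best aggregate matches the threshold, so no
                   // unseen object (whose score is at most the threshold) can displace it.
                   if (best.size() == k && seen.get(best.get(k - 1)) >= threshold) {
                       return best;
                   }
               }
               return Collections.emptyList();
           }
       }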
  10. Algorithms: related work
      Three algorithms were selected:
      ● Three-Phase Uniform Threshold (TPUT): a distributed, fixed round-trip, exact algorithm
      ● Hybrid Threshold: a distributed, fixed round-trip, exact algorithm
      ● KLEE: a distributed, fixed round-trip, approximate algorithm
      However, these algorithms were developed for P2P networks; as far as we know, they
      had never previously been implemented on distributed column-family databases.
  11. Algorithms: TPUT
      Phase 1: request the local top-k from each of the m peers; compute a partial sum
      for each object seen (missing scores = 0); select the kth highest partial sum as
      min-k.
      Phase 2: request from every peer all objects with score ≥ min-k / m; recompute the
      partial sums, giving each object a worst-score (missing scores = 0) and a
      best-score (missing scores = min-k / m); select the kth highest worst-score as the
      threshold; objects whose best-score is at least the threshold become candidates.
      Phase 3: request the candidates from all peers; compute the final partial sums;
      the k highest are the top-k.
      (A single-threaded sketch of the three phases follows.)
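      A minimal, single-threaded sketch of the three phases over in-memory peers
      (maps from object id to score); it illustrates the control flow only and is not
      the thesis implementation:

        import java.util.*;
        import java.util.stream.Collectors;

        public class TputSketch {
            static List<String> topK(List<Map<String, Long>> peers, int k) {
                int m = peers.size();

                // Phase 1: local top-k per peer, partial sums with missing scores = 0.
                Map<String, Long> partial = new HashMap<>();
                for (Map<String, Long> peer : peers) {
                    peer.entrySet().stream()
                        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                        .limit(k)
                        .forEach(e -> partial.merge(e.getKey(), e.getValue(), Long::sum));
                }
                long minK = kthHighest(partial.values(), k);

                // Phase 2: fetch everything scoring at least minK / m, recompute bounds.
                long cutoff = minK / m;
                Map<String, Long> worst = new HashMap<>();
                Map<String, Integer> seenAt = new HashMap<>();
                for (Map<String, Long> peer : peers) {
                    peer.forEach((o, s) -> {
                        if (s >= cutoff) {
                            worst.merge(o, s, Long::sum);          // worst: missing = 0
                            seenAt.merge(o, 1, Integer::sum);
                        }
                    });
                }
                long tau2 = kthHighest(worst.values(), k);
                // best-score = worst-score + cutoff per peer that did not report the object
                Set<String> candidates = worst.keySet().stream()
                    .filter(o -> worst.get(o) + (long) (m - seenAt.get(o)) * cutoff >= tau2)
                    .collect(Collectors.toSet());

                // Phase 3: fetch true scores for the candidates only, then rank.
                Map<String, Long> exact = new HashMap<>();
                for (Map<String, Long> peer : peers)
                    for (String o : candidates)
                        exact.merge(o, peer.getOrDefault(o, 0L), Long::sum);
                return exact.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(k).map(Map.Entry::getKey).collect(Collectors.toList());
            }

            static long kthHighest(Collection<Long> values, int k) {
                return values.stream().sorted(Comparator.reverseOrder())
                             .skip(k - 1L).findFirst().orElse(0L);
            }
        }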
  12. Algorithms: Hybrid Threshold
      Phase 1: same as in TPUT, i.e. the objective is to determine the first threshold
      T = min-k / m.
      Phase 2: send the candidates so far and T to each peer; each peer determines its
      lowest-scoring candidate S_lowest, sets T_i = max(S_lowest, T), and returns its
      objects with score ≥ T_i.
      Phase 3: recompute the partial sums and select the kth score as τ2; for every
      peer where T_i < τ2 / m, fetch the objects with score > τ2 / m; recompute the
      partial sums, select the kth score as τ3, and keep as candidates the objects
      with partial sum > τ3.
  13. Algorithms: KLEE
      ● A TPUT variant
      ● Trade-off between accuracy and bandwidth
      ● Relies on summary data (statistical meta-data) to better estimate min-k
        without going "deep" into the index lists
      Fundamental data structures for the meta-data:
      ● Histograms
      ● Bloom filters
  14. Algorithms: KLEE (histograms)
      ● Equi-width cells, with a configurable number of cells
      ● Each cell n stores:
        ● the highest score in n (ub)
        ● the lowest score in n (lb)
        ● the average score for n (avg)
        ● the number of objects in n (freq)
      Example, cell #10 (covers scores from 900-1000):
      ub = 989, lb = 901, avg = 937.4, freq = 200
  15. Algorithms: KLEE (Bloom filters)
      (Diagram: a bit array of m positions; inserting object O sets the bits at
      h1(O), h2(O), ..., hn(O); probing P finds a 0 bit at one of h1(P), ..., hn(P),
      therefore P ∉ S.)
      ● A bit set into which objects are hashed
      ● Allows for very fast membership queries
      ● Space-efficient data structure
      ● However, not invertible → the member objects cannot be recovered from the
        Bloom filter alone
      (An illustrative membership example follows.)
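      For illustration, a membership query with Guava's BloomFilter; the choice of
      Guava is an assumption, since the slides do not name the implementation used:

        import com.google.common.hash.BloomFilter;
        import com.google.common.hash.Funnels;
        import java.nio.charset.StandardCharsets;

        // Bloom-filter membership: false positives are possible, false negatives are not.
        public class BloomSketch {
            public static void main(String[] args) {
                BloomFilter<String> filter = BloomFilter.create(
                        Funnels.stringFunnel(StandardCharsets.UTF_8),
                        100_000,   // expected insertions (objects in the histogram cell)
                        0.01);     // configurable maximum false-positive ratio
                filter.put("O1");
                filter.put("O89");
                System.out.println(filter.mightContain("O1"));   // true
                System.out.println(filter.mightContain("O99"));  // false (or a false positive)
            }
        }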
  16. Algorithms: KLEE
      Consists of 4 or (optionally) 3 steps:
      1 - Exploration step: approximate a min-k threshold based on the statistical meta-data
      2 - Optimisation step: decide whether to execute step 3 or go directly to step 4
      3 - Candidate filtering: filter the high-scoring candidates
      4 - Candidate retrieval: fetch the candidates from the peers
  17. Algorithms: KLEE (Phase 1)
      From each peer, fetch the top-k objects, the c "top" histogram cells with their
      Bloom filters, and the freq and avg of the c "low" cells.
      For each object seen so far, per peer: if the object was in that peer's top-k,
      use its true score; otherwise, if the object is in a cell's Bloom filter, use
      the corresponding avg value; if not, use the weighted average of the low cells.
      Compute the partial sums, select the kth score as min-k, and take as candidates
      the objects with score > min-k / m.
      (A sketch of the per-peer score estimation follows.)
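      A hedged sketch of the phase-1 score estimation for one peer; the types and
      names here are assumptions, not the thesis API:

        import com.google.common.hash.BloomFilter;

        // For an object missing from a peer's top-k, estimate the score it might
        // contribute at that peer from the histogram meta-data alone.
        public class EstimationSketch {
            static final class HistogramCell {        // mirrors the slide's avg + filter
                final double avg;
                final BloomFilter<String> filter;
                HistogramCell(double avg, BloomFilter<String> filter) {
                    this.avg = avg;
                    this.filter = filter;
                }
            }

            static double estimateScore(String objectId,
                                        Iterable<HistogramCell> topCells,
                                        double lowCellsWeightedAvg) {
                for (HistogramCell cell : topCells) {
                    if (cell.filter.mightContain(objectId)) {
                        return cell.avg;              // object likely sits in this cell
                    }
                }
                return lowCellsWeightedAvg;           // assume it sits deep in the list
            }
        }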
  18. Algorithms: KLEE (Phase 3)
      ● Request a bit set of all objects scoring higher than min-k / m
      ● Perform a statistical pruning, leaving only the most "common" objects
      (Note: this step was not implemented due to the computational limitations of
      Cassandra nodes.)
  19. Algorithms: KLEE (Phase 4)
      ● Request all the candidates from the peers
      ● Perform a partial sum with the true scores of the objects
      ● Select the k highest as our top-k
  20. Cassandra
  21. Cassandra (architecture overview)
      ● Fully decentralised column-family store
      ● High (almost linear) scalability
      ● No single point of failure (no "master" or "slave" nodes)
      ● Automatic replication
      ● Clients can read and write to any node in the cluster
      ● Cassandra takes over partitioning and replication duties automatically
  22. Cassandra (architecture overview)
      ● Automatic partitioning of data (random partitioning is commonly used)
      ● Rows are distributed across nodes by an MD5 hash of the partition key (the 1st PK)
      (Diagram: in table foo, rows keyed "2013-08-14" to "2013-08-16", each holding
      id/score columns, are hashed onto nodes A-D.)
  23. Cassandra (data model)
      ● Columns are ordered upon insertion (ordered by PKs)
      ● Columns in the same row are physically co-located
      ● Range searches are fast: score < 10000 is simply a linear seek on disk
      (Diagram: table_forward stores the same id/score columns with id (ascending) as
      the comparator; table_reverse uses score (ascending) as the comparator.)
  24. Cassandra (CQL)
      The data manipulation language for Cassandra is CQL, similar in syntax to SQL:

      INSERT INTO table (foo, bar) VALUES (42, 'Meaning')
      SELECT foo, bar FROM table WHERE foo = 42

      Limitations:
      ● No joins, unions or sub-selects
      ● No aggregation functions (min, max, etc.)
      ● Inequality searches are bound to the primary key declaration order (next slide)
  25. Cassandra (CQL)
      Consider the following table:

      CREATE TABLE visits(
        date timestamp,
        user_id bigint,
        hits bigint,
        PRIMARY KEY (date, user_id))

      Although the following would be valid SQL queries, they are not valid CQL:

      SELECT * FROM visits WHERE hits > 1000
      SELECT * FROM visits WHERE user_id > 900 AND hits = 0

      Inequality queries are restricted to PKs and return contiguous columns, such as:

      SELECT * FROM visits WHERE date = 1368438171000 AND user_id > 1000
  26. Implementation
  27. Implementation (overview)
      (Diagram: inside the JVM, a Query Coordinator running KLEE / HT / TPUT talks to
      peers 1..n through a Peer interface; the driver issues asynchronous calls, with
      callbacks, against Cassandra nodes A-D.)
  28. Implementation: challenges
      Implement forward and reverse tables to allow lookups by id and by score:
      ● Space is cheap
      ● Space is even cheaper as Cassandra uses built-in data compression
      ● Space is cheaper still as denormalised data usually compresses better than
        normalised data
      ● Advantage: score columns are pre-ordered at the row level
      (Diagram: table_forward with comparator id (ascending); table_reverse with
      comparator score (ascending). An illustrative schema sketch follows.)
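      As a hedged sketch of what such a table pair could look like, using the
      DataStax Java driver (column names follow the slides; the thesis's exact DDL
      and keyspace name "topk" are assumptions):

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;

        public class SchemaSketch {
            public static void main(String[] args) {
                try (Cluster cluster = Cluster.builder()
                                              .addContactPoint("127.0.0.1").build();
                     Session session = cluster.connect("topk")) {
                    // Forward table: look up a known object's score within a peer (row).
                    session.execute(
                        "CREATE TABLE IF NOT EXISTS table_forward (" +
                        "  date text, id text, score bigint," +
                        "  PRIMARY KEY (date, id))");
                    // Reverse table: columns clustered by score, enabling range scans
                    // such as "all objects in this row with score >= T".
                    session.execute(
                        "CREATE TABLE IF NOT EXISTS table_reverse (" +
                        "  date text, score bigint, id text," +
                        "  PRIMARY KEY (date, score, id)" +
                        ") WITH CLUSTERING ORDER BY (score DESC)");
                }
            }
        }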
  29. Implementation: challenges
      Map algorithmic steps to CQL logic by decomposing tasks.
      ● Single step in the algorithm (a node can execute arbitrary code): the Query
        Coordinator sends T; the peer determines its local lowest-scoring candidate
        S_lowest and directly returns the list of objects with score > max(T, S_lowest).
      ● Multiple steps in this implementation (we can only communicate with a node via
        CQL): the Query Coordinator sends T; the peer returns its local lowest-scoring
        candidate S_lowest; the coordinator computes T_i = max(T, S_lowest) and issues
        a second request to fetch the objects with score > T_i.
      (A sketch of the decomposed exchange follows.)
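      A hedged sketch of the decomposed exchange; Peer, Pair, ResultList and
      getAboveAsync are modelled loosely on the driver code shown later, while
      getScores and getScore are hypothetical helpers introduced only for illustration:

        // Two CQL round-trips standing in for one algorithmic step.
        ListenableFuture<ResultList> decomposedStep(final Peer peer, final long t,
                                                    final Set<String> candidates) {
            // Round-trip 1: read the candidates' scores to learn the peer's S_lowest.
            final long sLowest = peer.getScores(candidates).stream()  // hypothetical helper
                                     .mapToLong(Pair::getScore)       // hypothetical accessor
                                     .min()
                                     .orElse(t);
            final long ti = Math.max(t, sLowest);
            // Round-trip 2: range scan on the reverse table for objects with score > T_i.
            return peer.getAboveAsync(ti);
        }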
  30. Implementation: TPUT (phase 1)
      ● The Query Coordinator (QC) asks for the top-k list from each peer 1..m by
        invoking the Peer async methods
      ● The QC stores the set of all distinct objects received in a concurrency-safe
        collection
      ● The QC calculates a partial sum for each object using a thread-safe Accumulator
        data structure:

        S_psum(O) = S'_peer1(O) + ... + S'_peerm(O),
        where S'_i(O) = S_i(O) if O has been returned by node i, and 0 otherwise.

      Let's assume the partial sums are:
      [O89, 1590], [O73, 1590], [O1, 1000], [O21, 990], [O12, 880], [O51, 780], [O801, 680]
      Calculate the first threshold: T = τ1 / m
      (Diagram: the QC fetches the top-k from each peer's inverse table, e.g. peer 1
      holds (1000, O1), (900, O89), (800, O73), ...; peer 2 holds (990, O21),
      (790, O73), ...; peer n holds (880, O12), (780, O51), ...)
      (An Accumulator sketch follows.)
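      A minimal thread-safe Accumulator could look like this; it is an illustrative
      stand-in, and the thesis class may differ:

        import java.util.Comparator;
        import java.util.concurrent.ConcurrentHashMap;

        // Peers' callbacks can call add() concurrently; getKthValue() derives thresholds.
        public class Accumulator {
            private final ConcurrentHashMap<String, Long> sums = new ConcurrentHashMap<>();

            // Called once per (object, score) pair received from a peer.
            public void add(String objectId, long score) {
                sums.merge(objectId, score, Long::sum);
            }

            // kth highest partial sum (k is 1-based); 0 if fewer than k objects seen.
            public long getKthValue(int k) {
                return sums.values().stream()
                           .sorted(Comparator.reverseOrder())
                           .skip(k - 1L)
                           .findFirst()
                           .orElse(0L);
            }
        }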
  31. Implementation: TPUT (phase 2)
      ● The QC issues a request for all objects with a score > T from the inverse table
        (peer.getAbove(T))
      ● With the received objects, it recalculates the partial sums
        (for each Pair → accumulator.add(pair))
      ● It designates the kth partial sum as τ2 (accumulator.getKthValue(k))
      (Diagram: the QC fetches all entries with score > T from each peer's inverse table.)
  32. Implementation: TPUT (phase 3)
      ● Fetch the final candidates from the forward table (calling the async Peer methods)
      ● Aggregate the scores and nominate the k highest scoring as the top-k
      (Diagram: the QC fetches the final candidates, e.g. (O1, 1000), (O89, 900), ...,
      from each peer's forward table.)
  33. Implementation: challenges
      Sequential vs. random lookups: all the algorithms at some point require random
      access, and random access is much slower than sequential access.
      (Diagram: a sequential scan over the inverse table vs. "random" per-id lookups
      against the forward table.)

        Lookup       # objects   Time (ms)   95% CI (ms)
        Sequential   240         1.70        ±0.27
        Random       240         115.16      ±1.32

      Sample size n = 100
  34. Implementation: KLEE challenges
      Sequential vs. random lookups: as a consequence of expensive random lookups, a
      modified KLEE3 variant, KLEE3-M, was implemented.
      KLEE3-M: in the final phase, instead of filtering out the candidates with
      score < min-k / m via random lookups, do a sequential range scan per peer for
      the objects with score ≥ min-k / m.
      Trade-off: more data transfer in exchange for less execution time.
  35. Implementation: KLEE challenges
      Mapping data structures to Cassandra's data model:

      CREATE TABLE table_metadata(
        peer text,
        cell int,
        lb double,
        ub double,
        freq bigint,
        avg double,
        binmax double,
        binmin double,
        filter blob,
        PRIMARY KEY (peer, cell)
      ) WITH CLUSTERING ORDER BY (cell DESC)

      Serialised filter = 0x0000000600000002020100f0084263884418154205141c11
  36. Implementation: KLEE challenges
      Mapping data structures to Cassandra's data model: for each peer i, the
      Histogram Creator fetches the entire row, determines the maximum score and
      creates n equi-width bins, partitions the objects into bins (adding each object
      to that bin's Bloom filter), then serialises the Bloom filters and saves the
      row (cells 0..n, each with its freq, avg and filter).
      Flexible:
      ● Configurable number of bins
      ● Configurable maximum false-positive ratio for the filters
      (A sketch of the binning step follows.)
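      A hedged sketch of the binning step over an in-memory row, again borrowing
      Guava's BloomFilter (an assumption; writing the cells back to the metadata
      table is only indicated in a comment):

        import com.google.common.hash.BloomFilter;
        import com.google.common.hash.Funnels;
        import java.nio.charset.StandardCharsets;
        import java.util.*;

        public class HistogramBuilder {
            static void build(Map<String, Long> row, int numCells, double fpp) {
                long max = Collections.max(row.values());
                double width = (double) max / numCells;  // equi-width bins over [0, max]
                long[] freq = new long[numCells];
                double[] sum = new double[numCells];
                List<BloomFilter<String>> filters = new ArrayList<>();
                for (int i = 0; i < numCells; i++) {
                    filters.add(BloomFilter.create(
                        Funnels.stringFunnel(StandardCharsets.UTF_8), row.size(), fpp));
                }
                row.forEach((id, score) -> {
                    int cell = (int) Math.min(score / width, numCells - 1.0);
                    freq[cell]++;
                    sum[cell] += score;
                    filters.get(cell).put(id);           // membership for phases 1 and 3
                });
                for (int i = 0; i < numCells; i++) {
                    double avg = freq[i] == 0 ? 0 : sum[i] / freq[i];
                    // Each cell (freq, avg, serialised filter) would be written to the
                    // metadata table here, e.g. via the driver's session.execute(...).
                    System.out.printf("cell=%d freq=%d avg=%.2f%n", i, freq[i], avg);
                }
            }
        }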
  37. Implementation: KLEE
      (Diagram of the exchanges between the Query Coordinator and the peers:
      getTopKAsync against the inverse table; getFullHistAsync / getPartialHistAsync
      against the metadata table (cells with freq, avg, filter); a min-k estimate
      computed from the histogram and Bloom-filter results; and finally
      getObjectsAsync against the forward table, aggregating the candidates that
      score above min-k.)
  38. Implementation: KLEE challenges
      Simple API for histogram/Bloom table creation:

      final HistogramCreator hc = new CassandraHistogramCreator(tableDefinition);
      // Optionally a max false positive ratio can be defined
      hc.createHistogramTableSchema();
      hc.createHistogramTable("1998-05-01", ..., "1998-07-26");
  39. Implementation: KLEE challenges
      Fast generation:
      ● Feasible for "on-the-fly" jobs
      ● Roughly linear, with an execution time of 56 ms per peer for 100,000 elements
  40. Implementation: asynchronous communication
      ● The driver used allows for asynchronous communication
      ● Extensive use of ListenableFuture
      ● Allows for highly concurrent access with a smaller thread pool
      ● Allows asynchronous transformations (e.g. ResultSet to POJO)

      public ListenableFuture<ResultList> getAboveAsync(final long value) {
          final ResultSetFuture above = session.executeAsync(statement.bind(value));
          // Transform the raw ResultSet into (object, score) pairs off the calling thread.
          final Function<ResultSet, ResultList> transformResults =
                  new Function<ResultSet, ResultList>() {
              @Override
              public ResultList apply(ResultSet rs) {
                  final ResultList resultList = new ResultList();
                  for (final Row row : rs.all()) {
                      resultList.add(Pair.create(row.getBytes(object.getName()),
                                                 row.getLong(score.getName())));
                  }
                  return resultList;
              }
          };
          return Futures.transform(above, transformResults, executor);
      }
  41. Implementation: API
      JSON declaration of tables and columns:

      {
        "wc98_ids": {
          "name": "wc98_ids",
          "inverse": "wc98_ids_inverse",
          "metadata": "wc98_ids_metadata",
          "score": { "name": "visits", "type": "bigint" },
          "id":    { "name": "id",     "type": "text" },
          "peer":  { "name": "date",   "type": "text" }
        }
      }

      final QueryCoordinator coordinator = QueryCoordinator.create(KLEE.class, tableDefinition);
      coordinator.setKeys("1998-05-01", ..., "1998-07-26");
      final List<Pair> topK = coordinator.getTopK(10);
  42. Datasets (test data)
  43. Datasets: synthetic (Zipf)
      Used in the literature as a good approximation of "real-world" data.
  44. Datasets: 1998 World Cup data
      ● Data in Common Log Format (CLF) from the 1998 World Cup web servers
      ● IP addresses replaced by unique anonymous ids
      ● Widely used in the literature as "real-world" test data
      ● Around 1.4 billion entries (approximately 2 million unique visitors)
      ● Ranges from 1st of May to 26th of July 1998
      ● Highly skewed data
  45. Results
  46. Results: varying k (chart)
  47. Results: varying the number of peers (chart)
  48. Results: datasets (1998 World Cup data)
      Query: give me the top 20 visitors from 1st June to 18th June.

        Algorithm          Data (KB)   Execution time (ms)   95% CI (ms)   Precision (%)
        KLEE3              80          319.95                ±8.58         100
        KLEE3-M            1271        84.75                 ±6.5          100
        Hybrid Threshold   14,306      1921.9                ±65.28        100
        TPUT               44          141.5                 ±7.36         100
        Naive (baseline)   43,572      8514.6                ±61.38        100

      Data for 18 peers = daily rows from 1st June 1998 to 18th June 1998; sample size n = 20.
  49. Implementation: pre-aggregation
      Mix and match keys for aggregation results: pre-aggregated rows (e.g. a monthly
      row "2013-08" alongside daily rows "2013-08-01", "2013-08-02") can be combined
      in a single query.

      coordinator.setKeys("1998-05", "1998-06", "1998-07-01", "1998-07-02");
      final List<Pair> topK = coordinator.getTopK(10);

      The top-k results are the same, but computed over 4 peers instead of 63.
  50. Results: pre-aggregation

        Algorithm   Data transfer (KB)              Execution time (ms)
                    full    aggregated   savings    full     aggregated   savings
        KLEE        20756   633          97%        2412.2   44.3         98%
        HT          14404   5894         59%        4842.6   818.6        83%
        TPUT        2215    61           97%        1657.1   162.2        90%
  51. Conclusions
  52. Conclusions
      ● TPUT and HT are well suited for real-time top-k queries with minimal
        structural changes to the infrastructure:
        ● Savings of 98% (TPUT) and 77% (HT) in execution time with no loss of precision
        ● Savings of 99.9% (TPUT) and 67% (HT) in data transfer, also with no loss of
          precision
      ● KLEE3 requires additional changes to the infrastructure, but:
        ● The meta-data is efficient to create
        ● The final fetch phase can be discarded for approximate results, with a
          configurable trade-off between precision and data transfer / execution time
        ● Savings of 99% in execution time and 97% in data transfer
  53. Conclusions
      ● Scalability can be addressed with good planning of the data models together
        with pre-aggregation
      ● KLEE3 is more resilient to low object correlation (the common case in
        real-world data)
      ● TPUT and KLEE3 are resilient to high variations of k, which could have further
        practical implications
  54. Future work: implementing KLEE4
      ● Intravert [1] is an application server built on top of a Cassandra node
      ● Based on the vert.x application framework
      ● Communication is done either in a RESTful way or directly with a Java client
      ● Allows passing code (in several JVM languages such as Groovy, Clojure, etc.)
        which is executed at the "server side"
      ● Acting as middleware, it makes it possible to implement processing (such as
        the candidate hash set) remotely and return the result to our client
      ● TPUT and HT are already implemented using Intravert; KLEE4 is in progress

      [1] https://github.com/zznate/intravert-ug
  55. Acknowledgements
      Jonathan Halliday (Red Hat), for technical expertise, supervision and support.
  56. Questions?
