Efficient top-k query processing on distributed column family databases
14/08/13 Rui Vieira, MSc ITEC
Ranking (top-k) queries
We use top-k queries every day
● Search engines (top 100 pages for certain words)
● Analytics applications (most visited pages per day)
(Figure: examples of a text search and of time-period analytics.)
Ranking (top-k) queries
Definition
Find the k objects with the highest aggregated score over a function f
(f is usually a summation over attributes)
Example:
Find the top 10 students with the highest grades over all modules.

Student   Module 1  Module 2  ...  Module n
John      39%       89%       ...  82%
Emma      48%       88%       ...  78%
Brian     50%       70%       ...  90%
Steve     75%       65%       ...  85%
Anna      50%       60%       ...  83%
Peter     59%       59%       ...  81%
Paul      80%       50%       ...  70%
Mary      89%       49%       ...  59%
Richard   91%       31%       ...  51%
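The definition above can be sketched as code: aggregate each student's scores across modules and keep the k highest totals (a minimal sketch with toy data; the class and method names are made up for illustration, not from the thesis).

```java
import java.util.*;
import java.util.stream.*;

public class TopK {
    // Aggregate scores per object across all sources, then pick the k highest sums.
    public static List<Map.Entry<String, Integer>> topK(List<Map<String, Integer>> modules, int k) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> module : modules) {
            module.forEach((student, score) -> totals.merge(student, score, Integer::sum));
        }
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Note this is exactly the "naive" method discussed later: every object is aggregated, even hopeless low scorers.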
Motivation: real-time distributed top-k queries
Why real-time top-k queries?
• To be integrated in a larger real-time analytics platform
• "User" real-time = a few hundred milliseconds to one second
• Implement solutions that make efficient use of
memory, bandwidth and computation
• Can handle massive amounts of data
Use case:
We are logging page views for a website. Can we find the 10 most
visited pages in the last 7 days? What about 10 months? All under 1 second?
Top-k queries: simplistic solution
“Naive” method
• Fetch all objects and scores from all sources
• Aggregate them in memory
• Sort all aggregations
• Select top-k highest scoring
These steps answer ranking queries, but not in real time:
(Diagram: each peer 1 … n returns ⟨object, score⟩ pairs, e.g. ⟨O1, 1000⟩,
⟨O89, 900⟩, …, ⟨O99, 1⟩; the Query Coordinator merges all data, aggregates
the scores, sorts all aggregates and selects the k highest.)
Not feasible:
• For large amounts of data
• Possibly doesn't fit in RAM
• Execution time most likely not real-time
• Not efficient: low-scoring objects processed
• Due to all of the above: not scalable
Top-k queries: Batch solutions
Batch operations (Hadoop / Map-Reduce)
Pros
• Proven solution to (some) top-k scenarios
• Excellent for “report” style use cases
Cons
• Still has to process all the information
• Not real-time
Our requirements
● Work with “Peers” which are distributed logically (rows)
as well as physically (nodes)
● Nodes in the cluster have (very) limited instructions
● Low latency (fixed number of round-trips)
● Offer considerable savings in bandwidth and execution time
● Possible to adapt to data access patterns and models in Cassandra
Algorithms
Algorithms: related Work
Threshold family of algorithms pioneered by Fagin et al.
Objective: determine a threshold below which an object cannot be
a top-k object
Initial Threshold Algorithms (TA) however:
• Not designed with distributed data sources in mind
• Performance highly dependent on data shape (skewness, correlation ...)
• Unbounded round-trips to data source → unbounded latency
• TA keeps performing random accesses until it reaches a
stopping point
Algorithms: Related Work
Three algorithms were selected:
• Three-Phase Uniform Threshold (TPUT)
• Distributed fixed round-trip exact algorithm
• Hybrid Threshold
• Distributed fixed round-trip exact algorithm
• KLEE
• Distributed fixed round-trip approximate algorithm
• However, these algorithms were developed for P2P networks
• As far as we know, they had never previously been implemented on
distributed column-family databases
Algorithms: TPUT
Phase 1
Request the top-k list from each peer (peer 1 … peer m).
Calculate a partial sum per object (missing scores = 0) and select the kth
score as min-k.
Phase 2
Request from every peer all objects with score ⩾ min-k / m and re-calculate
the partial sums. For each object, the worst-score assumes missing
scores = 0 and the best-score assumes missing scores = min-k / m; select the
kth worst-score as the threshold. Every object with best-score > threshold
is a candidate.
Phase 3
Request the candidates from all peers, compute the final partial sums, and
the k highest are the top-k.
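The three phases can be sketched with in-memory peers (a simplification added for illustration: each peer is a plain map of object → score, so the Cassandra round-trips become method calls; class and method names are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

public class Tput {
    // Each "peer" holds object -> score; helper returns a peer's local top-k entries.
    static Map<String, Double> topKOf(Map<String, Double> peer, int k) {
        return peer.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static List<String> topK(List<Map<String, Double>> peers, int k) {
        int m = peers.size();
        // Phase 1: fetch each peer's local top-k and build partial sums (missing = 0).
        Map<String, Double> partial = new HashMap<>();
        for (Map<String, Double> peer : peers)
            topKOf(peer, k).forEach((o, s) -> partial.merge(o, s, Double::sum));
        double t = kthValue(partial, k) / m; // min-k / m
        // Phase 2: fetch all objects scoring >= min-k / m and recompute partial sums.
        partial.clear();
        for (Map<String, Double> peer : peers)
            peer.forEach((o, s) -> { if (s >= t) partial.merge(o, s, Double::sum); });
        double tau2 = kthValue(partial, k); // k-th worst-score = candidate threshold
        // best-score adds min-k / m for every peer that did not report the object.
        Set<String> candidates = new HashSet<>();
        for (String o : partial.keySet()) {
            long reporting = peers.stream().filter(p -> p.getOrDefault(o, 0.0) >= t).count();
            double best = partial.get(o) + (m - reporting) * t;
            if (best >= tau2) candidates.add(o);
        }
        // Phase 3: fetch candidates' exact scores everywhere; k highest sums win.
        Map<String, Double> exact = new HashMap<>();
        for (String o : candidates)
            exact.put(o, peers.stream().mapToDouble(p -> p.getOrDefault(o, 0.0)).sum());
        return exact.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k).map(Map.Entry::getKey).collect(Collectors.toList());
    }

    static double kthValue(Map<String, Double> sums, int k) {
        List<Double> v = sums.values().stream()
                .sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        return v.size() >= k ? v.get(k - 1) : 0.0;
    }
}
```

Only the candidates surviving phase 2 are fetched exactly, which is where the bandwidth savings over the naive method come from.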
Algorithms: Hybrid Threshold
Phase 1
Same as in TPUT, i.e. the objective is to determine the first threshold
T = min-k / m.
Phase 2
Send the candidates so far and T to each peer. Each peer determines its
lowest-scoring candidate S_lowest and returns the objects with
score ⩾ T_i = max(S_lowest, T).
Re-calculate the partial sums and select the kth score as τ2.
Phase 3
For every peer where T_i < τ2 / m, fetch the objects with score > τ2 / m.
Re-calculate the partial sums and select the kth score as τ3.
Candidates = objects with partial sum > τ3.
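Phase 2's per-peer filtering step could look like this (again an in-memory simplification; the real implementation issues CQL range queries, and the class and method names here are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

public class HybridPhase2 {
    // Given a peer's scores, the candidate set so far, and the TPUT threshold T,
    // raise the peer-local threshold to T_i = max(S_lowest, T), where S_lowest is
    // the peer's lowest-scoring known candidate, then return objects above T_i.
    public static Map<String, Double> filter(Map<String, Double> peer,
                                             Set<String> candidates, double t) {
        double sLowest = candidates.stream()
                .mapToDouble(o -> peer.getOrDefault(o, 0.0))
                .min().orElse(0.0);
        double ti = Math.max(sLowest, t);
        return peer.entrySet().stream()
                .filter(e -> e.getValue() >= ti)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

Raising the threshold per peer is what lets Hybrid Threshold return fewer objects than TPUT's uniform min-k / m cut.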
Algorithms: KLEE
• TPUT variant
• Trade-off between accuracy and bandwidth
• Relies on summary data (statistical meta-data)
to better estimate min-k without going “deep” on index lists
Fundamental data structures for meta-data:
• Histograms
• Bloom filters
Algorithms: KLEE (Histograms)
● Equi-width cells
● Configurable number of cells
● Each cell n stores:
● Highest score in n (ub)
● Lowest score in n (lb)
● Average score for n (avg)
● Number of objects in n (freq)
Example:
Cell #10 (covers scores from 900-1000):
● ub = 989
● lb = 901
● avg = 937.4
● freq = 200
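Each cell's statistics can be computed in a single pass over the scores that fall into it (a sketch; the field names follow the slide, the class name is made up):

```java
import java.util.List;

public class HistogramCell {
    final double lb, ub, avg;
    final long freq;

    // One pass over the scores falling into this cell.
    HistogramCell(List<Double> scores) {
        double lo = Double.MAX_VALUE, hi = -Double.MAX_VALUE, sum = 0;
        for (double s : scores) { lo = Math.min(lo, s); hi = Math.max(hi, s); sum += s; }
        lb = lo; ub = hi; avg = sum / scores.size(); freq = scores.size();
    }

    // Assign a score to one of n equi-width cells over [0, max].
    static int cellOf(double score, double max, int n) {
        return Math.min(n - 1, (int) (score / max * n));
    }
}
```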
Algorithms: KLEE (Bloom filters)
(Figure: a bit array of positions 0 … m; an object O is hashed by h1 … hn and
the corresponding bits are set to 1; for an object P, if any of the positions
h1(P) … hn(P) is 0, then P ∉ S.)
● Bit set with objects hashed into positions
● Allows for very fast membership queries
● Space-efficient data structure
● However, not invertible → the objects cannot be recovered from the Bloom filter alone
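A minimal Bloom filter sketch (the double-hashing scheme below is an illustrative assumption, not the thesis implementation):

```java
import java.util.BitSet;

public class Bloom {
    private final BitSet bits;
    private final int m, k;

    Bloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Derive k positions from the object's hash code (toy scheme; real
    // filters use independent hash functions).
    private int position(Object o, int i) {
        int h = o.hashCode() * 31 + i * 0x9e3779b9;
        return Math.floorMod(h, m);
    }

    void add(Object o) { for (int i = 0; i < k; i++) bits.set(position(o, i)); }

    // May return false positives, never false negatives.
    boolean mightContain(Object o) {
        for (int i = 0; i < k; i++) if (!bits.get(position(o, i))) return false;
        return true;
    }
}
```

The "never false negatives" property is what KLEE relies on: if the filter says an object is absent from a cell, it really is.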
Algorithms: KLEE
Consists of 4 or (optionally) 3 steps
1 - Exploration Step
Approximate a min-k threshold based on statistical meta-data
2 - Optimisation Step
Decide whether to execute step 3 or go directly to step 4
3 - Candidate Filtering
Filter high-scoring candidates
4 - Candidate Retrieval
Fetch candidates from peers
Algorithms: KLEE (Phase 1)
From each peer (peer 1 … peer m), fetch:
● the top-k objects
● the c "top" histogram cells + Bloom filters
● the freq and avg of the c "low" cells
For each object seen so far:
● if it was in a top-k list, use its true score
● else, if it is in a cell's Bloom filter, use the corresponding avg value
● otherwise, use the weighted avg of the low cells
Calculate the partial sums and select the kth score as min-k.
Objects with score > min-k / m become candidates.
Algorithms: KLEE (Phase 3)
● Request a bit set with all objects scoring higher than the current threshold
● Perform a statistical pruning, leaving only the most "common" objects
(Note: this step was not implemented due to the computational
limitations of Cassandra nodes)
Algorithms: KLEE (Phase 4)
● Request all the candidates from the peers
● Perform a partial sum with the true scores of objects
● Select the k highest as our top-k
Cassandra
Cassandra (architecture overview)
● Fully decentralised column-family store
● High (almost linear) scalability
● No single point of failure (no “master” or “slave” nodes)
● Automatic replication
● Clients can read and write to any node in cluster
● Cassandra takes over duties of partitioning and replicating automatically
Cassandra (architecture overview)
● Automatic partitioning of data (Random partitioning is commonly used)
● Rows are distributed across nodes by the hash of the partition key (the first PK)
(Diagram: in table foo, rows keyed "2013-08-14", "2013-08-15" and "2013-08-16"
each hold columns id = O1 … On with scores, e.g. id = O1, score = 7919 and
id = On, score = 9109. An MD5 hash of the partition key assigns each row, and
its replicas, to one of nodes A–D.)
Cassandra (data model)
● Columns are ordered upon insertion (ordered by PKs)
● Columns in the same row are physically co-located
● Range searches are fast: score < 10000
(simply a linear seek on disk)
(Diagram: table_forward stores row "2013-08-16" with the comparator on id
(ascending): id = O1, score = 7919; id = O2, score = 7901. table_reverse
stores the same data with the comparator on score (ascending), so columns are
ordered by score.)
Cassandra (CQL)
The data manipulation language for Cassandra is CQL
● Similar in syntax to SQL
INSERT INTO table (foo, bar) VALUES (42, 'Meaning')
SELECT foo, bar FROM table WHERE foo = 42
Limitations
● No joins, unions or sub-selects
● No aggregation functions (min, max, etc.)
● Inequality searches are bound to primary-key declaration order (next slide)
Cassandra (CQL)
Consider the following table
CREATE TABLE visits(
date timestamp,
user_id bigint,
hits bigint,
PRIMARY KEY (date, user_id))
Although the following would be valid SQL queries,
they are not valid CQL:
SELECT * FROM visits WHERE hits > 1000
SELECT * FROM visits WHERE user_id > 900 AND hits = 0
Inequality queries are restricted to PKs and return
contiguous columns, such as
SELECT * FROM visits WHERE date = 1368438171000 AND user_id > 1000
Implementation
Implementation: challenges
Implement forward and reverse tables to allow lookup by score and id
● Space is cheap
● Space is even cheaper as Cassandra uses in-built data compression
● Space is even cheaper as denormalised data usually compresses better
than normalised data.
● Takes advantage of score columns being pre-ordered at the row level
Implementation: challenges
Map algorithmic steps to CQL logic
Decompose tasks
● Single step in the algorithm (a node can execute arbitrary code):
the Query Coordinator sends T; peer i determines its local lowest-scoring
candidate S_lowest and returns the candidates with score > max(T, S_lowest).
● Multiple steps in this implementation (we can only communicate with a node
via CQL): the Query Coordinator first asks peer i for its local
lowest-scoring candidate S_lowest, computes T_i = max(T, S_lowest), then
fetches the objects with score > T_i from peer i to obtain the candidates.
Implementation: TPUT (phase 1)
• The Query Coordinator (QC) asks each peer 1..m for its top-k list,
invoking asynchronous Peer methods
• The QC stores the set of all distinct objects received in a
concurrency-safe collection
• The QC calculates a partial sum for each object
using a thread-safe Accumulator data structure.
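The Accumulator mentioned above could be as simple as a ConcurrentHashMap with atomic merges (a sketch; only the method names add and getKthValue come from these slides, the rest is an assumption):

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;

public class Accumulator {
    private final ConcurrentHashMap<String, Long> sums = new ConcurrentHashMap<>();

    // Atomically add a peer's score to the object's running partial sum.
    public void add(String object, long score) {
        sums.merge(object, score, Long::sum);
    }

    // k-th highest partial sum seen so far (the min-k / tau threshold).
    public long getKthValue(int k) {
        return sums.values().stream()
                .sorted(Comparator.reverseOrder())
                .skip(k - 1)
                .findFirst()
                .orElse(0L);
    }
}
```

ConcurrentHashMap.merge is atomic per key, so concurrent async callbacks can add scores without external locking.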
Let's assume the partial sums are:
[O89, 1590], [O73, 1590], [O1, 1000],
[O21, 990], [O12, 880], [O51, 780], [O801, 680]
Calculate the first threshold:
S_psum(O) = S'_peer1(O) + … + S'_peerm(O)
S'_i(O) = S_i(O) if O has been returned by node i, 0 otherwise
T = τ1 / m
(Diagram: the Query Coordinator fetches the top-k ⟨score, object⟩ pairs from
the inverse table of each peer, e.g. peer 1: 1000, O1; 900, O89; 800, O73; …;
peer 2: 990, O21; 790, O73; …; peer n: 880, O12; 780, O51; ….)
Implementation: TPUT (phase 2)
The QC issues a request for all objects with a score > T
from the inverse table (peer.getAbove(T)).
With the received objects, it recalculates the
partial sums
(for each Pair → accumulator.add(pair)).
It designates the kth partial sum as
τ2 = accumulator.getKthValue(k).
(Diagram: the Query Coordinator fetches all entries with score > T from the
inverse table of each peer.)
Implementation: TPUT (phase 3)
● Fetch the final candidates from the
forward table.
● Call async Peer methods
● Aggregate scores and nominate k highest
scoring as the top-k
(Diagram: the Query Coordinator fetches the final candidates from the
forward table of each peer, e.g. peer 1: O1, 1000; O89, 900; …;
peer n: O12, 880; ….)
Implementation: challenges
Sequential vs. Random lookups
All algorithms at some point require random access,
and random access is much slower than sequential access.
(Diagram: a sequential range scan over peer 1's inverse table, 1000, O1 …
1, O99, versus "random" per-object lookups on the forward table, O1, 1000 …
O99, 1.)
Lookup      # objects  Time (ms)  95% CI (ms)
Sequential  240        1.70       0.27
Random      240        115.16     1.32
Sample size n = 100
Implementation: KLEE challenges
Sequential vs. random lookups
As a consequence of expensive random lookups, a modified KLEE3 variant was
implemented.
KLEE3-M:
In the final phase, instead of filtering out the candidates with
score < min-k / m,
do a range scan per peer for the objects with score ⩾ min-k / m.
Trade-off: more data transfer for less execution time.
Implementation: KLEE challenges
Mapping data structures to Cassandra's data model

CREATE TABLE table_metadata(
peer text,
cell int,
lb double,
ub double,
freq bigint,
avg double,
binmax double,
binmin double,
filter blob,
PRIMARY KEY (peer, cell)
) WITH CLUSTERING ORDER BY (cell DESC)

Serialised filter = 0x0000000600000002020100f0084263884418154205141c11
Implementation: KLEE challenges
Mapping data structures to Cassandra's data model
(Flow: for each peer i, the Histogram Creator fetches the entire row,
determines the maximum score and creates n equi-width bins; it partitions the
objects per bin, adding each object to that bin's Bloom filter
(filter0 … filtern) and recording per-cell freq and avg, e.g. cell = 5:
freq = 986, avg = 56.7; finally it serialises the Bloom filters and saves the
row.)
Flexible:
● Configurable number of bins
● Configurable maximum false positive ratio for filters
Implementation: KLEE challenges
Simple API for histogram/Bloom table creation:

final HistogramCreator hc =
    new CassandraHistogramCreator(tableDefinition);
// Optionally a max false positive ratio can be defined
hc.createHistogramTableSchema();
hc.createHistogramTable("1998-05-01", … , "1998-07-26");
Implementation: KLEE challenges
Fast generation
● Feasible for "on-the-fly" jobs
● Roughly linear: execution time of 56 ms per peer with 100,000 elements
Implementation: asynchronous communication
● The driver used allows for asynchronous communication
● Extensive use of ListenableFuture
● Allows for highly concurrent access with a smaller thread pool
● Allows asynchronous transformations (e.g. ResultSet to POJO)
public ListenableFuture<ResultList> getAboveAsync(final long value) {
    final ResultSetFuture above = session.executeAsync(statement.bind(value));
    final Function<ResultSet, ResultList> transformResults = new Function<ResultSet, ResultList>() {
        @Override
        public ResultList apply(final ResultSet rs) {
            final ResultList resultList = new ResultList();
            for (final Row row : rs.all()) {
                resultList.add(
                    Pair.create(row.getBytes(object.getName()), row.getLong(score.getName()))
                );
            }
            return resultList;
        }
    };
    return Futures.transform(above, transformResults, executor);
}
Datasets
Test data
Datasets: Synthetic (Zipf)
Used in the literature as a good approximation of "real-world" data
Datasets: 1998 World Cup Data
● Data in Common Log Format (CLF) from the 1998 World Cup web servers
● IP addresses replaced by unique anonymous id
● Widely used in the literature as “real-world” test data
● Around 1.4 billion entries (approximately 2 million unique visitors)
● Range from 1st of May to 26th of July 1998
● Highly skewed data
Results
Results: varying k
Results: varying number of peers
Results: Datasets (1998 World Cup Data)
"Give me the top 20 visitors from 1st June to 18th June"

Algorithm         Data (KB)  Execution time (ms)  95% CI (ms)  Precision (%)
KLEE3             80         319.95               ±8.58        100
KLEE3-M           1271       84.75                ±6.5         100
Hybrid Threshold  14,306     1921.9               ±65.28       100
TPUT              44         141.5                ±7.36        100
Naive (baseline)  43,572     8514.6               ±61.38       100

Data for 18 peers = daily rows from 1st June 1998 to 18th June 1998
Sample size n = 20
Implementation: Pre-aggregation
Mix and match keys for aggregated results
(Diagram: daily rows such as "2013-08-01" and "2013-08-02" hold per-visitor
counts, e.g. 192.0.43.10 → 98, 192.0.43.11 → 234; monthly rows such as
"2013-08" and "2013-09" hold the pre-aggregated totals, e.g.
192.0.43.10 → 5398, 192.0.43.11 → 23234.)

coordinator.setKeys("1998-05",
                    "1998-06",
                    "1998-07-01",
                    "1998-07-02");
final List<Pair> topK =
    coordinator.getTopK(10);

The top-k results are the same, but computed over 4 peers instead of 63 peers.
Results: Pre-aggregation
Algorithm  Data transfer (KB)           Execution time (ms)
           full    aggregated  savings  full    aggregated  savings
KLEE       20756   633         97%      2412.2  44.3        98%
HT         14404   5894        59%      4842.6  818.6       83%
TPUT       2215    61          97%      1657.1  162.2       90%
Conclusions
Conclusions
• TPUT and HT are well suited for real-time top-k queries with
minimal structural changes in the infrastructure.
• Savings of 98% (TPUT) and 77% (HT) in execution time with no
loss of precision
• Savings of 99.9% (TPUT) and 67% (HT) in data transfer also with no
loss of precision
• KLEE3 requires additional changes to infrastructure, but:
• Efficient to create
• Can discard the final fetch phase for approximate results, with a configurable
trade-off between precision and data transfer / execution time
• Savings of 99% in execution time and 97% in data transfer
Conclusions
• Scalability can be addressed with good planning of data models
together with pre-aggregation
• KLEE3 is more resilient to low object correlation (the case in
real-world data)
• TPUT and KLEE3 are resilient to high k variations, which could
have further practical implications
Future work
Implementing KLEE4
● Intravert¹ is an application server built on top of a Cassandra node
● Based on the vert.x application framework
● Communication is done either in a RESTful way or directly with Java client
● Allows passing code (in several JVM languages such as Groovy, Clojure, etc)
which is executed at the “server side”
● Acting as middleware, it is possible to implement processing
(such as the candidate hash set) remotely and return it to our client
● TPUT and HT already implemented using Intravert
● KLEE4 in progress
1- https://github.com/zznate/intravert-ug
Acknowledgements
Jonathan Halliday (Red Hat)
For technical expertise, supervision and support
Questions?