Lucene is today's default indexing engine. Lucene, like indexing technologies in general, runs on top of a local filesystem and does not treat read/write throughput as a limited resource. However, with the proliferation of clouds, over-the-network access to data is becoming commonplace. This paper proposes a new design for over-the-network indexing built on the core assumption that read/write throughput has to be optimized. In addition, the proposed design is easily browsable, whereas Lucene-like indexing can only execute search queries. A software implementation of the proposed engine is released as open source.
A New Practical Design for Browsable Over-the-Network Indexing
The Over-the-Network Problem
M.Zhanikeev -- maratishe@gmail.com -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428 -- 2/18
Everything is Over-the-Network
• ... in clouds
• ... inside data centers
• ... in home networks
When running over-the-network, the biggest problem is that there is a hard physical limit to throughput.
The "Best" Tools Today
The Closest Tools
1. Lucene, running locally only
2. Google Data APIs, which allow for shared control
◦ not really indexing, though
3. ... that's pretty much it!
Target Applications
[Figure: diagram with the components Data, Indexer, Index, and The Stringex Client]
• server-less applications (read: fully distributed)
• large-scale crowdsourcing connected via cloud storage
• distributed storage -- the same problem
• ...
The Stringex Problem
• a very straightforward optimization problem
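\begin{align}
\text{minimize} \quad & w_1 R_{OUT} + w_2 R_{IN} \tag{1} \\
\text{subject to} \quad & \tag{2} \\
& 0 < R_{IN} \le R_{OUT} \le C, \tag{3} \\
& S_{LOCAL} \le M \le S_{REMOTE}, \tag{4} \\
& N_{LOCAL} \le N_{REMOTE} \le N_{USER} \tag{5}
\end{align}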
• R is rate, throughput, etc.
• S is storage size, which can be local or remote
• C and M are constants, set by the user
• N is the number of files over which the index is split
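To make the formulation concrete, here is a minimal sketch that checks constraints (3)-(5) and evaluates objective (1) over a handful of candidate configurations. The `Plan` class, the function names, and the brute-force search are hypothetical illustrations and are not taken from the Stringex code; the sketch also assumes that w1 and w2 are user-chosen weights and that N_USER is a user-set upper bound.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Plan:
    """One candidate configuration (hypothetical container, not Stringex's API)."""
    r_out: float   # R_OUT
    r_in: float    # R_IN
    s_local: float   # S_LOCAL
    s_remote: float  # S_REMOTE
    n_local: int   # N_LOCAL
    n_remote: int  # N_REMOTE


def feasible(p: Plan, C: float, M: float, n_user: int) -> bool:
    """Constraints (3)-(5) of the Stringex problem."""
    return (
        0 < p.r_in <= p.r_out <= C
        and p.s_local <= M <= p.s_remote
        and p.n_local <= p.n_remote <= n_user
    )


def cost(p: Plan, w1: float, w2: float) -> float:
    """Objective (1): w1 * R_OUT + w2 * R_IN (w1, w2 assumed to be weights)."""
    return w1 * p.r_out + w2 * p.r_in


def best_plan(plans: Iterable[Plan], C: float, M: float, n_user: int,
              w1: float = 1.0, w2: float = 1.0) -> Optional[Plan]:
    """Return the cheapest feasible plan, or None if no plan is feasible."""
    ok = [p for p in plans if feasible(p, C, M, n_user)]
    return min(ok, key=lambda p: cost(p, w1, w2)) if ok else None
```

Enumerating candidates is only practical for a small search space; it is used here because it keeps the objective and the constraints visible side by side.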
Naive Stringex Client
Practical Assumptions
• JSON input; only the top level is indexed, everything below it is stringified
• several efficiency tricks
1. split the index into relatively small files
2. distribute entries smoothly across files using random hashing
3. update parts on timeout -- accumulate multiple intensive updates
4. create special maps which allow for browsing
• JSON aggregations in files: one line is base64( JSON string)
◦ if the bzip2 algorithm is within reach, you can have base64( bzip2( JSON string))
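The line format above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Stringex implementation: the function names are hypothetical, and "stringified" is assumed to mean that any nested object or array under a top-level key is stored as its JSON text.

```python
import base64
import bz2
import json


def flatten_top_level(doc: dict) -> dict:
    """Keep top-level scalars as-is; stringify nested values (assumed meaning)."""
    scalars = (str, int, float, bool)
    return {
        k: v if isinstance(v, scalars) or v is None else json.dumps(v, sort_keys=True)
        for k, v in doc.items()
    }


def encode_line(doc: dict, compress: bool = True) -> str:
    """One line is base64(JSON string), or base64(bzip2(JSON string))."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_line(line: str, compressed: bool = True) -> dict:
    """Inverse of encode_line; the caller must know whether bzip2 was used."""
    raw = base64.b64decode(line)
    if compressed:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Because base64 output contains no newline characters, each encoded document occupies exactly one line, which is what makes a file of such lines easy to append to and to slice.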
Naive Client: Data Structure
INPUT JSON: { name : value1, age : value2, … }

Files (Index)
Key: name
  name.imap
    { bk : { ik : start,end , … next ik }, … next bk }
  name.vmap
    { value : bk , … next value }
  name.bk1
  name.bk2
  …
Key: age
  … (same)

Files (Data)
Docs
  docs.imap
    { bk : { docid : start,end , … next docid }, … next bk }
  docs.bk1
  docs.bk2
  …
  (no .vmap; otherwise the same)
• meta is separate from data
• smart maps let the client read/write sections of files
◦ specifically for the chunk* API in Dropbox
• filenames are the first 2-3 characters of MD5 hashes
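The sketch below illustrates how the maps and the MD5-based file names could fit together. Several details are assumptions rather than facts from the slides: `start,end` is treated as a byte range with an exclusive end, the bucket file name is formed as the key plus the first two hex characters of the MD5 of the value, and an in-memory `bytearray` stands in for a remote bucket file. All function names are hypothetical.

```python
import hashlib


def bucket_name(key: str, value: str, prefix_len: int = 2) -> str:
    """Pick a bucket file from the head of the MD5 hash (assumed naming scheme)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return f"{key}.{digest[:prefix_len]}"


def append_record(bucket: bytearray, imap: dict, vmap: dict,
                  bk: str, ik: str, value: str, line: str) -> None:
    """Append one encoded line to a bucket and record its byte range."""
    start = len(bucket)
    bucket.extend(line.encode("ascii"))
    end = len(bucket)                        # exclusive end, before the newline
    bucket.extend(b"\n")
    imap.setdefault(bk, {})[ik] = (start, end)   # imap: bk -> ik -> (start, end)
    vmap[value] = bk                             # vmap: value -> bk


def read_record(bucket: bytearray, imap: dict, bk: str, ik: str) -> str:
    """Read only the section of the bucket that holds the requested item."""
    start, end = imap[bk][ik]
    return bytes(bucket[start:end]).decode("ascii")
```

With a remote store, the slice in `read_record` would become a ranged read against the bucket file, so the client transfers one record instead of the whole file.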
Naive Client: Sync Engine Design
[Figure: sync engine design, with the components Stringex Index, The Stringex Client, Sync Engine, Optimization, and Local Cache; steps labeled Check and Use, numbered 1 and 2]
Stringex vs Lucene
[Figure: throughput (log of bytes/doc) versus index size (log), for Lucene and Stringex]
Wrapup
• https://github.com/maratishe/stringex has the JS client
• I also have a PHP client for command-line Stringex
• Stringex is better than Lucene for browsing because items cluster naturally
◦ I use it for small browsable summaries of datasets
◦ ... and context-based browsable datasets
• many other uses are possible
That’s all, thank you ...