A New Practical Design for Browsable Over-the-Network Indexing

The Over-the-Network Problem
M.Zhanikeev -- maratishe@gmail.com -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428 -- 2/18
[Figure: two layouts. Traditional: Data, Indexer, Index, Network, Client. Stringex: Data, Indexer, Index (Read, Write), the Stringex Client.]

Everything is Over-the-Network
• ... in clouds
• ... inside data centers
• ... in home networks
When running over-the-network, the biggest problem is that there is a hard physical limit to throughput.

The "Best" Tools Today

The Closest Tools
1. Lucene, running locally only
2. Google Data APIs, which allow for shared control
◦ not really indexing, though
3. ... that's pretty much it!

Target Applications
[Figure: Data, Indexer, Index, and the Stringex Client.]
• server-less applications (read: fully distributed)
• large-scale crowdsourcing connected via cloud storage
• distributed storage, which has the same problem
• ....

The Stringex Problem
• a very straightforward optimization problem
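\begin{align}
\text{minimize} \quad & w_1 R_{OUT} + w_2 R_{IN} \tag{1} \\
\text{subject to} \quad & \tag{2} \\
& 0 < R_{IN} \le R_{OUT} \le C, \tag{3} \\
& S_{LOCAL} \le M \le S_{REMOTE}, \tag{4} \\
& N_{LOCAL} \le N_{REMOTE} \le N_{USER}, \tag{5}
\end{align}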
• R is rate (throughput, etc.)
• S is storage size, which can be local or remote
• C and M are constants set by the user
• N is the number of files over which the index is split
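As a rough illustration of constraints (3) to (5) and objective (1), the sketch below checks whether one candidate configuration is feasible and computes its cost. The class name `StringexPlan`, the function names, and the numbers in the usage note are hypothetical, and treating N_USER as a user-supplied parameter is an assumption; the slides do not prescribe a solver or units.

```python
from dataclasses import dataclass


@dataclass
class StringexPlan:
    """One candidate configuration (hypothetical container, arbitrary units)."""
    r_out: float     # R_OUT
    r_in: float      # R_IN
    s_local: float   # S_LOCAL
    s_remote: float  # S_REMOTE
    n_local: int     # N_LOCAL
    n_remote: int    # N_REMOTE


def is_feasible(p: StringexPlan, C: float, M: float, n_user: int) -> bool:
    """Check constraints (3), (4) and (5) for a single plan."""
    rate_ok = 0 < p.r_in <= p.r_out <= C               # (3)
    storage_ok = p.s_local <= M <= p.s_remote          # (4)
    files_ok = p.n_local <= p.n_remote <= n_user       # (5)
    return rate_ok and storage_ok and files_ok


def cost(p: StringexPlan, w1: float, w2: float) -> float:
    """Objective (1): w1 * R_OUT + w2 * R_IN."""
    return w1 * p.r_out + w2 * p.r_in
```

For example, a plan with `r_out=100`, `r_in=40`, `s_local=10`, `s_remote=1000`, `n_local=4`, `n_remote=16` is feasible for `C=200`, `M=50`, `n_user=64`, and stops being feasible once `C` drops below `r_out`.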

Naive Stringex Client

Practical Assumptions
• JSON input; only the top level is indexed, deeper levels are stringified
• several efficiency tricks
1. split the index into relatively small files
2. distribute smoothly using random hashing
3. update parts on timeout, accumulating multiple intensive updates
4. create special maps which allow for browsing
• JSON aggregations in files: one line is base64(JSON string)
◦ if the bzip2 algorithm is within reach, you can have base64(bzip2(JSON string))
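The line format above can be sketched with the Python standard library. This is a minimal sketch, not the Stringex client's code: the function names are hypothetical, and stringifying every non-string top-level value with `json.dumps` is an assumption about what "stringified" means here.

```python
import base64
import bz2
import json


def flatten_top_level(doc: dict) -> dict:
    """Keep top-level keys; turn every non-string value into a JSON string."""
    return {k: v if isinstance(v, str) else json.dumps(v, sort_keys=True)
            for k, v in doc.items()}


def encode_line(doc: dict, compress: bool = True) -> str:
    """Return base64(bzip2(JSON string)), or base64(JSON string) if not compressing."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    # b64encode emits no newlines, so the result fits on one line of a file.
    return base64.b64encode(raw).decode("ascii")


def decode_line(line: str, compressed: bool = True) -> dict:
    """Inverse of encode_line."""
    raw = base64.b64decode(line)
    if compressed:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Because each record is a single base64 line, a file can be appended to or split on newlines without parsing the JSON inside it.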

Naive Client: Data Structure
INPUT JSON { name : value1, age : value2, … }

Index files, one set per key (Key: name, Key: age, …):
name.imap
{
  bk : {
    ik : start,end ,
    … next ik
  },
  … next bk
}
name.vmap
{
  value : bk ,
  … next value
}
name.bk1
name.bk2
…

Data files (Docs):
docs.imap
{
  bk : {
    docid : start,end ,
    … next docid
  },
  … next bk
}
docs.bk1
docs.bk2
…
Docs have no .vmap; the .imap and .bk files are the same as for keys.
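To make the two maps concrete, here is a small in-memory sketch of how a value could be resolved to a byte range: the vmap gives the bucket (bk), and the imap gives start,end inside that bucket. The function names, the use of dictionaries and `bytearray` in place of files, and the contents of each stored record are assumptions for illustration only.

```python
def add_entry(buckets: dict, imap: dict, vmap: dict,
              bk: str, ik: str, value: str, payload: bytes) -> None:
    """Append payload to bucket bk and record its position in both maps."""
    data = buckets.setdefault(bk, bytearray())
    start = len(data)
    data.extend(payload)
    imap.setdefault(bk, {})[ik] = (start, len(data))  # imap: bk -> ik -> (start, end)
    vmap[value] = bk                                  # vmap: value -> bk


def read_entry(buckets: dict, imap: dict, vmap: dict,
               value: str, ik: str) -> bytes:
    """Resolve value -> bk via vmap, then ik -> (start, end) via imap."""
    bk = vmap[value]
    start, end = imap[bk][ik]
    # Only this slice is needed, which is what makes partial reads possible.
    return bytes(buckets[bk][start:end])
```

With real files, the slice in `read_entry` would become a ranged read of the bucket file, so the client never has to download a whole bucket to fetch one entry.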
• meta is separate from data
• smart maps allow reading and writing sections of files
◦ specifically for the chunk* API in Dropbox
• filenames are the first 2-3 symbols of MD5 hashes
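The filename rule can be sketched as follows. Hashing the value itself and joining key and prefix with a dot are assumptions; the slides only state that filenames are the first 2-3 symbols of MD5 hashes. The helper names are hypothetical.

```python
import hashlib


def bucket_name(key: str, value: str, prefix_len: int = 2) -> str:
    """Name a bucket file by the first prefix_len hex symbols of md5(value)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return f"{key}.{digest[:prefix_len]}"


def max_buckets(prefix_len: int) -> int:
    """Upper bound on files per key: 16 hex symbols per position."""
    return 16 ** prefix_len
```

A 2-symbol prefix caps each key at 256 files and a 3-symbol prefix at 4096, and since MD5 output is close to uniform, entries spread fairly evenly across them.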

Naive Client: Sync Engine Design
[Figure: Stringex Index, the Stringex Client, Sync Engine, Optimization, Local Cache; numbered steps: 1 Check, 2 Use.]
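One way a sync engine could implement the "update parts on timeout" trick from the practical assumptions is a buffer that collects updates per index part and pushes them in one batch once a timeout has passed. This is a hypothetical sketch: the class `UpdateBuffer`, its parameters, and the flush callback are not taken from the Stringex source, and the actual engine may work differently.

```python
import time
from typing import Callable, Dict, List, Optional


class UpdateBuffer:
    """Accumulate updates per index part and flush them together on timeout."""

    def __init__(self, flush: Callable[[str, List[str]], None],
                 timeout_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush            # called once per part with all its updates
        self._timeout_s = timeout_s
        self._clock = clock            # injectable so the timeout can be tested
        self._pending: Dict[str, List[str]] = {}
        self._first_at: Optional[float] = None

    def add(self, part: str, update: str) -> None:
        """Queue an update, then flush if the oldest one has waited long enough."""
        if self._first_at is None:
            self._first_at = self._clock()
        self._pending.setdefault(part, []).append(update)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush everything if the timeout has elapsed; report whether it did."""
        if self._first_at is None:
            return False
        if self._clock() - self._first_at < self._timeout_s:
            return False
        pending, self._pending = self._pending, {}
        self._first_at = None
        for part, updates in pending.items():
            self._flush(part, updates)
        return True
```

Batching this way trades a bounded delay for fewer remote writes, which matters when each write to a part costs a network round trip.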

Evaluation

Stringex vs Lucene
[Figure: throughput (log of bytes/doc, y-axis ticks 2.55 to 3.25) against index size (log, x-axis ticks 3.15 to 6.65), comparing Lucene and Stringex.]

Wrapup
• https://github.com/maratishe/stringex has a JS client
• I also have a PHP client for command-line Stringex
• Stringex is better than Lucene for browsing because items cluster naturally
◦ I use it for small browsable summaries of datasets
◦ ... and context-based browsable datasets
• many other uses are possible

That’s all, thank you ...