A New Practical Design for Browsable Over-the-Network Indexing

Lucene is today's default indexing engine. Lucene, like indexing technologies in general, runs on top of a local filesystem and does not treat the throughput of read/write operations as a limited resource. However, with the proliferation of clouds, over-the-network access to data is becoming commonplace. This paper proposes a new design for over-the-network indexing built on the core assumption that read/write throughput has to be optimized. In addition, the proposed design is made to be easily browsable, whereas Lucene-like indexes can only execute search queries. A software implementation of the proposed engine is released as open source.

Published in: Technology

A New Practical Design for Browsable Over-the-Network Indexing

  1. The Over-the-Network Problem. M.Zhanikeev -- maratishe@gmail.com -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428
  2. Over-the-Network Problem [Diagram; labels: Traditional (Data, Indexer, Index); The Stringex Client (Data, Indexer, Index; Read, Write over the Network)]
  3. Everything is Over-the-Network
     • ... in clouds
     • ... inside data centers
     • ... in home networks
     When running over-the-network, the biggest problem is that there is a hard physical limit to throughput.
  4. The "Best" Tools Today
  5. The Closest Tools
     1. Lucene, running locally only
     2. Google Data APIs, which allow for shared control
        ◦ not really indexing, though
     3. ... that's pretty much it!
  6. Target Applications
  7. Target Applications [Diagram; labels: Data, Indexer, Index, The Stringex Client]
     • server-less applications (read: fully distributed)
     • large-scale crowdsourcing connected via cloud storage
     • distributed storage -- the same problem
     • ....
  8. The Stringex Problem
  9. The Stringex Problem
     • a very straightforward optimization problem
     \begin{align}
     \text{minimize} \quad & w_1 R_{OUT} + w_2 R_{IN} \tag{1} \\
     \text{subject to} \quad & \tag{2} \\
     & 0 < R_{IN} \le R_{OUT} \le C, \tag{3} \\
     & S_{LOCAL} \le M \le S_{REMOTE}, \tag{4} \\
     & N_{LOCAL} \le N_{REMOTE} \le N_{USER}, \tag{5}
     \end{align}
     • R is rate, throughput, etc.
     • S is storage size, can be local and remote
     • C and M are constants, set by user
     • N is number of files over which the index is split
  10. Naive Stringex Client
  11. Practical Assumptions
      • JSON input; only the top level is indexed, anything deeper is stringified
      • several efficiency tricks
        1. split the index into relatively small files
        2. distribute smoothly using random hashing
        3. update parts on timeout -- accumulate multiple intensive updates
        4. create special maps which allow for browsing
      • JSON aggregations in files: one line is base64(JSON string)
        ◦ if the bzip2 algorithm is within reach, you can have base64(bzip2(JSON string))
  12. Naive Client: Data Structure
      INPUT JSON: { name : value1, age : value2, … }
      Files, Index part (one set per key):
        Key: name
          name.imap { bk : { ik : start,end , … next ik }, … next bk }
          name.vmap { value : bk , … next value }
          name.bk1, name.bk2, …
        Key: age (marked "Same" in the diagram)
      Files, Data part (Docs):
          docs.imap { bk : { docid : start,end , … next docid }, … next bk }
          docs.bk1, docs.bk2, …
          no .vmap
      • meta is separate from data
      • smart maps make it possible to read/write sections of files
        ◦ specifically for the chunk* API in Dropbox
      • filenames are the first 2-3 symbols of MD5 hashes
  13. Naive Client: Sync Engine Design [Diagram; labels: Stringex Index, The Stringex Client, Sync Engine, Optimization, Local Cache; steps 1 and 2 labeled Check and Use]
  14. Evaluation
  15. Stringex vs Lucene [Plot; axes: Index Size (log) and Throughput (log of bytes/doc); series: Lucene, Stringex]
  16. Wrapup
      • https://github.com/maratishe/stringex has the JS client
      • I also have a PHP client for command-line Stringex
      • Stringex is better than Lucene for browsing because items cluster naturally
        ◦ I use it for small browsable summaries of datasets
        ◦ ... and context-based browsable datasets
      • many other uses are possible
  17. That’s all, thank you ...
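
The optimization problem on slide 9 can be read as a feasibility check plus a weighted cost. What follows is only a minimal Python sketch of that reading, not the Stringex implementation: the names `Plan`, `is_feasible`, `cost` and `best_plan` are hypothetical, and the slides do not say how the weights w1 and w2 are chosen, so they default to 1.0 here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One candidate operating point (hypothetical container, not from Stringex)."""
    r_in: float      # R_IN from slide 9
    r_out: float     # R_OUT from slide 9
    s_local: float   # S_LOCAL, local storage size
    s_remote: float  # S_REMOTE, remote storage size
    n_local: int     # N_LOCAL, number of index files (local side)
    n_remote: int    # N_REMOTE, number of index files (remote side)


def is_feasible(p: Plan, c: float, m: float, n_user: int) -> bool:
    """Constraints (3)-(5) from slide 9; c and m are the user-set constants C and M."""
    return (
        0 < p.r_in <= p.r_out <= c
        and p.s_local <= m <= p.s_remote
        and p.n_local <= p.n_remote <= n_user
    )


def cost(p: Plan, w1: float, w2: float) -> float:
    """Objective (1): w1 * R_OUT + w2 * R_IN."""
    return w1 * p.r_out + w2 * p.r_in


def best_plan(plans, c, m, n_user, w1=1.0, w2=1.0):
    """Return the cheapest feasible plan, or None if no plan is feasible."""
    feasible = [p for p in plans if is_feasible(p, c, m, n_user)]
    return min(feasible, key=lambda p: cost(p, w1, w2), default=None)
```

Enumerating candidate plans is an assumption made for brevity; the slides state the problem but not how it is solved.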

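
The line format from slide 11, base64(bzip2(JSON string)), and the filename rule from slide 12 could look roughly like this in Python. This is a sketch under assumptions: the function names are hypothetical, the slides do not say which string is fed to MD5 (the field value is assumed here), and the real JS and PHP clients may differ in every detail.

```python
import base64
import bz2
import hashlib
import json


def top_level_fields(doc: dict) -> dict:
    """Keep top-level keys; stringify nested values, since only the top level is indexed."""
    return {
        k: json.dumps(v) if isinstance(v, (dict, list)) else v
        for k, v in doc.items()
    }


def encode_line(doc: dict, compress: bool = True) -> str:
    """One file line: base64(bzip2(JSON string)), or base64(JSON string) without bzip2."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_line(line: str, compress: bool = True) -> dict:
    """Inverse of encode_line; the compress flag must match the one used to encode."""
    raw = base64.b64decode(line)
    if compress:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def bucket_name(key: str, value: str, head: int = 2) -> str:
    """File name built from the first `head` hex symbols of an MD5 hash.

    Assumption: the hash is taken over the field value; the slides do not specify this.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return f"{key}.{digest[:head]}"
```

Because base64 output contains no newline characters, each encoded document fits on one line, which is what makes the one-line-per-item file layout workable.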
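
The .imap and .vmap files on slide 12 suggest a two-step lookup: a value maps to a bucket (bk), and the bucket's map gives a start,end range for each item, so only that section of the file has to be read. The Python sketch below illustrates the idea with in-memory buffers. It assumes that start,end are byte offsets with an exclusive end and that items are newline-separated; all function names are hypothetical, and a real client would issue ranged reads against remote storage instead of slicing a local buffer.

```python
def append_item(bucket: bytearray, imap: dict, bk: str, ik: str, payload: bytes) -> None:
    """Append one encoded line to a bucket and record its byte range in the imap."""
    start = len(bucket)
    bucket.extend(payload + b"\n")
    # Range excludes the trailing newline (assumed convention).
    imap.setdefault(bk, {})[ik] = (start, start + len(payload))


def read_item(bucket: bytes, imap: dict, bk: str, ik: str) -> bytes:
    """Read a single item using only its recorded range, not the whole file."""
    start, end = imap[bk][ik]
    return bytes(bucket[start:end])


def browse(vmap: dict, imap: dict, buckets: dict, value: str) -> dict:
    """Browse all items stored in the bucket that a value maps to."""
    bk = vmap[value]
    return {ik: read_item(buckets[bk], imap, bk, ik) for ik in imap[bk]}
```

Keeping the maps separate from the bucket files matches the "meta is separate from data" point: a client can fetch the small maps first and then request only the ranges it needs.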