A New Practical Design for Browsable Over-the-Network Indexing

Lucene is today's default indexing engine. Like indexing technologies in general, it runs on top of a local filesystem and does not treat the throughput of read/write operations as a limited resource. However, with the proliferation of clouds, over-the-network access to data is becoming commonplace. This paper proposes a new design for over-the-network indexing built on the core assumption that read/write throughput has to be optimized. As a separate feature, the proposed design is easily browsable, whereas Lucene-like indexing can only execute search queries. A software implementation of the proposed engine is released as open source.

Published in: Technology

Transcript

  • 1. The Over-the-Network Problem -- M.Zhanikeev -- maratishe@gmail.com -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428
  • 2. Over-the-Network Problem
      [Diagram comparing the traditional client with the Stringex client. Labels: Data, Indexer, Index, Network, Read, Write]
  • 3. Everything is Over-the-Network
      • ... in clouds
      • ... inside data centers
      • ... in home networks
      When running over-the-network, the biggest problem is that there is a hard physical limit to throughput.
  • 4. The "Best" Tools Today
  • 5. The Closest Tools
      1. Lucene, running locally only
      2. Google Data APIs, which allow for shared control
          ◦ not really indexing, though
      3. ... that's pretty much it!
  • 6. Target Applications
  • 7. Target Applications
      [Diagram labels: Data, Indexer, Index, The Stringex Client]
      • server-less applications (read: fully distributed)
      • large-scale crowdsourcing connected via cloud storage
      • distributed storage -- the same problem
      • ...
  • 8. The Stringex Problem
  • 9. The Stringex Problem
      • a very straightforward optimization problem

      \begin{align}
      \text{minimize} \quad & w_1 R_{OUT} + w_2 R_{IN} \tag{1} \\
      \text{subject to} \quad & \tag{2} \\
      & 0 < R_{IN} \le R_{OUT} \le C, \tag{3} \\
      & S_{LOCAL} \le M \le S_{REMOTE}, \tag{4} \\
      & N_{LOCAL} \le N_{REMOTE} \le N_{USER}, \tag{5}
      \end{align}

      • R is rate, throughput, etc.
      • S is storage size, which can be local or remote
      • C and M are constants, set by the user
      • N is the number of files over which the index is split
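
The constraints and objective on this slide can be checked mechanically. The following is a minimal Python sketch, not the author's implementation: the `Plan` class, the function names, and the sample numbers in the usage note are all hypothetical, and `w1` and `w2` are read here as plain weights, which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """One candidate configuration; fields mirror the symbols on the slide."""
    r_in: float      # R_IN
    r_out: float     # R_OUT
    s_local: float   # S_LOCAL
    s_remote: float  # S_REMOTE
    n_local: int     # N_LOCAL
    n_remote: int    # N_REMOTE


def is_feasible(p: Plan, c: float, m: float, n_user: int) -> bool:
    """Check constraints (3), (4) and (5) for a candidate plan."""
    return (
        0 < p.r_in <= p.r_out <= c
        and p.s_local <= m <= p.s_remote
        and p.n_local <= p.n_remote <= n_user
    )


def cost(p: Plan, w1: float, w2: float) -> float:
    """Objective (1): w1 * R_OUT + w2 * R_IN (w1, w2 assumed to be weights)."""
    return w1 * p.r_out + w2 * p.r_in


def best_plan(plans, c, m, n_user, w1=1.0, w2=1.0):
    """Return the cheapest feasible plan, or None if no plan is feasible."""
    feasible = [p for p in plans if is_feasible(p, c, m, n_user)]
    return min(feasible, key=lambda p: cost(p, w1, w2), default=None)
```

For example, with C = 100, M = 50 and N_USER = 16, a plan with R_OUT = 200 is rejected by constraint (3), and `best_plan` picks among the remaining candidates by weighted throughput only.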
  • 10. Naive Stringex Client
  • 11. Practical Assumptions
      • JSON input; only the top level is indexed, anything deeper is stringified
      • several efficiency tricks
          1. split the index into relatively small files
          2. distribute smoothly using random hashing
          3. update parts on timeout -- accumulate multiple intensive updates
          4. create special maps which allow for browsing
      • JSON aggregations in files: one line is base64(JSON string)
          ◦ if the bzip2 algorithm is within reach, you can have base64(bzip2(JSON string))
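
The top-level rule and the one-line record format above could look roughly like this. It is a Python sketch for illustration only (the released clients are JS and PHP); the function names are hypothetical, and treating "stringified" as "nested objects and arrays are serialized to JSON text" is an assumption.

```python
import base64
import bz2
import json


def flatten_top_level(doc: dict) -> dict:
    """Keep top-level keys; serialize nested dicts/lists to JSON strings.

    Assumption: scalars are kept as-is and only nested containers are
    stringified. The slide does not spell out this detail.
    """
    return {
        key: json.dumps(value, sort_keys=True)
        if isinstance(value, (dict, list))
        else value
        for key, value in doc.items()
    }


def encode_line(doc: dict, compress: bool = True) -> str:
    """One record per line: base64(JSON) or base64(bzip2(JSON))."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_line(line: str, compressed: bool = True) -> dict:
    """Inverse of encode_line; the caller must know whether bzip2 was used."""
    raw = base64.b64decode(line)
    if compressed:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Because base64 output contains no newline characters, many encoded records can be appended to one file and split back on line boundaries.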
  • 12. Naive Client: Data Structure
      INPUT JSON: { name : value1, age : value2, … }
      Files per key (Key: name, Key: age, …):
          name.imap  { bk : { ik : start,end , … next ik }, … next bk }
          name.vmap  { value : bk , … next value }
          name.bk1, name.bk2, …
      Docs (no .vmap):
          docs.imap  { bk : { docid : start,end , … next docid }, … next bk }
          docs.bk1, docs.bk2, …
      [Remaining diagram labels: Index, Data, Same]
      • meta is separate from data
      • smart maps make it possible to read/write sections of files
          ◦ specifically for the chunk* API in Dropbox
      • filenames are the head 2-3 symbols of MD5 hashes
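
To make the roles of the maps concrete, here is a small in-memory Python sketch. It is an illustration under stated assumptions, not the Stringex code: `KeyIndex` and `bucket_name` are hypothetical names, `ik` is read as an item key, bucket contents are modelled as byte arrays instead of remote files, and the bucket name is derived from the MD5 hash of the value, which the slide does not state explicitly.

```python
import hashlib


def bucket_name(value: str, prefix_len: int = 2) -> str:
    """Bucket name: the first few hex symbols of the value's MD5 hash."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:prefix_len]


class KeyIndex:
    """In-memory stand-in for the files of one key: buckets, imap, vmap."""

    def __init__(self, prefix_len: int = 2):
        self.prefix_len = prefix_len
        self.buckets = {}  # bk -> bytearray, stands in for the name.bk* files
        self.imap = {}     # bk -> {ik: (start, end)}, byte range of each item
        self.vmap = {}     # value -> bk

    def add(self, ik: str, value: str, payload: bytes) -> None:
        """Append payload to the value's bucket and record its byte range."""
        bk = bucket_name(value, self.prefix_len)
        data = self.buckets.setdefault(bk, bytearray())
        start = len(data)
        data += payload  # in-place append to the stored bytearray
        self.imap.setdefault(bk, {})[ik] = (start, len(data))
        self.vmap[value] = bk

    def read(self, value: str, ik: str) -> bytes:
        """Read only the section of the bucket that holds the item."""
        bk = self.vmap[value]
        start, end = self.imap[bk][ik]
        return bytes(self.buckets[bk][start:end])
```

Keeping the `(start, end)` offsets in a separate map is what allows a client to fetch or rewrite one section of a bucket file rather than the whole file, which is the point of the "meta is separate from data" bullet.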
  • 13. Naive Client: Sync Engine Design
      [Diagram labels: Stringex Index, The Stringex Client, Sync Engine, Optimization, Local Cache; steps: 1 Check, 2 Use]
  • 14. Evaluation
  • 15. Stringex vs Lucene
      [Plot. Axes: Index Size (log) and Throughput (log of bytes/doc); series: Lucene, Stringex]
  • 16. Wrapup
      • https://github.com/maratishe/stringex has a JS client
      • I also have a PHP client for command-line Stringex
      • Stringex is better than Lucene for browsing because items cluster naturally
          ◦ I use it for small browsable summaries of datasets
          ◦ ... and context-based browsable datasets
      • many other uses are possible
  • 17. That's all, thank you ...