Lots of facets, fast
Anne Veling, BeyondTrees, anne@beyondtrees.com, May 26th 2011
introduction Anne Veling  • Freelance Search Architect  • Lucene Trainer Proquest New York Times                       ...
visualization data  • 1851 up to 2006: almost 60k newspapers How to give semantic overview  • Context, where am I  • Det...
zoom Present all newspapers on one canvas Dynamic zooming and panning Search interface   • for discovery Front-end by ...
architecture           Tile                   Webimages                   tiles         Generator               Server    ...
tiling Newspaper images, old ones scanned  • TIFF form  • Wrinkles, coffee stains Tile generator  • Convert to jpg  • On...
search 25,072,989 articles 867M solr index DataImportHandler  • Issue with memory: load all XML URLs in    memory first...
results   facets             0query                           …        maxDoc                  4   2                      ...
faceting memory Store each facet as BitSet over 25M articles  • 58k facets x 25M docs x 1 bit = 169Gb (memory!) So we us...
faceting performancequery                     Facet initialization                       • Takes ~1.5minute              ...
performance Facet initialization/creation Runtime faceting Solr LRU cache Creation of all facets ~72s Runtime evaluat...
<filterCache class="solr.FastLRUCache" size="70000"initialSize="512" autowarmCount="0"/> Improved performance to ~300ms f...
runtime facet optimization                     16 decades               160 years      1,920 months58,560 days   60,656 f...
optimization Custom facet runtime Collector  • Break if facet matched      single value per doc per facet      each doc...
show us or it didn’t happen Web Application iPad App                                 16
zooming          17
facet heatmap        “television”                       “inflation”                                     18
conclusions Great exploratory UI Use domain knowledge to optimize for  performance  • If you can Next  •   Bring it liv...
enhancement suggestions Lucene Collector  • def collect(doc: Int):Boolean                           class ExistsCollector...
lessons learned Java Graphics has limitations for large fonts  (>26,000) Handling large data sets is tricky  • Indexing ...
thank you      anne@beyondtrees.com              @anneveling                             22

We built a maps-like zooming web application for a well-known US newspaper on top of its 60,000 newspapers published since 1850, using Solr over the 28,000,000 articles to draw an interactive heatmap over them. Using domain knowledge, we made the out-of-the-box faceting solution orders of magnitude faster, which gave us a great visual way of exploring trends in historical newspapers.

  1. lots of facets, fast
     Anne Veling, BeyondTrees
     anne@beyondtrees.com, May 26th 2011
  2. introduction
     Anne Veling
       • Freelance Search Architect
       • Lucene Trainer
     Proquest
     New York Times
  3. visualization
     data
       • 1851 up to 2006: almost 60k newspapers
     How to give semantic overview
       • Context, where am I
       • Detail
     Exploration and Discovery
  4. zoom
     Present all newspapers on one canvas
     Dynamic zooming and panning
     Search interface
       • for discovery
     Front-end by Q42
       • HTML5 app
       • iPad app
     Not yet live
  5. architecture
     [diagram; components: images, Tile Generator, tiles, Web Server, client; text, Indexer, solr index, solr server with facet plugin]
  6. tiling
     Newspaper images, old ones scanned
       • TIFF format
       • Wrinkles, coffee stains
     Tile generator
       • Convert to jpg
       • One virtual canvas of 512 Gpixel
       • Multilayers
     3M tiles: ~100 GB in 11 levels
  7. search
     25,072,989 articles
     867 MB Solr index
     DataImportHandler
       • Issue with memory: loads all XML URLs in memory first
       • Solved by indexing in batches
     Special
       • Nothing stored, not even IDs
       • We need nothing returned from search…
  8. results facets
     [diagram; labels: query, 0 … maxDoc, 4, 2]
  9. faceting memory
     Store each facet as BitSet over 25M articles
       • 58k facets x 25M docs x 1 bit = 169 GB (memory!)
     So we use DocSet from Solr
       • Sparse bit array -> now fits in 1 GB memory
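
The arithmetic on the faceting memory slide can be checked with a short sketch. The dense figure follows directly from the slide's numbers. The sparse figure is only a rough estimate under assumptions that are not in the slides: each facet stores its matching document IDs as 4-byte ints, and every article appears in exactly one facet per level (decade, year, month, day). It is not Solr's actual DocSet accounting, and the names `FacetMemory`, `denseBytes` and `sparseBytes` are made up for this illustration.

```scala
object FacetMemory {
  val GiB: Double = 1024.0 * 1024.0 * 1024.0

  /** Bytes needed when every facet is a dense bit set with one bit per document. */
  def denseBytes(facets: Long, docs: Long): Double =
    facets.toDouble * docs.toDouble / 8.0

  /** Rough sparse estimate (assumption): each document ID is stored once per
    * level as a 4-byte int, regardless of how many facets exist. */
  def sparseBytes(docs: Long, levels: Int): Double =
    docs.toDouble * levels * 4.0

  def main(args: Array[String]): Unit = {
    // 58k facets x 25M docs, as on the slide
    println(f"dense:  ${denseBytes(58000L, 25000000L) / GiB}%.1f GiB")
    // four levels: decade, year, month, day
    println(f"sparse: ${sparseBytes(25000000L, 4) / GiB}%.2f GiB")
  }
}
```

The dense layout pays for every (facet, document) pair, while the sparse layout pays only for the pairs that actually match, which is why the total drops below 1 GB.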
  10. faceting performance
      [diagram; label: query]
      Facet initialization
        • Takes ~1.5 minutes
        • Cached
      Facet evaluation
        • Runtime!
        • #docs x #facets
  11. performance
      Facet initialization/creation
      Runtime faceting
      Solr LRU cache
      Creation of all facets ~72s
      Runtime evaluation out of the box (ootb): 71 seconds…
        /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
          &facet.date=thedate
          &facet.date.start=1850-01-01T00:00:00Z
          &facet.date.end=2007-01-01T00:00:00Z
          &facet.date.gap=%2B1DAY
          &facet=true
      Client-side bottleneck vs Server-side
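
To see why the out-of-the-box request on the performance slide is so expensive, it helps to count the buckets it asks for. The sketch below rebuilds the query string from the slide's parameters and counts the one-day buckets in the requested range. The object and method names are made up for this illustration, and the `version`, `start` and `indent` parameters are left out for brevity.

```scala
import java.net.URLEncoder
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object DateFacetQuery {
  /** Builds the query string for a date-facet request over the given field. */
  def queryString(q: String, field: String, start: String, end: String, gap: String): String = {
    val params = Seq(
      "q" -> q,
      "rows" -> "10",
      "facet" -> "true",
      "facet.date" -> field,
      "facet.date.start" -> start,
      "facet.date.end" -> end,
      "facet.date.gap" -> gap)
    // URL-encode the values: "+1DAY" becomes "%2B1DAY", as on the slide
    params.map { case (k, v) => k + "=" + URLEncoder.encode(v, "UTF-8") }.mkString("&")
  }

  /** Number of one-day buckets between two dates (end exclusive). */
  def dayBuckets(start: LocalDate, end: LocalDate): Long =
    ChronoUnit.DAYS.between(start, end)
}
```

For the range on the slide, `dayBuckets(LocalDate.of(1850, 1, 1), LocalDate.of(2007, 1, 1))` gives 57,343 buckets, each evaluated against the query result on every request.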
  12. <filterCache class="solr.FastLRUCache" size="70000"
          initialSize="512" autowarmCount="0"/>
      Improved performance to ~300ms for “Amsterdam” [1825] query!
        • 2.3 MB output…
      <requestHandler name="/zoomr"
          class="com.proquest.zoom.ZoomrRequestHandler"></requestHandler>
      Custom json output
        • Base 36 encoded heatmap
          01111111111111111122111222777986878768885568855899beddbce
          bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
          mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
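
The base 36 heatmap string packs one bucket into one character (0-9, then a-z). The slides do not say how raw counts are mapped to the 36 levels, so the sketch below assumes linear scaling against the largest count. `Base36Heatmap`, `encode` and `decode` are hypothetical names, not part of the ZoomrRequestHandler.

```scala
object Base36Heatmap {
  /** Scales each non-negative count to 0..35 relative to the largest count
    * (an assumed scheme) and writes one base-36 digit per bucket. */
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max == 0) 0 else math.round(c.toDouble * 35 / max).toInt
      Character.forDigit(level, 36)
    }.mkString
  }

  /** Reads the digits back as intensity levels 0..35. */
  def decode(s: String): Seq[Int] =
    s.toSeq.map(ch => Character.digit(ch, 36))
}
```

Decoding the first four characters of the slide's sample, `decode("0111")`, yields the levels 0, 1, 1, 1. One character per bucket keeps the payload far smaller than a JSON array of numbers.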
  13. runtime facet optimization
      16 decades
      160 years
      1,920 months
      58,560 days
        • 60,656 facets
        • Worst case facet #DocSet.exists(doc)
      Originally: 25M x 60k = 1.5E12 checks, 60k per doc
      Now: average 0.5x for each level = 34.5 per doc
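
The numbers on the runtime facet optimization slide can be reproduced as a small worked calculation. The per-level fan-out used here (16 decades, 10 years per decade, 12 months per year, 31 days per month) is an assumption chosen because it yields the slide's 34.5. The slide itself only states the result.

```scala
object FacetCheckCounts {
  // Facet counts per level, as listed on the slide.
  val facetsPerLevel: Seq[Int] = Seq(16, 160, 1920, 58560)

  // Siblings to scan at each level when walking top-down.
  // Assumption: 31 day slots per month.
  val fanOut: Seq[Int] = Seq(16, 10, 12, 31)

  /** Flat evaluation: every facet is tested against every document. */
  def flatChecks(docs: Long, facets: Long): Double =
    docs.toDouble * facets.toDouble

  /** Top-down evaluation: on average half of the siblings at each level
    * are tested before the matching one is found. */
  def averageTopDownChecks(levels: Seq[Int]): Double =
    levels.map(_ * 0.5).sum
}
```

Here `facetsPerLevel.sum` is 60,656, `flatChecks(25000000L, 60000L)` is 1.5E12, and `averageTopDownChecks(fanOut)` is 8 + 5 + 6 + 15.5 = 34.5 checks per document.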
  14. optimization
      Custom facet runtime Collector
        • Break if facet matched
            • single value per doc per facet
            • each doc has only 1 day
        • Top-down facet selection
            • decade – year – month – day
      Performance for 1850 docs and 60k docs improved from 300ms to 10ms
      Custom optimized heatmap json
      Bottleneck now in the client/canvas/js
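
The two ideas on the optimization slide, breaking on the first match and selecting facets top-down, can be sketched without Lucene or Solr. This is a simplified stand-in, not the custom Collector from the talk: facets are modelled as a tree of `BitSet`s, and `Facet` and `locate` are names invented for this example.

```scala
import scala.collection.immutable.BitSet

object TopDownFacets {
  /** A facet node: a label, the documents it contains, and its finer-grained children. */
  final case class Facet(label: String, docs: BitSet, children: Seq[Facet] = Nil)

  /** Walks down the tree for one document. At each level the siblings are
    * tested in order and the scan stops at the first match, which is valid
    * because a document has exactly one value per level.
    * Returns the matched labels and the number of membership checks made. */
  @annotation.tailrec
  def locate(doc: Int, level: Seq[Facet],
             path: List[String] = Nil, checks: Int = 0): (List[String], Int) = {
    // indexWhere stops scanning at the first sibling that contains the doc
    val idx = level.indexWhere(_.docs.contains(doc))
    if (idx < 0) (path.reverse, checks + level.size)
    else locate(doc, level(idx).children, level(idx).label :: path, checks + idx + 1)
  }
}
```

Only the children of the matching decade, year and month are ever inspected, so the cost per document depends on the fan-out of each level rather than on the total number of facets.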
  15. show us or it didn’t happen
      Web Application
      iPad App
  16. zooming
  17. facet heatmap
      “television”
      “inflation”
  18. conclusions
      Great exploratory UI
      Use domain knowledge to optimize for performance
        • If you can
      Next
        • Bring it live on the Web and in App Store
        • Using it for 1.2M books/CDs/DVDs of Belgium
        • More search options
        • Multipage
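  19. enhancement suggestions
      Lucene Collector
        • def collect(doc: Int): Boolean

            import org.apache.lucene.index.IndexReader
            import org.apache.lucene.search.{Collector, Scorer}

            // Uses the suggested signature: collect returns a Boolean.
            class ExistsCollector extends Collector {
              var exists = false
              // Record the hit, then return false to stop collecting.
              def collect(doc: Int) = {
                exists = true
                false
              }
              def acceptsDocsOutOfOrder() = true
              def setNextReader(reader: IndexReader, base: Int) {}
              def setScorer(scorer: Scorer) {}
            }
      Solr SingleValueFacet
        • Break after first find
        • Automatic order based on #counts?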
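
The Collector suggestion on the slide above is an API change: `collect` would return a Boolean so a collector can tell the search loop to stop. Lucene's `Collector.collect` of that era returns nothing, so the following is only a self-contained simulation of the proposed contract. `StoppableCollector` and `search` are hypothetical stand-ins, not Lucene classes.

```scala
object EarlyExit {
  /** Hypothetical contract: collect returns false to ask the search loop to stop. */
  trait StoppableCollector {
    def collect(doc: Int): Boolean
  }

  /** Remembers only whether any document matched, and stops at the first hit. */
  final class ExistsCollector extends StoppableCollector {
    var exists = false
    def collect(doc: Int): Boolean = {
      exists = true
      false
    }
  }

  /** Stand-in for a search loop: feeds matching doc IDs to the collector
    * until it declines more. Returns how many documents were visited. */
  def search(matches: Iterator[Int], collector: StoppableCollector): Int = {
    var visited = 0
    var keepGoing = true
    while (keepGoing && matches.hasNext) {
      visited += 1
      keepGoing = collector.collect(matches.next())
    }
    visited
  }
}
```

With three matching documents, `search(Iterator(3, 7, 9), new ExistsCollector)` visits only the first one, which is the saving the slide is after for exists-style checks.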
  20. lessons learned
      Java Graphics has limitations for large fonts (>26,000)
      Handling large data sets is tricky
        • Indexing
        • Copying
      There’s technology and there’s corporate agendas
      You can always make things 10x faster
        • Lucene is ridiculously fast
            • If you configure it well
        • Using domain knowledge can get you far
  21. thank you
      anne@beyondtrees.com
      @anneveling
