lots of facets, fast     Anne Veling, BeyondTrees, anne@beyondtrees.com, May 26th 2011
introduction Anne Veling  • Freelance Search Architect  • Lucene Trainer Proquest New York Times                       ...
visualization data  • 1851 up to 2006: almost 60k newspapers How to give semantic overview  • Context, where am I  • Det...
zoom Present all newspapers on one canvas Dynamic zooming and panning Search interface   • for discovery Front-end by ...
architecture           Tile                   Webimages                   tiles         Generator               Server    ...
tiling Newspaper images, old ones scanned  • TIFF form  • Wrinkles, coffee stains Tile generator  • Convert to jpg  • On...
search 25,072,989 articles 867M solr index DataImportHandler  • Issue with memory: load all XML URLs in    memory first...
results   facets             0query                           …        maxDoc                  4   2                      ...
faceting memory Store each facet as BitSet over 25M articles  • 58k facets x 25M docs x 1 bit = 169Gb (memory!) So we us...
faceting performancequery                     Facet initialization                       • Takes ~1.5minute              ...
performance Facet initialization/creation Runtime faceting Solr LRU cache Creation of all facets ~72s Runtime evaluat...
<filterCache class="solr.FastLRUCache" size="70000"initialSize="512" autowarmCount="0"/> Improved performance to ~300ms f...
runtime facet optimization                     16 decades               160 years      1,920 months58,560 days   60,656 f...
optimization Custom facet runtime Collector  • Break if facet matched      single value per doc per facet      each doc...
show us or it didn’t happen Web Application iPad App                                 16
zooming          17
facet heatmap        “television”                       “inflation”                                     18
conclusions Great exploratory UI Use domain knowledge to optimize for  performance  • If you can Next  •   Bring it liv...
enhancement suggestions Lucene Collector  • def collect(doc: Int):Boolean                           class ExistsCollector...
lessons learned Java Graphics has limitations for large fonts  (>26,000) Handling large data sets is tricky  • Indexing ...
thank you      anne@beyondtrees.com              @anneveling                             22

Lots of facets, fast

We built a web application for a well-known US newspaper: a maps-like zooming interface over the 60,000 newspapers published since 1850, with Solr running over the 28,000,000 articles to draw an interactive heatmap on top of it. Using domain knowledge, we made the out-of-the-box faceting solution faster by an order of magnitude, which gave us a great visual way of exploring trends in historical newspapers.

Published in: Technology
Lots of facets, fast

  1. lots of facets, fast
     Anne Veling, BeyondTrees
     anne@beyondtrees.com, May 26th 2011
  2. introduction
     Anne Veling
       • Freelance Search Architect
       • Lucene Trainer
     Proquest
     New York Times
  3. visualization
     data
       • 1851 up to 2006: almost 60k newspapers
     How to give semantic overview
       • Context, where am I
       • Detail
     Exploration and Discovery
  4. zoom
     Present all newspapers on one canvas
     Dynamic zooming and panning
     Search interface
       • for discovery
     Front-end by Q42
       • HTML5 app
       • iPad app
     Not yet live
  5. architecture
     [Diagram] images -> Tile Generator -> tiles -> Web Server -> client
     [Diagram] text -> solr Indexer -> solr index -> solr server (facet plugin) -> client
  6. tiling
     Newspaper images, old ones scanned
       • TIFF format
       • Wrinkles, coffee stains
     Tile generator
       • Convert to jpg
       • One virtual canvas of 512 Gpixel
       • Multilayers
     3M tiles: ~100 GB in 11 levels
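The tile counts on this slide can be sanity-checked with a little pyramid arithmetic. The sketch below is only illustrative: the 512-pixel tile edge and the 2^20 x 2^19 pixel canvas are assumptions chosen so that the canvas holds 512 Gpixel; the slides give neither number. Under those assumptions, 11 levels that each halve the resolution add up to about 2.8M tiles, which is in line with the "3M tiles" quoted above.

```scala
object TilePyramid {
  // All three constants are assumptions for illustration, not taken from the slides.
  val TileSize = 512L            // assumed square tile edge, in pixels
  val CanvasWidth = 1L << 20     // assumed canvas width, in pixels
  val CanvasHeight = 1L << 19    // assumed canvas height; width * height = 2^39 pixels

  /** Tiles needed to cover `pixels` at a zoom-out level (ceiling division). */
  def tilesAlong(pixels: Long, level: Int): Long = {
    val span = TileSize << level // full-resolution pixels covered by one tile edge
    (pixels + span - 1) / span
  }

  /** Total tiles over `levels` levels, level 0 being full resolution. */
  def totalTiles(levels: Int): Long =
    (0 until levels)
      .map(l => tilesAlong(CanvasWidth, l) * tilesAlong(CanvasHeight, l))
      .sum
}
```

With these assumed dimensions, `TilePyramid.totalTiles(11)` evaluates to 2,796,202.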
  7. search
     25,072,989 articles
     867M solr index
     DataImportHandler
       • Issue with memory: loads all XML URLs into memory first
       • Solved by indexing in batches
     Special
       • Nothing stored, not even IDs
       • We need nothing returned from search…
  8. results
     [Diagram: query, results, and facets laid out over documents 0 … maxDoc; counts shown: 4 and 2]
  9. faceting memory
     Store each facet as BitSet over 25M articles
       • 58k facets x 25M docs x 1 bit = 169 GB (memory!)
     So we use DocSet from Solr
       • Sparse bit array -> now fits in 1 GB memory
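The memory figures on this slide follow from simple arithmetic, sketched below. The dense number uses only the slide's 58k and 25M. The sparse estimate is an assumption made for illustration: it supposes each document belongs to a fixed number of facets and that each membership costs one 4-byte document id; the slides do not say how the DocSets were laid out internally.

```scala
object FacetMemory {
  val Docs = 25000000L   // "25M docs" from the slide
  val Facets = 58000L    // "58k facets" from the slide

  /** Dense layout: one bit per (facet, document) pair, in GiB. */
  def denseGiB: Double = Facets * Docs / 8.0 / (1L << 30)

  /** Sparse layout (assumption): one 4-byte doc id per membership, in MiB. */
  def sparseMiB(facetsPerDoc: Int): Double =
    Docs * facetsPerDoc * 4.0 / (1L << 20)
}
```

`FacetMemory.denseGiB` comes to about 169. With four memberships per document (one decade, year, month, and day), `FacetMemory.sparseMiB(4)` stays well below 1 GB.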
  10. faceting performance
      [Diagram: query]
      Facet initialization
        • Takes ~1.5 minutes
        • Cached
      Facet evaluation
        • Runtime!
        • #docs x #facets
  11. performance
      Facet initialization/creation
      Runtime faceting
      Solr LRU cache
      Creation of all facets ~72s
      Runtime evaluation out of the box: 71 seconds…
        /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
          &facet.date=thedate
          &facet.date.start=1850-01-01T00:00:00Z
          &facet.date.end=2007-01-01T00:00:00Z
          &facet.date.gap=%2B1DAY
          &facet=true
      Client-side bottleneck vs Server-side
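To see why the out-of-the-box query is expensive, it helps to count how many one-day buckets the `facet.date` parameters above span. The sketch below counts them with `java.time`; the object and method names are hypothetical, and it assumes one bucket per calendar day between the start (inclusive) and end (exclusive) dates.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object DateGaps {
  /** Number of +1DAY buckets between start (inclusive) and end (exclusive). */
  def dayBuckets(start: LocalDate, end: LocalDate): Long =
    ChronoUnit.DAYS.between(start, end)

  // Range taken from facet.date.start and facet.date.end in the query above.
  val slideQueryBuckets: Long =
    dayBuckets(LocalDate.of(1850, 1, 1), LocalDate.of(2007, 1, 1))
}
```

`DateGaps.slideQueryBuckets` is 57,343, the same order of magnitude as the "58k facets" mentioned elsewhere in the deck.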
  12. <filterCache class="solr.FastLRUCache" size="70000" initialSize="512" autowarmCount="0"/>
      Improved performance to ~300ms for “Amsterdam” [1825] query!
        • 2.3 MB output…
      <requestHandler name="/zoomr" class="com.proquest.zoom.ZoomrRequestHandler"></requestHandler>
      Custom json output
        • Base 36 encoded heatmap
          01111111111111111122111222777986878768885568855899beddbce
          bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
          mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
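A base-36 heatmap string like the one above packs one intensity value into each character. The slides do not say how counts were mapped to digits, so the following is a minimal sketch under an assumed scheme: non-negative counts are scaled linearly so that the largest count becomes 35 (`z`), and each cell is written as a single base-36 digit. `Base36Heatmap`, `encode`, and `decode` are hypothetical names.

```scala
object Base36Heatmap {
  /** Scales non-negative counts to 0..35 (relative to the maximum) and emits one digit per cell. */
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max == 0) 0 else (c.toLong * 35 / max).toInt
      Character.forDigit(level, 36) // '0'..'9' then 'a'..'z'
    }.mkString
  }

  /** Maps each character back to its intensity level 0..35. */
  def decode(s: String): Seq[Int] = s.map(ch => Character.digit(ch, 36))
}
```

For example, `Base36Heatmap.encode(Seq(0, 10, 35, 70))` gives `"05hz"`, and decoding that string returns the levels 0, 5, 17, 35. One character per cell keeps the payload far smaller than a JSON array of numbers.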
  13. runtime facet optimization
      16 decades, 160 years, 1,920 months, 58,560 days
      60,656 facets
      Worst case facet #DocSet.exists(doc)
        • Originally: 25M x 60k = 1.5E12 checks, 60k per doc
        • Now: average 0.5x for each level = 34.5 per doc
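The two per-document figures on this slide can be reproduced as follows. The facet counts per level are the slide's own. The branching factors are an assumption made to match the quoted 34.5: once the parent facet is known, a document is tested against 16 decades, then 10 years, then 12 months, then up to 31 days, and on average half of the candidates at each level are checked before a match.

```scala
object FacetChecks {
  // Facet counts per level, as listed on the slide.
  val facetsPerLevel = Seq(16, 160, 1920, 58560)

  // Assumed candidates per level once the parent is known:
  // decades, years per decade, months per year, days per month (upper bound).
  val branching = Seq(16, 10, 12, 31)

  /** Flat scan: every facet is checked for every document. */
  val totalFacets: Int = facetsPerLevel.sum

  /** Top-down scan: on average half of each level's candidates are checked. */
  val averageTopDown: Double = branching.map(_ * 0.5).sum
}
```

`FacetChecks.totalFacets` is 60,656 and `FacetChecks.averageTopDown` is 34.5, a reduction of more than three orders of magnitude in checks per document.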
  14. optimization
      Custom facet runtime Collector
        • Break if facet matched
            single value per doc per facet
            each doc has only 1 day
        • Top-down facet selection
            decade – year – month – day
      Performance for 1850 docs and 60k docs improved from 300ms to 10ms
      Custom optimized heatmap json
      Bottleneck now in the client/canvas/js
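The two ideas on this slide, stopping at the first matching facet and descending from decade to day, can be sketched without any Lucene or Solr types. This is not the custom Collector from the talk: `Facet`, `collect`, and the use of a plain `Set[Int]` in place of a Solr DocSet are all assumptions for illustration.

```scala
object TopDownFacets {
  /** A facet node; `docs` stands in for a Solr DocSet, `children` is the next finer level. */
  final case class Facet(label: String, docs: Set[Int], children: Seq[Facet] = Nil)

  /**
   * Counts one document against a facet tree. At each level the scan stops at
   * the first facet containing the document, because a document has exactly
   * one value per level. Returns the number of membership checks performed.
   */
  def collect(doc: Int, top: Seq[Facet],
              counts: scala.collection.mutable.Map[String, Int]): Int = {
    var checks = 0
    var current = top
    while (current.nonEmpty) {
      var matched: Option[Facet] = None
      val it = current.iterator
      while (matched.isEmpty && it.hasNext) {
        val f = it.next()
        checks += 1
        if (f.docs.contains(doc)) matched = Some(f) // break on first match
      }
      matched.foreach(f => counts(f.label) = counts.getOrElse(f.label, 0) + 1)
      current = matched.map(_.children).getOrElse(Nil) // descend, or stop if nothing matched
    }
    checks
  }
}
```

Calling `collect` once per matching document fills `counts` for every level while only touching the facets on that document's path, which is where the drop from about 60k checks to a few dozen per document comes from.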
  15. show us or it didn’t happen
      Web Application
      iPad App
  16. zooming
  17. facet heatmap
      [Screenshots: heatmaps for “television” and “inflation”]
  18. conclusions
      Great exploratory UI
      Use domain knowledge to optimize for performance
        • If you can
      Next
        • Bring it live on the Web and in App Store
        • Using it for 1.2M books/CDs/DVDs of Belgium
        • More search options
        • Multipage
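  19. enhancement suggestions
      Lucene Collector
        • def collect(doc: Int): Boolean

          import org.apache.lucene.index.IndexReader
          import org.apache.lucene.search.{Collector, Scorer}

          // Written against the suggested signature above, not the current Lucene API.
          // Returning false is taken to mean "stop collecting".
          class ExistsCollector extends Collector {
            var exists = false
            def collect(doc: Int) = {
              exists = true  // a single hit is enough
              false
            }
            def acceptsDocsOutOfOrder() = true
            def setNextReader(reader: IndexReader, base: Int) {}
            def setScorer(scorer: Scorer) {}
          }
      Solr SingleValueFacet
        • Break after first find
        • Automatic order based on #counts?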
  20. lessons learned
      Java Graphics has limitations for large fonts (>26,000)
      Handling large data sets is tricky
        • Indexing
        • Copying
      There’s technology and there’s corporate agendas
      You can always make things 10x faster
        • Lucene is ridiculously fast
            If you configure it well
        • Using domain knowledge can get you far
  21. thank you
      anne@beyondtrees.com
      @anneveling
