SlideShare a Scribd company logo
lots of facets, fast

     Anne Veling, BeyondTrees
anne@beyondtrees.com, May 26th 2011
introduction
 Anne Veling
  • Freelance Search Architect
  • Lucene Trainer


 Proquest
 New York Times




                                 3
visualization
 data
  • 1851 up to 2006: almost 60k newspapers
 How to give semantic overview
  • Context, where am I
  • Detail
 Exploration and Discovery




                                             4
zoom
 Present all newspapers on one canvas
 Dynamic zooming and panning
 Search interface
   • for discovery


 Front-end by Q42
   • HTML5 app
   • iPad app


 Not yet live

                                         5
architecture




           Tile                   Web
images                   tiles
         Generator               Server

                                              client

 text                     solr    solr
          Indexer
                         index   server
                                      facet
                                     plugin




                                                       6
tiling
 Newspaper images, old ones scanned
  • TIFF form
  • Wrinkles, coffee stains
 Tile generator
  • Convert to jpg
  • One virtual canvas of 512Gpixel
  • Multilayers 3M tiles: ~100Gb in 11 levels




                                                7
search
 25,072,989 articles
 867M solr index
 DataImportHandler
  • Issue with memory: load all XML URLs in
    memory first
  • Solved by indexing in batches
 Special
  • Nothing stored, not even IDs
  • We need nothing returned from search…



                                              8
results   facets
             0

query




                           …




        maxDoc
                  4   2


                               9
faceting memory
 Store each facet as BitSet over 25M articles
  • 58k facets x 25M docs x 1 bit = 169Gb (memory!)
 So we use DocSet from Solr
  • Scarce bitarray -> now fits in 1Gb memory




                                                      10
faceting performance
query
                     Facet initialization
                       • Takes ~1.5minute
                       • Cached


                     Facet evaluation
                       • Runtime!
                       • #docs x #facets




                                             11
performance
 Facet initialization/creation
 Runtime faceting

 Solr LRU cache
 Creation of all facets ~72s
 Runtime evaluation ootb: 71 seconds…
  /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on
  &facet.date=thedate
  &facet.date.start=1850-01-01T00:00:00Z
  &facet.date.end=2007-01-01T00:00:00Z
  &facet.date.gap=%2B1DAY
  &facet=true


 Client-side bottleneck vs Server-side
                                                               12
<filterCache class="solr.FastLRUCache" size="70000"
initialSize="512" autowarmCount="0"/>
 Improved performance to ~300ms for
  “Amsterdam” [1825] query!
   • 2.3Mb output…
<requestHandler name="/zoomr"
class="com.proquest.zoom.ZoomrRequestHandler">
</requestHandler>
 Custom json output
   • Base 36 encoded heatmap
                          01111111111111111122111222777986878768885568855899beddbce
                          bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi
                          mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000




                                                                                      13
runtime facet optimization

                     16 decades

               160 years

      1,920 months

58,560 days

   60,656 facets
   Worst case facet #DocSet.exists(doc)
      • Originally: 25M x 60k = 1.5E12 checks, 60k per
        doc
      • Now: average 0.5x for each level = 34.5 per doc
                                                          14
optimization
 Custom facet runtime Collector
  • Break if facet matched
      single value per doc per facet
      each doc has only 1 day
  • Top-down facet selection
      decade – year – month – day
 Performance for 1850 docs and 60k docs
  improved from 300ms to 10ms
 Custom optimized heatmap json
 Bottleneck now in the client/canvas/js


                                           15
show us or it didn’t happen
 Web Application
 iPad App




                                 16
zooming




          17
facet heatmap

        “television”




                       “inflation”




                                     18
conclusions
 Great exploratory UI
 Use domain knowledge to optimize for
  performance
  • If you can
 Next
  •   Bring it live on the Web and in App Store
  •   Using it for 1.2M books/CDs/DVDs of Belgium
  •   More search options
  •   Multipage



                                                    19
enhancement suggestions
 Lucene Collector
  • def collect(doc: Int):Boolean
                           class ExistsCollector extends Collector {
                             var exists = false

                               def collect(doc: Int) = {
                                 exists = true
                                 false
                               }

                               def acceptsDocsOutOfOrder() = true
                               def setNextReader(reader: IndexReader, base: Int) {}
                               def setScorer(scorer: Scorer) {}
                           }



 Solr SingleValueFacet
      Break after first find
      Automatic order based on #counts?


                                                                                  20
lessons learned
 Java Graphics has limitations for large fonts
  (>26,000)
 Handling large data sets is tricky
  • Indexing
  • Copying
 There’s technology and there’s corporate
  agendas
 You can always make things 10x faster
  • Lucene is ridiculously fast
      If you configure it well
  • Using domain knowledge can get you far
                                                  21
thank you




      anne@beyondtrees.com
              @anneveling



                             22

More Related Content

What's hot

Staying friendly with the gc
Staying friendly with the gcStaying friendly with the gc
Staying friendly with the gc
Oren Eini
 
To Cloud or Not To Cloud?
To Cloud or Not To Cloud?To Cloud or Not To Cloud?
To Cloud or Not To Cloud?
Greg Lindahl
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
Amazon Web Services
 
The Wix Microservice Stack
The Wix Microservice StackThe Wix Microservice Stack
The Wix Microservice Stack
Tomer Gabel
 
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Amazon Web Services
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)
emiltamas
 
Deep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreDeep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block Store
Amazon Web Services
 
MongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combinationMongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combination
Steven Francia
 
Accelerating NoSQL
Accelerating NoSQLAccelerating NoSQL
Accelerating NoSQL
sunnygleason
 
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
Amazon Web Services
 
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
Amazon Web Services
 
NOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the CloudNOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the Cloud
boorad
 
Rebooting design in RavenDB
Rebooting design in RavenDBRebooting design in RavenDB
Rebooting design in RavenDB
Oren Eini
 
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
Amazon Web Services
 
Running Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows AzureRunning Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows Azure
Simon Evans
 
ECS위에 Log Server 구축하기
ECS위에 Log Server 구축하기ECS위에 Log Server 구축하기
ECS위에 Log Server 구축하기
AWSKRUG - AWS한국사용자모임
 
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech TalksDeep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Amazon Web Services
 
Odnoklassniki.ru Architecture
Odnoklassniki.ru ArchitectureOdnoklassniki.ru Architecture
Odnoklassniki.ru Architecture
Dmitry Buzdin
 
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
Malin Weiss
 
Spring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using CouchbaseSpring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using Couchbase
Intae Kim
 

What's hot (20)

Staying friendly with the gc
Staying friendly with the gcStaying friendly with the gc
Staying friendly with the gc
 
To Cloud or Not To Cloud?
To Cloud or Not To Cloud?To Cloud or Not To Cloud?
To Cloud or Not To Cloud?
 
What's New on AWS and What it Means to You
What's New on AWS and What it Means to YouWhat's New on AWS and What it Means to You
What's New on AWS and What it Means to You
 
The Wix Microservice Stack
The Wix Microservice StackThe Wix Microservice Stack
The Wix Microservice Stack
 
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
Maximizing EC2 and Elastic Block Store Disk Performance (STG302) | AWS re:Inv...
 
Scaling with mongo db (with notes)
Scaling with mongo db (with notes)Scaling with mongo db (with notes)
Scaling with mongo db (with notes)
 
Deep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block StoreDeep Dive on Amazon Elastic Block Store
Deep Dive on Amazon Elastic Block Store
 
MongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combinationMongoDB and Ecommerce : A perfect combination
MongoDB and Ecommerce : A perfect combination
 
Accelerating NoSQL
Accelerating NoSQLAccelerating NoSQL
Accelerating NoSQL
 
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
 
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
SRV413 Deep Dive on Elastic Block Storage (Amazon EBS)
 
NOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the CloudNOSQL, CouchDB, and the Cloud
NOSQL, CouchDB, and the Cloud
 
Rebooting design in RavenDB
Rebooting design in RavenDBRebooting design in RavenDB
Rebooting design in RavenDB
 
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301)
 
Running Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows AzureRunning Open Source Solutions on Windows Azure
Running Open Source Solutions on Windows Azure
 
ECS위에 Log Server 구축하기
ECS위에 Log Server 구축하기ECS위에 Log Server 구축하기
ECS위에 Log Server 구축하기
 
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech TalksDeep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
Deep Dive on Amazon EBS Elastic Volumes - March 2017 AWS Online Tech Talks
 
Odnoklassniki.ru Architecture
Odnoklassniki.ru ArchitectureOdnoklassniki.ru Architecture
Odnoklassniki.ru Architecture
 
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
How to JavaOne 2016 - Generate Customized Java 8 Code from Your Database [TUT...
 
Spring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using CouchbaseSpring Camp 2016 - List query performance improvement using Couchbase
Spring Camp 2016 - List query performance improvement using Couchbase
 

Similar to Lots of facets, fast

Loom promises: be there!
Loom promises: be there!Loom promises: be there!
Loom promises: be there!
Jean-Francois James
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
DATAVERSITY
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
xlight
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
Roger Xia
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
liujianrong
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
NAVER D2
 
Restful web services with nodejs
Restful web services with nodejsRestful web services with nodejs
Restful web services with nodejs
Aspenware
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Lucidworks
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
John Adams
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
Chin Huang
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
John Adams
 
Eureka Moment UKLUG
Eureka Moment UKLUGEureka Moment UKLUG
Eureka Moment UKLUG
Paul Withers
 
How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?
VMware Tanzu
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
Amit Gupta
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
Tony Tam
 
Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0
Javier Cantón Ferrero
 
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
Plain Concepts
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWS
Brett McLain
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
Hiram Fleitas León
 

Similar to Lots of facets, fast (20)

Loom promises: be there!
Loom promises: be there!Loom promises: be there!
Loom promises: be there!
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Restful web services with nodejs
Restful web services with nodejsRestful web services with nodejs
Restful web services with nodejs
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Eureka Moment UKLUG
Eureka Moment UKLUGEureka Moment UKLUG
Eureka Moment UKLUG
 
How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0Writing high performance code in NetCore 3.0
Writing high performance code in NetCore 3.0
 
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWS
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 

Recently uploaded

GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 

Recently uploaded (20)

GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 

Lots of facets, fast

  • 1. lots of facets, fast Anne Veling, BeyondTrees anne@beyondtrees.com, May 26th 2011
  • 2. introduction  Anne Veling • Freelance Search Architect • Lucene Trainer  Proquest  New York Times 3
  • 3. visualization  data • 1851 up to 2006: almost 60k newspapers  How to give semantic overview • Context, where am I • Detail  Exploration and Discovery 4
  • 4. zoom  Present all newspapers on one canvas  Dynamic zooming and panning  Search interface • for discovery  Front-end by Q42 • HTML5 app • iPad app  Not yet live 5
  • 5. architecture Tile Web images tiles Generator Server client text solr solr Indexer index server facet plugin 6
  • 6. tiling  Newspaper images, old ones scanned • TIFF form • Wrinkles, coffee stains  Tile generator • Convert to jpg • One virtual canvas of 512Gpixel • Multilayers 3M tiles: ~100Gb in 11 levels 7
  • 7. search  25,072,989 articles  867M solr index  DataImportHandler • Issue with memory: load all XML URLs in memory first • Solved by indexing in batches  Special • Nothing stored, not even IDs • We need nothing returned from search… 8
  • 8. results facets 0 query … maxDoc 4 2 9
  • 9. faceting memory  Store each facet as BitSet over 25M articles • 58k facets x 25M docs x 1 bit = 169Gb (memory!)  So we use DocSet from Solr • Scarce bitarray -> now fits in 1Gb memory 10
  • 10. faceting performance query  Facet initialization • Takes ~1.5minute • Cached  Facet evaluation • Runtime! • #docs x #facets 11
  • 11. performance  Facet initialization/creation  Runtime faceting  Solr LRU cache  Creation of all facets ~72s  Runtime evaluation ootb: 71 seconds… /select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on &facet.date=thedate &facet.date.start=1850-01-01T00:00:00Z &facet.date.end=2007-01-01T00:00:00Z &facet.date.gap=%2B1DAY &facet=true  Client-side bottleneck vs Server-side 12
  • 12. <filterCache class="solr.FastLRUCache" size="70000" initialSize="512" autowarmCount="0"/>  Improved performance to ~300ms for “Amsterdam” [1825] query! • 2.3Mb output… <requestHandler name="/zoomr" class="com.proquest.zoom.ZoomrRequestHandler"> </requestHandler>  Custom json output • Base 36 encoded heatmap 01111111111111111122111222777986878768885568855899beddbce bbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdi mlbbhkahf77987afghhihjihjikjikifeefgppsomf8000 13
  • 13. runtime facet optimization 16 decades 160 years 1,920 months 58,560 days  60,656 facets  Worst case facet #DocSet.exists(doc) • Originally: 25M x 60k = 1.5E12 checks, 60k per doc • Now: average 0.5x for each level = 34.5 per doc 14
  • 14. optimization  Custom facet runtime Collector • Break if facet matched  single value per doc per facet  each doc has only 1 day • Top-down facet selection  decade – year – month – day  Performance for 1850 docs and 60k docs improved from 300ms to 10ms  Custom optimized heatmap json  Bottleneck now in the client/canvas/js 15
  • 15. show us or it didn’t happen  Web Application  iPad App 16
  • 16. zooming 17
  • 17. facet heatmap “television” “inflation” 18
  • 18. conclusions  Great exploratory UI  Use domain knowledge to optimize for performance • If you can  Next • Bring it live on the Web and in App Store • Using it for 1.2M books/CDs/DVDs of Belgium • More search options • Multipage 19
  • 19. enhancement suggestions  Lucene Collector • def collect(doc: Int):Boolean class ExistsCollector extends Collector { var exists = false def collect(doc: Int) = { exists = true false } def acceptsDocsOutOfOrder() = true def setNextReader(reader: IndexReader, base: Int) {} def setScorer(scorer: Scorer) {} }  Solr SingleValueFacet  Break after first find  Automatic order based on #counts? 20
  • 20. lessons learned  Java Graphics has limitations for large fonts (>26,000)  Handling large data sets is tricky • Indexing • Copying  There’s technology and there’s corporate agendas  You can always make things 10x faster • Lucene is ridiculously fast  If you configure it well • Using domain knowledge can get you far 21
  • 21. thank you anne@beyondtrees.com @anneveling 22