IIPC GA 2014 Solr

Published in: Technology
  • Indexing with Solr: issues and best practices

    1. Large-Scale Web Archive Discovery & Analytics Using Apache Solr. Andrew Jackson, UK Web Archive Technical Lead
    2. Context (www.bl.uk)
       • Three collections:
         – Selective since 2004
         – Legal Deposit since 2013
         – Historical 1996-2013 from IA
       • Iterative development:
         – Work directly with researchers
         – Today's historical research tools provide tomorrow's reading rooms
       • Using Solr to support:
         – Discovery
         – Preservation
         – Analytics
    3. Discovery
       • Web archives tend to be messy
         – Lots of poor quality content, e.g. from crawler traps.
         – Spam, e.g. link spam from link farms.
         – Utility of PageRank over time is unclear
       • Faceted search
         – Invest in developing facets to allow filtering rather than PageRank or boosts to rank results.
         – e.g. basic facets from embedded metadata:
           • Last-Modified, Author, etc.
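The facet-first approach on this slide can be sketched as a request to Solr's standard select handler. This is a minimal illustration, not the UK Web Archive's actual query code: the field names `author` and `content_type` and the helper `build_facet_query` are hypothetical, while `q`, `fq`, `rows`, `facet`, `facet.field` and `facet.mincount` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_facet_query(text, facet_fields, filters=None, rows=10):
    """Build the query string for a faceted Solr /select request.

    Field names passed in are assumptions for illustration only.
    """
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("facet", "true"),          # switch faceting on
        ("facet.mincount", "1"),    # hide facet values with no hits
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each chosen facet value becomes a filter query, which narrows
    # the result set without affecting relevance scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


query = build_facet_query(
    "web archive",
    ["author", "content_type"],
    filters={"author": "Jane Doe"},
)
print(query)
```

Filtering through `fq` rather than boosting keeps ranking simple, which matches the slide's preference for facets over PageRank-style scoring.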
    4. Discovery: HTML Links (also)
    5. Discovery: Embedded Licenses
    6. Discovery: Text features
       • No stemming or lemmatization
         – Researchers hated it
       • Natural language detection
         – e.g. gov.uk + fr
       • Postcode-based geoindex
       • Sentiment analysis
       • Similarity hashing via ssdeep
         – To detect similar texts
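As a rough illustration of the first step of a postcode-based geoindex, the sketch below pulls UK-style postcodes out of page text. The regular expression is a deliberately simplified assumption, not the official postcode grammar and not the pattern used by the indexer; mapping each postcode to coordinates would need a separate lookup table, which is not shown.

```python
import re

# Simplified pattern (an assumption for illustration):
# outward code such as "NW1" or "SW1A", then inward code such as "2DB".
POSTCODE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return distinct postcode-like strings in order of first appearance."""
    found = []
    for outward, inward in POSTCODE.findall(text.upper()):
        code = f"{outward} {inward}"  # normalise to a single space
        if code not in found:
            found.append(code)
    return found


print(extract_postcodes("The British Library, 96 Euston Road, London NW1 2DB"))
```

A pattern this loose will produce some false positives, so in practice candidates would be checked against a list of real postcodes before being indexed.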
    7. Discovery: Image features
       • Basic properties:
         – width, height, pixel count
       • Face detection
         – Number of faces & location
       • Dominant colour extraction
         – 'Characteristic' colours
    8. Preservation
       • Format analysis:
         – Using extended MIME types (inc. version + charset):
           • Served
           • Apache Tika
           • DROID
         – First-four-bytes
         – File extension
       • Examples
         – Understanding Unidentified Resources
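The "first-four-bytes" feature can be sketched as follows: take the leading bytes of each payload, hex-encode them, and count how often each signature occurs. The helper names are hypothetical, and the sample payloads are made up for the example; the sketch only shows why such a field helps when grouping resources that other tools could not identify.

```python
from collections import Counter


def first_four_bytes(payload: bytes) -> str:
    """Hex-encode the first four bytes (fewer if the payload is shorter)."""
    return payload[:4].hex()


def signature_histogram(payloads):
    """Count payloads per leading-bytes signature."""
    return Counter(first_four_bytes(p) for p in payloads)


# Made-up sample payloads: two PDFs and one PNG header.
samples = [b"%PDF-1.4\n...", b"%PDF-1.7\n...", b"\x89PNG\r\n\x1a\n..."]
for signature, count in signature_histogram(samples).most_common():
    print(signature, count)
```

Indexed as a facet, a field like this lets a researcher see the most common signatures among resources that the other identification tools left unidentified.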
    9. HTML Versions Over Time
    10. Preservation
        • Deeper characterisation
          – Software identifiers
          – (X)HTML: Elements Used
          – XML: Root Namespace
          – PDF: Apache Preflight
          – Apache Tika's parse errors
          – Will consider adding:
            • DRMLint (SCAPE)
            • JHOVE
    11. Elements Over Time
    12. PDF/A Validation Errors
    13. Parse Errors
    14. Analytics
        • Researcher expectations
          – "How big is the UK Web?"
        • From crawl to Web
          – Crawl schedule, parameters, logs.
          – "Files over 10MB are not archived"
          – De-duplication handling critical
          – Can't forget HTTP 30x, 40x, 50x
        • Compensate via normalisation strategies
          – cf. Google Books Ngram
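One simple normalisation strategy, in the spirit of the Ngram comparison above, is to divide the number of matching documents in each year by the total number of documents archived that year, so that trends reflect relative frequency rather than crawl size. This is a sketch of that idea only; the slides do not say which normalisation the project uses, and the counts below are invented purely to show the arithmetic.

```python
def normalise(term_counts, total_counts):
    """Return the per-year fraction of documents that match a term.

    Years with no archived documents are skipped to avoid dividing by zero.
    """
    return {
        year: term_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0
    }


# Invented numbers, for illustration only.
matches_per_year = {2005: 50, 2010: 200}
documents_per_year = {2005: 1_000, 2010: 10_000}

for year, share in sorted(normalise(matches_per_year, documents_per_year).items()):
    print(year, f"{share:.3f}")
```

In this made-up case the raw count quadruples between the two years while the normalised share falls, which is the kind of crawl-size effect the slide warns about.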
    15. Technical Architecture
        • Core indexer can run from CLI or Hadoop
          – Makes development much easier
        • Hadoop indexer has two modes:
          – SolrCloud:
            • Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
            • Memory issues relating to query complexity
          – Direct to HDFS:
            • Really fast for moderate data volumes
            • Slows down as shards grow
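To put the "1 billion, 1 server, 1 week" figure in perspective, the arithmetic below converts it to a sustained indexing rate. It assumes the figure means one billion documents indexed by a single server in one week, which is a reading of the slide rather than something it states outright; `docs_per_second` is a hypothetical helper.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def docs_per_second(documents, weeks=1.0, servers=1):
    """Average sustained rate per server, assuming an even split of work."""
    return documents / (weeks * SECONDS_PER_WEEK * servers)


rate = docs_per_second(1_000_000_000)
print(f"{rate:.0f} documents per second per server")
```

Under that assumption the rate works out to roughly 1,650 documents per second, sustained for the whole week.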
    16. Scale
        • 1996-2010 tranche of the IA dataset:
          – 2.5 billion HTTP 200 URLs
        • Performance issues:
          – Data quality
          – Robustness
          – Configuration errors
        • Currently re-indexing:
          – with better duplicate handling
          – on three dedicated servers
    17. Open Collaboration
        • Fully open source stack:
          – webarchive-discovery indexer
          – Begun developing an analytics UI
        • Keen to collaborate
          – This community faces a common problem:
            • But not a core SolrCloud/ElasticSearch use case
          – Danish SolrCloud on SSD discovered via Solr mailing list
            • http://sbdevel.wordpress.com/2013/12/06/danish-webscale/
    18. Thank you
