Access and Analytics to the UK Web Archive Lewis Crawford, Web Archive Technical Lead The British Library
  • Analytics and Access to the UK web archive

    1. Access and Analytics to the UK Web Archive
       Lewis Crawford, Web Archive Technical Lead, The British Library
    2. Introduction
       This talk will cover:
       • Background of the UK Web Archive
       • Traditional access methods to Web Archives
       • Full text search for resource discovery
       • Problems of scale – needles and haystacks
    3. Web Archiving: the basics
       • What: Selecting, capturing, storing, preserving and managing access to snapshots of websites over time
       • How: Use crawler software to download websites automatically
           • Selective or domain archiving
           • Provide access in a Web Archive
       • When: Since mid 1990s
       • Who: Heritage and memory organisations, eg (IIPC)
           • University libraries
           • Not-for-profit and commercial organisations, eg Internet Archive
           • Individual researchers
       • Why: Global information resource
           • Artefact of cultural and technology change
           • Representative sample of the web: historical and sociological data that may not be found elsewhere
           • Part of national digital heritage - legal requirements
    4. UK Web Archive:
    5. Web archive as historical documents
    6. Multimedia based content
    7. 3D visualisation wall
    8. Full text search
    9. N-gram visualisation
    10. N-gram visualisation
    11. Media based results
    12. Semantic analysis
    13. Scale: needle and haystack
        • Google: “seen 1 trillion unique URLs”
        • More than a billion new pages are added to the web every day
        • The UK web domain
            • 9 million .uk domain names registered in December 2010
            • ~ 1 million using other domain names
            • Growing at 11% - 14% per year
            • 40% estimated to be in scope for Legal Deposit
            • Estimated ~110TB each UK domain crawl
        • Subject hierarchy visualisation UK Web Archive
        • ~ 10,000 websites collected since 2004
        • ~ 40,000 instances
    14. The value of the haystacks – content visualisation
    15. Big Data analytics
        • Java Map/Reduce to use Tika to extract text and generate XML files for Solr ingest
        • Hive & Pig for ad hoc query analysis
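Slide 15 describes a Java Map/Reduce job that uses Tika to extract text and writes XML files for Solr ingest. As a rough illustration of the final step only, the Python sketch below builds a Solr `<add>` update message from text that has already been extracted. It is not the archive's code: the function name `to_solr_add_xml`, the field names (`id`, `url`, `crawl_date`, `text`) and the sample record are all assumptions, and the Tika extraction itself is not shown.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Build a Solr <add> update message from (url, crawl_date, text) tuples."""
    add = ET.Element("add")
    for url, crawl_date, text in records:
        doc = ET.SubElement(add, "doc")
        # Hypothetical field names; a real Solr schema defines its own.
        fields = (
            ("id", f"{crawl_date}/{url}"),
            ("url", url),
            ("crawl_date", crawl_date),
            ("text", text),
        )
        for name, value in fields:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical record standing in for text extracted from one archived page.
records = [
    ("http://www.example.org.uk/", "2010-05-06",
     "Polling day <live> updates & results"),
]
solr_xml = to_solr_add_xml(records)
print(solr_xml)
```

Building the message with an XML library rather than string concatenation means characters such as `&` and `<` in page text are escaped automatically, which matters when the input is arbitrary web content.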
    16. Search indexing process
        [Architecture diagram. Labels, in flow order: WCT crawlers (generate (w)arcs, insert meta information) → Document Meta Service with meta database → Hadoop nodes 1 to 50 (retrieve (w)arcs and meta information, generate XML files) → XML document store, XML image store and XML media store → DIH indexes new XML on a dedicated SOLR indexer for each store → replication to dedicated SOLR search servers → web access.]
    17. Tag cloud analysis – General Election 2005
        • Special Collection 2005 general election
            • 147 websites archived during and immediately after the UK general election campaign of 2005.
            • Tag clouds (or weighted lists) generated for websites belonging to key political parties
            • Shows the most frequently used words in the websites during the 2005 election campaign
        • Special collection 2010 general election now available
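Slide 17 describes tag clouds as weighted lists of the most frequently used words. A minimal Python sketch of that idea follows: count words, keep the most frequent, and scale each count to a font size. The stopword list, the size range, the function name `tag_cloud_weights` and the sample text are assumptions for illustration; the talk does not say how its tag clouds were actually computed.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not from the talk).
STOPWORDS = {"a", "and", "for", "in", "more", "of", "the", "to"}


def tag_cloud_weights(text, top_n=5, min_size=10, max_size=40):
    """Return (word, count, font_size) for the top_n most frequent words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    counts = Counter(words).most_common(top_n)
    if not counts:
        return []
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = max(highest - lowest, 1)  # avoid division by zero when all counts tie
    return [
        (word, n, round(min_size + (n - lowest) * (max_size - min_size) / span))
        for word, n in counts
    ]


# Hypothetical sample text standing in for the text of a party website.
sample = ("Tax and schools. Schools and hospitals. "
          "Tax, tax and more schools for Britain.")
for word, count, size in tag_cloud_weights(sample):
    print(f"{word}: count={count}, size={size}")
```

Linear scaling between a minimum and maximum size is only one option; a logarithmic scale is often preferred when a few words dominate the counts.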
    18. The value of the haystacks – postcode-based access
    19. 1: Blue; 2-5: Green; 5+: Purple; 50+: Yellow; 100+: Red
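Slide 19 appears to be a colour legend for the postcode-based view on slide 18: 1 is blue, 2-5 green, 5+ purple, 50+ yellow and 100+ red. The Python sketch below turns those bands into a lookup. Two things are assumptions: that the numbers count archived websites per postcode, and that the overlapping boundary at 5 resolves to green (so purple starts at 6). The function name `band_colour` is hypothetical.

```python
def band_colour(site_count):
    """Map a per-postcode count to a legend colour (boundaries partly assumed)."""
    if site_count < 1:
        raise ValueError("count must be at least 1")
    if site_count == 1:
        return "blue"
    if site_count <= 5:   # assumption: 5 belongs to the 2-5 band
        return "green"
    if site_count < 50:   # "5+" read as 6 to 49
        return "purple"
    if site_count < 100:  # "50+" read as 50 to 99
        return "yellow"
    return "red"          # "100+"


for n in (1, 3, 20, 75, 250):
    print(n, band_colour(n))
```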
    20. Questions?
        • Thank you.
        • http://www.webarchive.org.uk
        • [email_address]
        • @relephantdata
