Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford,...
Selecting and Analysing
the WWI and WWII
collections

Christine Madsen
Eric Meyer
19 March 2009
Why WWI and WWII?




Many branches of the humanities

      History      Journalism       Art


     Art history   Advert...
Why WWI and WWII?

Well-rounded set of materials
Why WWI and WWII?




• Changes                             • Differences
  over time                             between
...
Building the Collection



Supplemented with
keyword searches in
    the Archive


  Harvested from
   the Internet
     A...
Building the Collection




                                  Seed
Seeds are:              Seed        1
                 ...
Building the Collection



                    Expanded
         www
                    Collection
                      ...
Building the Collection



Started with WWI




   Too small (under 1,000,000 pages / object)
           Target was 250 mi...
Building the Collection



Expanded to WWII




  Final collection: 5,362,425 unique URLs
Building the Collection




‘World War One’     ‘the great war’

‘World War I’
                    ‘Première Guerre
‘First...
Building the Collection




Returning to   Record links
 ‘hub’ sites   from first 20
 for further     pages of
   analysis...
Building the Collection




  Expanding scope



http://www.greatwar.co.uk/westfront/Somme/index.htm




               ht...
Building the Collection




   Expanding scope



memory.loc.gov/ammem/collections/maps/wwii/index.html




      www.memo...
Building the Collection




 Dealing with illogical or flat directory structures

    www.eyewitnesstohistory.com/ <= don’...
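The scoping decisions on these slides (widening a deep link to a whole host, widening it to a directory, or listing individual pages when a site has a flat directory structure) can be written down as simple scope rules. The following is a minimal sketch, not the project's actual tooling: the rule kinds `host`, `prefix` and `page`, the function names, and the reading of the memory.loc.gov seed as a directory prefix are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

def normalise(url):
    """Return (host, path) with the scheme and a leading 'www.' dropped."""
    if "://" not in url:
        url = "http://" + url  # the slides often omit the scheme
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path or "/"

def in_scope(url, rules):
    """rules: list of (kind, seed_url); kind is 'host', 'prefix' or 'page'."""
    host, path = normalise(url)
    for kind, seed in rules:
        seed_host, seed_path = normalise(seed)
        if host != seed_host:
            continue
        if kind == "host":
            return True  # whole site is wanted
        if kind == "prefix" and path.startswith(seed_path):
            return True  # everything under one directory
        if kind == "page" and path == seed_path:
            return True  # a single page on a flat site
    return False

# Hypothetical rule set built from the URLs shown on the slides.
RULES = [
    ("host", "http://www.greatwar.co.uk"),
    ("prefix", "memory.loc.gov/ammem/collections/maps/wwii/"),
    ("page", "www.eyewitnesstohistory.com/dday.html"),
]
```

With these rules, `in_scope("http://www.eyewitnesstohistory.com/", RULES)` is false while the individual D-Day page is accepted, which mirrors the "don't want whole site" note on the slide.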
Building the Collection




• Stop when most results are redundant
• Narrow in on more specific topics


                 ...
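The "stop when most results are redundant" rule can be made concrete by measuring how many URLs in each new batch of search results have not been seen before. This is only a sketch of one possible stopping rule: the 20% threshold and the function names are assumptions, and the slides do not say how redundancy was judged in practice.

```python
def new_fraction(batch, seen):
    """Share of distinct URLs in this batch that were not seen before."""
    distinct = set(batch)
    if not distinct:
        return 0.0
    fresh = distinct - seen
    return len(fresh) / len(distinct)

def collect(batches, min_new=0.2):
    """Consume result batches until fewer than min_new of a batch is new.

    Returns the set of collected URLs and the number of batches used.
    """
    seen = set()
    used = 0
    for batch in batches:
        if new_fraction(batch, seen) < min_new:
            break  # most results are redundant: stop this keyword
        seen.update(batch)
        used += 1
    return seen, used
```

When a keyword stops yielding new material, the second bullet applies: switch to a narrower term and run the same loop again.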
Building the Collection



• Materials in Foreign language
   – Focused on German sites
   – Consider local conventions, n...
• Other foreign languages were included, but
  not sought after


   Belarusian; Catalan/Valencian; Chamorro;
   Czech; Da...
Building the Collection




                  Difficult to find and include:

                     Museums, libraries, arc...
The World Wide Web of Humanities
     “Extracting The Data”

       St Anne's College, Oxford
            March 19, 2009

...
Agenda

 Brief Introduction to IA's Web Archives

 Discipline Specific Data Extraction from
 Longitudinal Web Archives: ...
Brief Introduction to
IA's Web Archives
The Internet Archive is…
          A digital library of ~4 petabytes of information



     Web Pages
     Educational C...
IA Web Archives

1.6+ petabytes of primary data (compressed)

 150+ billion URIs, culled from 85+ million
  sites, harves...
Discipline Specific Data Extraction
 from Longitudinal Web Archives:

    The WWWoH Case Study
WWWoH Case Study




http://neh-access.archive.org/neh/
WWWoH Case Study

 Unique URLs in the collection: 5,362,425


 Total number of captures: 23,006,857

 Captures span: Ma...
The Data Extraction
                  Process
 Oxford Internet Institute selected relevant
 sites/URLs
 Identified all c...
The Data Extraction Process

 Relevant URLs not identified as seeds
  were not extracted.
   Automatically harvesting AL...
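The trade-off between harvesting all outbound links and curating them by hand suggests a middle path: count how often each off-seed host is linked from seed pages and review only the most frequent ones. The sketch below uses Python's standard library and is an illustration under assumptions, not the Internet Archive's extraction code; `LinkCollector` and `candidate_hosts` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def candidate_hosts(pages, seed_hosts):
    """pages: iterable of (url, html) for captured seed pages.

    Returns off-seed hosts sorted by how many links point at them.
    """
    counts = Counter()
    for url, html in pages:
        parser = LinkCollector(url)
        parser.feed(html)
        for link in parser.links:
            host = urlsplit(link).netloc.lower()
            if host and host not in seed_hosts:
                counts[host] += 1
    return counts.most_common()
```

A curator would then look only at the top of this list, which keeps the manual task bounded while still surfacing relevant non-seed sites.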
WWWoH Case Study: WWI



 Number of Seeds: 2263


 Unique Hosts: 906

 Number of Links: 143+ mil
WWWoH Case Study: WWI
WWWoH Case Study: WWI
WWI: Example
WWI: Example
WWI: Example
WWI: Example
WWWoH Case Study:
      WWII
 Number of Seeds: 2592


 Unique Hosts: 1475

 Number of Links: 252+ mil
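The headline figures from the case-study slides can be turned into a few ratios that make the two collections easier to compare. This is plain arithmetic on the numbers quoted in the slides; the variable and function names are only illustrative.

```python
# Figures as quoted on the WWWoH case-study slides.
STATS = {
    "WWI": {"seeds": 2263, "hosts": 906},
    "WWII": {"seeds": 2592, "hosts": 1475},
}
UNIQUE_URLS = 5_362_425
CAPTURES = 23_006_857

def seeds_per_host(name):
    """Average number of seed URLs per unique host in one collection."""
    s = STATS[name]
    return s["seeds"] / s["hosts"]

# Average number of captures held for each unique URL.
captures_per_url = CAPTURES / UNIQUE_URLS
```

On these figures each unique URL was captured a little over four times on average, and the WWI seeds are concentrated on fewer hosts (about 2.5 seeds per host) than the WWII seeds (about 1.8).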
WWWoH Case Study:
WWII
WWWoH Case Study:
WWII
WWII: Example
WWII: Example
WWII: Example
Challenges


Identifying subject matter-specific
resources of interest for an extraction and
then automating those procedu...
Recommendations for
Future Research and Tools
   Development Efforts
Implications for Future
       Research

 Need link and web graphing tools
 that use inbound and outbound link
 data to i...
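One simple form such a link-graphing tool could take is to rank non-seed hosts by how many distinct seed hosts link to them, and to write the link graph out in Graphviz DOT text for visual inspection. This is a sketch under assumptions: the edge list of (source host, target host) pairs and both function names are hypothetical, and nothing here is taken from an existing tool.

```python
from collections import defaultdict

def rank_by_inbound(edges, seed_hosts):
    """Rank non-seed targets by the number of distinct seed hosts linking in.

    edges: iterable of (source_host, target_host) pairs.
    """
    inbound = defaultdict(set)
    for src, dst in edges:
        if src in seed_hosts and dst not in seed_hosts:
            inbound[dst].add(src)
    ranked = ((dst, len(srcs)) for dst, srcs in inbound.items())
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

def to_dot(edges):
    """Render the de-duplicated edge list as a Graphviz DOT digraph."""
    lines = ["digraph links {"]
    for src, dst in sorted(set(edges)):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)
```

Hosts that many independent seeds point to are plausible nominations for the collection; the DOT text can be rendered with any Graphviz layout program.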
Ideas/Concepts to Explore:
Nomination Tools
Ideas/Concepts to Explore:
Nomination Tools
Opportunities


 Extractions make it easier for humanities
scholars to locate and assemble source
materials of interest.
...
Thank You!


http://neh-access.archive.org/neh/

   Molly Bragg, Partner Specialist
  The Internet Archive, Web Group
    ...
Search and Analysis of
Data in WWWoH

Mark Middleton
      www.hanzoarchives.com   ◀ Copyright © 2009 Hanzo Archives Limit...
Agenda
  Brief introduction to Hanzo
  Open Source Search-Tools: a toolkit for implementing analytical
  applications usin...
Introduction to Hanzo




www.hanzoarchives.com   ◀ Copyright © 2009 Hanzo Archives Limited.
Hanzo Archives Limited
  Web Archiving Services
    Company websites and intranets
    Litigation support
    E-Discovery
...
Hanzo Archiving Technology
  Need advanced capabilities very quickly — continuous product innovation
    Rapid development...
WWWoH and Development of
 Open Source Search-Tools




Objectives
  Deliver an open source search engine for web archives that is simple to
  extend, easy to install and deploy
...
Full Text Search
  Implemented FT search on top of WARC Tools — the toolkit for
  manipulating ISO-28500 WARC files
  Revi...
Component Architecture
  Full text search engine based
  on Open Source Ferret
  Knowledge Base stores search
  results
  ...
Ferret
                                                                                       Ferret is FAST, both indexin...
Advanced Search
  url: (+bbc +wwii) -- search for URLs containing both 'bbc' and 'wwii'
  date: [2001 2002] -- search with...
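To make the meaning of these directives concrete, the sketch below applies the same kinds of filter to a list of in-memory records. It does not parse the Ferret Query Language and is not the Search Tools implementation: the record layout, the function names, the sample records, and the reading of the date range as inclusive are all assumptions.

```python
from urllib.parse import urlsplit

def all_terms(text, terms):
    """True if every term occurs in text (the '+term +term' form)."""
    text = text.lower()
    return all(t.lower() in text for t in terms)

def matches(record, url_terms=(), title_terms=(), years=None, tag=None, domain=None):
    """Apply url:, title:, date:, tag: and domain: style filters to one record."""
    if not all_terms(record["url"], url_terms):
        return False
    if not all_terms(record["title"], title_terms):
        return False
    if years and not (years[0] <= record["year"] <= years[1]):
        return False
    if tag and tag not in record["tags"]:
        return False
    if domain:
        host = urlsplit(record["url"]).hostname or ""
        if not host.endswith("." + domain):
            return False
    return True

# Hypothetical records, for illustration only.
RECORDS = [
    {"url": "http://www.bbc.co.uk/history/wwii/", "title": "World War Two",
     "year": 2001, "tags": {"wwwoh"}},
    {"url": "http://example.fr/owen.html", "title": "Wilfred Owen",
     "year": 2004, "tags": {"wwwoh"}},
]
```

For example, `matches(RECORDS[0], url_terms=("bbc", "wwii"), years=(2001, 2002))` corresponds to combining the `url:` and `date:` directives shown on the slide.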
Working with the Data




Migrating ARC to WARC
  Data extracted from IA in ARC files
  Hanzo WARC Tools and Search Tools projects combined enabled ...
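The core of an ARC to WARC migration is re-wrapping each record: the one-line ARC v1 record header (URL, IP address, 14-digit date, content type, length) becomes a set of named WARC header fields. The sketch below shows that mapping for a single record using only the standard library. It is an illustration, not the Hanzo WARC Tools code; the function names are hypothetical and it ignores details a real migration has to handle, such as the ARC file header record, compression and damaged input.

```python
import uuid
from datetime import datetime, timezone

def parse_arc_header(line):
    """Split an ARC v1 URL-record header line into its five fields."""
    url, ip, date14, mime, length = line.strip().split(" ")
    return {"url": url, "ip": ip, "date": date14, "mime": mime, "length": int(length)}

def warc_date(date14):
    """Convert ARC's YYYYMMDDhhmmss timestamp to WARC's ISO 8601 UTC form."""
    dt = datetime.strptime(date14, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def to_warc_record(arc_header_line, block):
    """Wrap one ARC record body (bytes) in a WARC/1.0 response record."""
    arc = parse_arc_header(arc_header_line)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {arc['url']}",
        f"WARC-Date: {warc_date(arc['date'])}",
        f"WARC-IP-Address: {arc['ip']}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(block)}",
    ]
    # Header block, blank line, content block, then two CRLFs end the record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"
```

A header line that does not split into exactly five fields raises `ValueError` here, which is one simple way a broken ARC record would surface during migration.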
Programmable Access to Data
  WARC Tools and Search Tools provide a rich collection of programmable
  tools to enable anal...
http://wwwoh.hanzoarchives.com/




Analytical Tools
  Frequency Tables for:
    Domains, MIME Types, Countries

  Graphing Tools:
    GUESS -- an exploratory...
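Frequency tables of this kind can be computed in a single pass over (URL, MIME type) pairs. The sketch below is an illustration, not the Search Tools code: `frequency_tables` is a hypothetical name, and using the top-level domain as a stand-in for country is an assumption that is only a rough proxy (a .com site can be hosted anywhere).

```python
from collections import Counter
from urllib.parse import urlsplit

def frequency_tables(records):
    """records: iterable of (url, mime_type) pairs.

    Returns three Counters: per host, per MIME type, per top-level domain.
    """
    domains, mimes, tlds = Counter(), Counter(), Counter()
    for url, mime in records:
        host = (urlsplit(url).hostname or "").lower()
        if not host:
            continue  # skip records without a usable host
        domains[host] += 1
        # Drop parameters such as '; charset=utf-8' and normalise case.
        mimes[mime.split(";")[0].strip().lower()] += 1
        tlds[host.rsplit(".", 1)[-1]] += 1
    return domains, mimes, tlds
```

Calling `most_common()` on any of the returned counters gives the table sorted by frequency.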
Graphing Tools




Recommendations for Future
Research and Tools Development




Future Research
  Faster, richer analytics
  Rich API for analytics, to be developed in collaboration with IA, other
  arc...
Future Tools Development
  Multi-machine indexing and application engine
  Tighter integration of graphing tools, with mor...
Deliverables at End March 2009




Deliverables
  The Search Tools project home is http://code.google.com/p/search-tools/
    Source code
    Documentation
 ...
Thank You
       Hanzo Archives Limited
       +44 20 8816 8226
       www.hanzoarchives.com




WWWoH

Presentations from Oxford Internet Institute, the Internet Archive, and Hanzo Archives Ltd presenting the results of a JISC-NEH funded transatlantic digitisation project.

Published in: Education, Technology

  1. 1. Slides from Humanities on the Web: Is it working? Date: Thursday, 19 March 2009, 10-4 Location: Oxford University, Oxford, UK Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275 Slide URL: http://www.slideshare.net/etmeyer/WWWoH Afternoon Event: 1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer) The Internet Archive: Extracting the data (Molly Bragg) Hanzo Archives Ltd.: Working with the data (Mark Middleton) Discussion and questions Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238
  2. 2. Selecting and Analysing the WWI and WWII collections Christine Madsen Eric Meyer 19 March 2009
  3. 3. Why WWI and WWII? Many branches of the humanities: History; Journalism; Art; Art history; Advertising; Literature; Political science; Military history; Poetry
  4. 4. Why WWI and WWII? Well-rounded set of materials
  5. 5. Why WWI and WWII? • Changes over time • Differences between WWI and WWII. Language; Doc types; Secondary domains; Top-level domains
  6. 6. Building the Collection Supplemented with keyword searches in the Archive Harvested from the Internet Archive Selected from the live web
  7. 7. Building the Collection Seeds are: the website or portion of the website that you plan to include in your collection [Diagram: Seed 1, Seed 2 and Seed 3 make up the Initial Collection]
  8. 8. Building the Collection A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site [Diagram: Seeds 1 to 6, each linked to surrounding www sites, make up the Expanded Collection]
  9. 9. Building the Collection Started with WWI Too small (under 1,000,000 pages / object) Target was 250 million
  10. 10. Building the Collection Expanded to WWII Final collection: 5,362,425 unique URLs
  11. 11. Building the Collection ‘World War One’; ‘the great war’; ‘World War I’; ‘Première Guerre Mondiale’; ‘First world war’; ‘World War II’; ‘zweiter Weltkrieg’; ‘World War Two’
  12. 12. Building the Collection Returning to ‘hub’ sites for further analysis; Record links from first 20 pages of search [include dead links]; Following links
  13. 13. Building the Collection Expanding scope http://www.greatwar.co.uk/westfront/Somme/index.htm http://www.greatwar.co.uk
  14. 14. Building the Collection Expanding scope memory.loc.gov/ammem/collections/maps/wwii/index.html www.memory.loc.gov/ammem/collections maps/wwii/
  15. 15. Building the Collection Dealing with illogical or flat directory structures www.eyewitnesstohistory.com/ <= don’t want whole site www.eyewitnesstohistory.com/blitzkrieg.htm www.eyewitnesstohistory.com/dday.html www.eyewitnesstohistory.com/midway.htm www.eyewitnesstohistory.com/airbattle.htm www.eyewitnesstohistory.com/dunkirk.htm www.eyewitnesstohistory.com/francesurrenders.htm
  16. 16. Building the Collection • Stop when most results are redundant • Narrow in on more specific topics: Churchill; Hitler; ‘zweiter Weltkrieg’; ‘Battle of the Bulge’; ‘Great war’; Guadalcanal; WWI; Allies; WWII; Home front
  17. 17. Building the Collection • Materials in Foreign language – Focused on German sites – Consider local conventions, not just translations WWII (zweiter Weltkrieg) the period of National Socialism (Zeit des Nationalsozialismus) the period in which the Nazis ruled (Nazizeit)
  18. 18. • Other foreign languages were included, but not sought after Belarusian; Catalan/Valencian; Chamorro; Czech; Danish; German; Dzongkha; English; Spanish/Castilian; Finnish; French; Hebrew; Hungarian; Italian; Japanese; Luba-Katanga; Dutch/Flemish; Polish; Portuguese; Russian; Slovenian; Turkish; Ukrainian; Chinese
  19. 19. Building the Collection Difficult to find and include: Museums, libraries, archives Some improvement through targeted searches NYPL (2,100 photographs) Harvard Libraries (1,000 WWI Pamphlets) Directory Structures still limiting http://pds.lib.harvard.edu/pds/view/7845178 (first page of a multipage object)
  20. 20. The World Wide Web of Humanities “Extracting The Data” St Anne's College, Oxford March 19, 2009 Molly Bragg, Partner Specialist Web Group The Internet Archive
  21. 21. Agenda • Brief Introduction to IA's Web Archives • Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study • Recommendations for Future Research and Tools Development Efforts
  22. 22. Brief Introduction to IA's Web Archives
  23. 23. The Internet Archive is… A digital library of ~4 petabytes of information • Web Pages • Educational Courseware • Films & Videos • Music & Spoken Word • Books & Texts • Software • Images The Archive’s combined collections receive over 6 mil downloads a day! www.archive.org
  24. 24. IA Web Archives 1.6+ petabytes of primary data (compressed) • 150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present • Includes captures from every domain • Encompasses content in over 40 languages • As of 2009, IA will add ½ petabyte to 1 petabyte of data to these collections each year.
  25. 25. Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study
  26. 26. WWWoH Case Study http://neh-access.archive.org/neh/
  27. 27. WWWoH Case Study • Unique URLs in the collection: 5,362,425 • Total number of captures: 23,006,857 • Captures span: May 1996 to Aug 2008 • Total size of compressed data: ~250 GB
  28. 28. The Data Extraction Process • Oxford Internet Institute selected relevant sites/URLs • Identified all captures related to the seeds • Identified all files embedded in each capture (on & off seed domains) for extraction • Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
  29. 29. The Data Extraction Process • Relevant URLs not identified as seeds were not extracted. • Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection • Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links
  30. 30. WWWoH Case Study: WWI • Number of Seeds: 2263 • Unique Hosts: 906 • Number of Links: 143+ mil
  31. 31. WWWoH Case Study: WWI
  32. 32. WWWoH Case Study: WWI
  33. 33. WWI: Example
  34. 34. WWI: Example
  35. 35. WWI: Example
  36. 36. WWI: Example
  37. 37. WWWoH Case Study: WWII • Number of Seeds: 2592 • Unique Hosts: 1475 • Number of Links: 252+ mil
  38. 38. WWWoH Case Study: WWII
  39. 39. WWWoH Case Study: WWII
  40. 40. WWII: Example
  41. 41. WWII: Example
  42. 42. WWII: Example
  43. 43. Challenges Identifying subject matter-specific resources of interest for an extraction and then automating those procedures. • Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise • Available tools for collection building and access are too technically focused for the average humanities scholar
  44. 44. Recommendations for Future Research and Tools Development Efforts
  45. 45. Implications for Future Research • Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest • Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input
  46. 46. Ideas/Concepts to Explore: Nomination Tools
  47. 47. Ideas/Concepts to Explore: Nomination Tools
  48. 48. Opportunities • Extractions make it easier for humanities scholars to locate and assemble source materials of interest. • These collections can accelerate and/or augment discipline specific research efforts • Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another
  49. 49. Thank You! http://neh-access.archive.org/neh/ Molly Bragg, Partner Specialist The Internet Archive, Web Group mbragg@archive.org
  50. 50. Search and Analysis of Data in WWWoH Mark Middleton www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
  51. 51. Agenda Brief introduction to Hanzo; Open Source Search-Tools: a toolkit for implementing analytical applications using web archives; WWWoH: working with the data; Recommendations for future research; Recommendations for future tools development; WWWoH Tools Deliverables
  52. 52. Introduction to Hanzo
  53. 53. Hanzo Archives Limited Web Archiving Services: Company websites and intranets; Litigation support; E-Discovery; IP protection. Focus on legally defensible web archives of exceptional quality. Very advanced crawlers and access tools: dynamic html, video, flash, web 2.0. Some public archives; Mainly closed archives
  54. 54. Hanzo Archiving Technology Need advanced capabilities very quickly: continuous product innovation; Rapid development of tools. Create research and open source projects to promote mainstream awareness of web archives and web archiving technology. Open source projects include WARC Tools and Search Tools
  55. 55. WWWoH and Development of Open Source Search-Tools
  56. 56. Objectives Deliver an open source search engine for web archives that is simple to extend, easy to install and deploy; Integrate with WARC Tools, the open source web archive file manipulation tools (Hanzo and IIPC); Extend the search engine with interesting directives and options; Extend the search engine to provide data to analytical tools, develop an API, tools, and exemplar analytical tools; Encourage third party analytical tools to use web archives as their data repository; Migrate WWWoH extraction from ARC to WARC and ingest into Search Tools
  57. 57. Full Text Search Implemented FT search on top of WARC Tools, the toolkit for manipulating ISO-28500 WARC files. Reviewed several options: Java Lucene (and clones), Xapian, DB indexing (Sphinx, OpenFTS), etc. Criteria: vibrant development community, extensible (searching web archives is different: temporal dimension, duplicate handling, etc.), fast and full-featured (boolean, time queries, ability to index multiple fields, query language)
  58. 58. Component Architecture Full text search engine based on Open Source Ferret; Knowledge Base stores search results; Python application with Django model and Django WUI; Memcache; Plug-in architecture to support multiple analytical applications
  59. 59. Ferret Ferret is FAST, both indexing and searching; Highly scalable, up to 100m documents on a single CPU; Supports distributed search; Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields; Ferret Query Language; http://ferret.davebalmain.com/trac/wiki/FerretVsLucene
  60. 60. Advanced Search url: (+bbc +wwii) -- search for URLs containing both 'bbc' and 'wwii'; date: [2001 2002] -- search within date range; tag: wwwoh -- search content with the tag 'wwwoh'; title: (+wilfred +owen) -- search for Wilfred and Owen within the title; domain: fr -- restrict search to within .fr domain
  61. 61. Working with the Data
  62. 62. Migrating ARC to WARC Data extracted from IA in ARC files. Hanzo WARC Tools and Search Tools projects combined enabled us to migrate ARC to WARC files (WARC is the new ISO standard). Some challenges: broken ARCs, scale, etc. 3,264 WARC files
  63. 63. Programmable Access to Data WARC Tools and Search Tools provide a rich collection of programmable tools to enable analytics tools developers to use web archives: Object-oriented C, REST API, fast iterators; Command lines for manipulating WARCs, indexing, searching; Web applications for browsing, searching, demonstrator analytics; C/C++, Python, Ruby, Perl, … and if you need to, Java, C#. Demonstration: the web applications
  64. 64. http://wwwoh.hanzoarchives.com/
  65. 65. www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
  66. 66. www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
  67. 67. www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
  68. 68. Analytical Tools Frequency Tables for: Domains, MIME Types, Countries. Graphing Tools: GUESS -- an exploratory data analysis and visualization tool for graphs and networks; Graphviz -- makes diagrams in several formats: images and SVG for web pages, Postscript; or display in an interactive graph browser; Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs and to layout hyperbolic trees
  69. 69. www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
  70. 70. Graphing Tools
  71. 71. Recommendations for Future Research and Tools Development
  72. 72. Future Research Faster, richer analytics; Rich API for analytics, to be developed in collaboration with IA, other archives, and IIPC; Temporal analytics and techniques; Link and network graphing and analytics; Enhance outreach/dissemination to the mainstream development community and research community
  73. 73. Future Tools Development Multi-machine indexing and application engine; Tighter integration of graphing tools, with more user parameters and configurations; Temporal analysis (animation of link graphs over time); Enhance WARC Tools integration and investigate interoperability with other IIPC toolsets; Developer documentation; Analyst/researcher documentation; Installation tools for Linux, Mac OS X and Windows XP/Vista
  74. 74. Deliverables at End March 2009
  75. 75. Deliverables The Search Tools project home is http://code.google.com/p/search-tools/ (Source code; Documentation; Issue management; Mailing list) The WARC Tools project home is http://code.google.com/p/warc-tools/ The prototype application is http://wwwoh.hanzoarchives.com/
  76. 76. Thank You Hanzo Archives Limited +44 20 8816 8226 www.hanzoarchives.com