Web Archives and Digital Methods

Historically, the practice of web archiving has involved various institutions and the development of various practices, approaches and tools. Among them, three main approaches to web archiving have been developed: web archive research using the Internet Archive and Wayback Machine, the practice of archiving special collections of websites, and the national approach of archiving the webs of specific countries. These approaches and practices not only reflect the time in which they were conceived in the history of web archiving, but also put forward distinct ways in which they may be used and, consequently, what type of historiographical research can be done with them. However, there are also limits to what these tools and practices offer. The purpose of this talk is to introduce the limits of doing research with the Internet Archive using existing tools such as the Wayback Machine and, in addition, to show how digital methods are used to repurpose the Wayback Machine in order to go beyond the single-site historical research that is enabled by the Internet Archive. This will be illustrated in a case study on the Dutch blogosphere in which, by means of custom tools built on top of the Wayback Machine, yearly snapshots of the historical Dutch blogosphere were created for the period 1999 to 2009. By reconstructing the interlinked set of blogs, the blogosphere, one can trace and map transitions in linking technologies and practices in the Dutch blogosphere over time. This approach allows for studying the emergence and decline of blog platforms and social media platforms within the blogosphere and for investigating local blog cultures.

Published in: Education, Technology

  • Hi, I would like to thank the organizers for inviting me. I’m Anne Helmond, a PhD candidate and lecturer at the University of Amsterdam (UvA), with the Digital Methods Initiative (DMI).
  • The Digital Methods Initiative is a contribution to doing research into the "natively digital" where the focus is on how methods may change, however slightly or wholesale, owing to the technical specificities of new media. Digital Methods is a term coined as a counter-point to virtual methods, which typically digitize existing methods and port them onto the Web. Digital Methods, contrariwise, seek to learn from the methods built into the dominant devices online, and repurpose them for social and cultural research. That is, the challenge is to study the info-web and the social web with the tools that organize them *and* archive them.
  • Historically, the practice of web archiving has involved various institutions and the development of various practices, approaches and tools. Among them, three main approaches to web archiving have been developed: web archive research using the Internet Archive and Wayback Machine, the practice of archiving special collections of websites, and the national approach of archiving the webs of specific countries. These approaches and practices not only reflect the time in which they were conceived in the history of web archiving, but also put forward distinct ways in which they may be used and, consequently, what type of historiographical research can be done with them.
  • The first period in web archiving is that of the Internet Archive, which, with the aid of the Alexa toolbar, aimed to archive ‘everything.’ Archived websites could initially be accessed through the Alexa toolbar; in 2001 the archive became publicly accessible through the Wayback Machine.
  • A second period deals with web sphere analysis, as put forward by Foot & Schneider, and with special collections such as the election archives or the 9/11 archive of the Library of Congress.
  • A third period may be described as the national turn, or the long march of the national libraries into web archiving; for example, netarkivet.dk “collects and preserves the Danish part of the internet.”
  • A Web archive as an object, formed by the archiving process, embeds particular preferences for how it is used, and for the type of research performed with it. Which methods of research are privileged by the specific form assumed by the Web archive, and which are precluded? And which types of historiographies can be produced with these archives?
  • Different approaches to archiving favor different types of research and historiography. The Internet Archive and its Wayback Machine produce single-site histories, because one accesses the archive by entering the URL of a particular website.
    The second approach, with its focus on special collections, in particular elections and disasters, favors an event-based historiography, while the third, in which national libraries or institutions focus on their own portion of the web, puts forward national historiographies.
  • We will now zoom in on some use scenarios of one of the publicly available web archives: the Internet Archive. Before doing so, note that the way the archive is constructed also limits what can be done through its default interface. Because the Wayback Machine requires the input of a single URL, one can only see the changes to a single website over time. An important missing feature is the ability to search for a specific word within a page or over time. In this way the archive reflects the time it was built in: cyberspace, with its surfing, whereas in 2012 the default on the web is searching. We no longer surf, we search.
    A second limitation of the Internet Archive is related to the idea of continuous surfing: the Wayback Machine will jump through time if the requested link is not available for the requested date. It will show the page closest to the date requested, or go to the live web. This means that followed links could be seven months apart, a very important issue for researchers to consider.
  • But despite these limitations, the IA offers a rich resource for researchers. I will now show a few examples of the different types of digital methods style research that can be done with the Internet Archive. As explained before, due to its construction the IA privileges single-site histories, or website biographies. A website biography may be understood as a story of the history of a website, and it may tell a larger story about the history of the web more generally. The example movie that will be shown next is a screencast documentary that tells the history of Google as seen through its interface from 1998 until late 2007. It makes use of all the available Google front pages in the Internet Archive. The movie “Google and the politics of tabs” chronicles the subtle changes to the Google front-page real estate.
  • A second type of digital methods style research is creating your own collections through the archive. For example, this collection of 10 years of right-wing websites was created by the Dutch newspaper NRC Handelsblad. It showed the rise of right-wing websites after 11 September and that the difference in language use became smaller, showing how the internet reveals a hardening of the Netherlands.
  • A third type of IA use is to capture periods of web history, in this case the early blogosphere as seen through the Eatonweb portal; the image shows, in the middle, the blogs that are missing from the IA.
  • In addition to looking at the history of a single site or a collection of sites, one can also look at the larger link space of a collection of sites by means of historical link analysis: by mapping the outlinks with network visualization tools, one can reconstruct the network. A next step is to reconstruct these networks over time to show change.
  • This short movie by Anat Ben-David (who is present here) shows the Palestinian refugee space on the Web over time: you can see the expansion of the space during the second intifada, and then its narrowing down after it moved to the .ps domain.
  • We will now look at a specific case study to show the steps involved in reconstructing historical web spaces. This digital methods case study, done together with my colleague Esther Weltevrede, maps the historical Dutch blogosphere using data from the Internet Archive. It builds on the previously shown examples and proposes new methods for more fine-grained historical link analysis.
    In this study, we aimed to map and analyze transitions in the Dutch blogosphere from its beginning in 1999 to ‘now’, 2009. So how does one map a national, historical blogosphere? This section will talk you through some of the methods we have developed and some of our preliminary findings.
  • The annual blogospheres are created from a collection of blogs retrieved from the Internet Archive using existing and custom tools. First, we queried the Internet Archive’s Wayback Machine with a custom collection tool created by Erik Borra, because by default the IA privileges single-URL analysis and not collection analysis. As such, the Internet Archive privileges single-site histories instead of national histories.
    Second, we used Google Refine for cleaning and transforming the data. Third, we used Gephi and the G-Atlas for visualizing the data, thereby turning it into a visible and ‘tangible’ blogosphere.
  • As starting points we took a collection of authoritative sources and expert lists from Dutch blogosphere historians, including the Loglijst, a blog indexing site (an early Technorati) that was started in 2001; we were given a 2001 database dump and retrieved all the links from it. Relying on these sources to provide us with a collection of Dutch blogs led us to include a small number of Belgian .be blogs that our sources considered to be part of the Dutch blogosphere.
    We then retrieved all these URLs from the Archive and extracted all the links from each front page.
  • We requested 2507 URLs from our starting lists and were able to retrieve 946 of them from the Archive. This is the number of blogs we retrieved per year, and this is our corpus for further analysis.
  • One of the consequences of studying transitions in a national blogosphere over time with the Internet Archive is that it is only possible to do research at the front-page level and not at the post level. As a consequence, this method may be viewed as a more structural ‘blogosphere’ analysis rather than an ‘issue’ or ‘event’ analysis.
    Using the open source network visualization software Gephi, we created yearly snapshots of the blogosphere. This image shows the growth and decline of the Dutch blogosphere. In grey we see the total blogosphere and in red the blogosphere per year. As you can see, there is no ‘blogosphere’ yet in 1999.
  • In this image of the 1999 pre-blogosphere you can see some of the early Dutch bloggers and their outlinks. There are some heavy linkers, but they don’t link to each other and therefore do not create a blogosphere.
  • 2000 is the first year of the Dutch blogosphere. In this research we were specifically interested in the types of platforms bloggers use; in this map you can see personal homepage providers in blue, student pages in pink, and early blog platforms, such as weblogs.com from Dave Winer, in yellow.
    A prominent node in the early Dutch blogosphere is Ludo van Hove, a Belgian blogger who, as mentioned before, is included in the Dutch blogosphere by our sources and by the other blogs in the network.
    This is a type of link analysis we call historical link analysis. We would now like to present some new types of analysis: a URL analysis and a source-code analysis.
  • We sought to contribute to the definition of a “national blogosphere” by investigating the Dutchness of top-level domains, software and platforms, thereby transforming the question “what is a Dutch blog?” into “where do Dutch bloggers blog?” in order to enrich and complicate the understanding of the location of web content.
    In a first type of URL analysis, a TLD analysis, we looked at the top-level domains of our blogs. This visualization shows the steady rise of the .nl domain at the expense of the .com domain, which, as we will now see, coincides with the preference for Dutch blog software and platforms.
  • Moving beyond the TLD to analyze where bloggers blog, we look at the software that powers the Dutch blogosphere.
    Bloggers often install software themselves, using self-hosted software such as Blogger or WordPress, or they blog on platforms, such as Blogspot.com and com.
    Detecting blog platforms is fairly straightforward and may be done through a second type of URL analysis (e.g. looking for blogspot.com in the URL), for which we compiled a list of blog platforms and coded all the blogs in our corpus.
    Detecting self-hosted software is less straightforward, as it is not standardized: it is often indicated by a ‘powered by’ statement, but not always; sometimes this appears in the footer or sidebar, but not always. This requires a different type of analysis: a source-code analysis.
  • We made use of the self-reflexive blogging practices of bloggers, who blog about their choices in blog software, in order to discover and extend the list of software to query for in the source code. We built a custom search feature on top of our archived special collection to search for the software used in our set of Dutch blogs.
  • We visualized the outcome of our platform URL analysis and source-code software analysis.
    What you see is the relative number of blogs in our set that use specific blog platforms and software.
    Findings: Our findings suggest that the early Dutch bloggers, the founding fathers of the Dutch blogosphere, did not use platforms. In general, the early Dutch bloggers preferred to create their blogs manually, or to use specifically designed self-hosted blog software.
    While there is more to say about this graph, we would like to focus on the orange bars, which represent Dutch software and platforms. Pivot is Dutch blog software that, as you can see, is widely used in the Dutch blogosphere.
  • When we zoom in on platform usage, we see the relative share of Dutch platforms over time, which slowly overtakes other platforms such as Blogspot and WordPress. We sought to contribute to the definition of a “national blogosphere” and the understanding of the location of web content by investigating not only the Dutchness of top-level domains but also that of software and platforms.
    In a final step we will focus on a rather ‘new’ actor in the blogosphere: the social media platforms.
  • With the rise of social media we see the increasing integration of social media features into blogs, creating what we would call a platformblog, which is characterized by embedding and linking content from social media platforms like Flickr, YouTube and Facebook and by referring to the author's presence on these platforms in sidebar widgets. Whereas in the mid and late 90s the self was defined on the personal homepage and later on the blog, with the rise of social networking sites and content platforms the self is now also defined and performed elsewhere. The platformblog is often used to present the distributed self across social media platforms. How would one study the platformblog and the distributed self?
    Asking this question led us to a common problem in online network visualizations: big platform nodes take a prominent position in the graph because all references to a platform are collapsed into one single node. In an attempt to demystify the position of the big platform nodes in the Dutch blogosphere, we further specified the analysis by asking: who are the actors in the blogosphere?
    Most network analysis software treats the host, and in some cases the sub-host, as the actor. However, in our case the ‘actor’ or blogger is often defined after the slash. Think, for example, of the early bloggers who started blogging from their personal homepage, or of the recent micro-bloggers on Twitter (/username). To identify nodes in the blogosphere as actors, we redefined what actors are at the URL level.
    Comparing the 2009 blogosphere with and without actor definition, it becomes clear that the social media platforms privilege a more fine-grained analysis. Social media are the big nodes in the network without actor definition; with actor definition, however, the social media platforms seem to lose prominence in the blogosphere.
  • This social media research project is a first attempt to develop measures to analyze more closely the linking practices between blogs and social media.
    The strategy for research is to further specify what is linked to within social media: user pages or content (e.g. video, photo, status update). Further analysis suggests that the links to social media mostly contain self-references and references to embedded (temporary) content such as videos and photos. We found that of the 160 unique bloggers who link to Twitter user pages, 98 (also) link to themselves. For Twitter at least, this supports the claim that the distributed self can be found in the sidebar, as a new actor in the blogosphere.
    Flickr and YouTube are linked to as embedded content, whereas Twitter and MySpace are widgetized selves. Unlike Dutch software, Dutch social media is not prominent in this space. Twitter, as micro-blogging, is very prominent.
  • Summarizing, this case study repurposed the Wayback Machine so as to trace and map transitions in linking technologies and practices in the blogosphere over time by means of digital methods and custom software. By creating a custom special collection from the Internet Archive we were able to create yearly network visualizations of the historical Dutch blogosphere between 1999 and 2009. This approach allowed us to investigate local blog cultures and to study the emergence and decline of blog platforms and social media platforms within the historical Dutch blogosphere.
    Thank you.
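The URL and source-code analyses described in the case study notes could be sketched roughly as follows. This is an illustration under stated assumptions, not the custom tooling used in the study. The function names are hypothetical, and so is the platform list: only blogspot.com comes from the notes, and `example-blogs.nl` is a made-up placeholder. The software names searched for are taken from the notes. URLs are assumed to include a scheme such as `http://`.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical platform list: blogspot.com is named in the notes,
# example-blogs.nl is a placeholder for a Dutch platform.
PLATFORM_HOSTS = ("blogspot.com", "example-blogs.nl")

# Software names mentioned in the notes, searched for in lower case.
SOFTWARE_NAMES = ("pivot", "wordpress", "blogger")


def tld_of(url):
    """Return the top-level domain of a URL, e.g. 'nl' or 'com'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def platform_of(url, platforms=PLATFORM_HOSTS):
    """Return the matching platform host, or None for self-hosted blogs."""
    host = (urlsplit(url).hostname or "").lower()
    for platform in platforms:
        if host == platform or host.endswith("." + platform):
            return platform
    return None


def software_in_source(html, names=SOFTWARE_NAMES):
    """Naive source-code analysis: which software names occur in the HTML?"""
    lowered = html.lower()
    return [name for name in names if name in lowered]


def tld_counts_by_year(blogs_by_year):
    """Map each year to a Counter of TLDs over that year's blog URLs."""
    return {
        year: Counter(tld_of(url) for url in urls)
        for year, urls in blogs_by_year.items()
    }
```

A plain substring search like `software_in_source` will over-match (a post that merely mentions WordPress counts as a hit), so its results would still need the kind of manual coding described in the notes.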
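The actor definition at the URL level, discussed in the notes on social media platforms, can be illustrated with a small sketch. It is not the implementation used in the study. The set of hosts where the actor is defined "after the slash" is an assumption: twitter.com comes from the notes, `home.example.nl` is a made-up homepage provider. The function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed hosts where the blogger is identified by the first path segment.
PATH_ACTOR_HOSTS = {"twitter.com", "home.example.nl"}


def host_of(url):
    """Host-level node: the default in most network analysis software."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def actor_of(url, path_actor_hosts=PATH_ACTOR_HOSTS):
    """Actor-level node: host plus first path segment on selected hosts."""
    host = host_of(url)
    if host in path_actor_hosts:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if segments:
            return f"{host}/{segments[0].lower()}"
    return host


def indegree(edges, node_of):
    """Count inlinks per node after collapsing link targets with node_of."""
    counts = Counter()
    for _source, target in edges:
        counts[node_of(target)] += 1
    return counts
```

Running `indegree` once with `host_of` and once with `actor_of` on the same edge list shows the effect described in the notes: without actor definition all links to a platform pile up on one node, while with actor definition they are spread over individual user pages.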

    1. Web Archives and Digital Methods: Reconstructing the Dutch Blogosphere with the Internet Archive. Anne Helmond, www.digitalmethods.net. NWO CATCH Meeting "Supporting Media Studies Research: Exploration and Contextualization" by BRIDGE on 22 June 2012, at the Netherlands Institute for Sound and Vision in Hilversum. Tomás Saraceno, galaxies forming along filaments, Venice Biennial 2009
    2. the digital methods initiative. www.digitalmethods.net
    3. a history of web archiving. 1. Internet Archive and the Wayback Machine (1996 - ) 2. Web sphere analysis and special collections (1999 - ) 3. The national turn (late 90s - ). See: Rogers, Richard. ‘The Website as Archived Object.’ In: Digital Methods. MIT Press, forthcoming.
    4. a history of web archiving. 1. Internet Archive and the Wayback Machine (1996 - )
    5. a history of web archiving. 2. Web sphere analysis and special collections (1999 - )
    6. a history of web archiving. 3. The national turn (late 90s - )
    7. web archives & types of historiographies. "[U]nlike other well-known media, the Internet does not simply exist in a form suited to being archived, but rather is first formed as an object of study in the archiving, and it is formed differently depending on who does the archiving, when, and for what purpose" (Brügger, 2005). See: Rogers, Richard. ‘The Website as Archived Object.’ In: Digital Methods. MIT Press, forthcoming.
    8. web archives & types of historiographies. 1. Wayback Machine produces single-site histories: "biographical" historiography 2. Special collections focus on elections and disasters: "event-based" historiography 3. National web archives focus on own portion of web: "national" historiography. See: Rogers, Richard. ‘The Website as Archived Object.’ In: Digital Methods. MIT Press, forthcoming.
    9. Internet Archive use: the limits. Input: Query. No search. Jump-cuts through time
    10. Internet Archive use (digital methods). Single website history - Capture history of website, and playback as screencast documentary (time-lapsed photography)
    11. "Google and the politics of tabs" by Govcom.org, Amsterdam, 2008.
    12. Internet Archive use (digital methods). Collection making. Build collections from the archive (e.g., Dutch extremist sites by NRC Handelsblad)
    13. Internet Archive use (digital methods). Capture periods of web history. Early blogosphere. Show what is missing from archive. Also give missing sites context.
    14. Internet Archive use (digital methods). Historical link analysis. Ammann, R. (2009); Stevenson, M. et al. (2009)
    15. Internet Archive use (digital methods). Historical link analysis over time. Ben-David, A. (2011)
    16. case study. Reconstructing the Dutch Blogosphere with the Internet Archive. Weltevrede & Helmond, 2012
    17. “not tonight dear, I’m busy playing with weblog software”
