Collecting in the Moment

In the wake of recent events at the University of Virginia surrounding the ousting, and later reinstatement, of President Teresa Sullivan, the University Library, including the University Archives, in the Albert and Shirley Small Special Collections Library, scrambled to collect picket signs, gather tweets and Facebook postings, and bring together other materials documenting the events on Grounds, even as they were unfolding. In light of this event and others like the Occupy Movements, Arab Spring, and 9/11, this discussion will explore questions of how institutions document, save, and preserve materials pertaining to current events, especially when those events are born through social networking sites. Gretchen Gueguen, Digital Archivist at the University of Virginia, will discuss her work to capture digital material from social media and websites during the Sullivan episode. A wide-ranging discussion with all audience members will follow to uncover questions of how to approach such events and what the role of the Archives or Special Collections might be in creating and managing such records.

Speaker: Gretchen Gueguen, Albert and Shirley Small Special Collections Library, University of Virginia

Moderator: Nicole Bouché, Albert and Shirley Small Special Collections Library, University of Virginia



  • June 10, 2012 was a Sunday, the sun was shining, and UVA was just getting into the swing of summer courses, when Teresa A. Sullivan, the President of the University, suddenly and unexpectedly announced her resignation… Meanwhile, I, the Digital Archivist at UVA, prepared to attend a Rare Book School class on the nineteenth-century book trade for the coming week.
  • Reactions grow increasingly vocal around Grounds as both town & gown become suspicious of the motives and actions of the Board of Visitors, and especially its Rector, Helen Dragas… … meanwhile, I had turned OFF my computer for the week in order to fully pay attention to my class…needless to say, I was not the most informed person on campus about what was going on
  • On June 18th, the first public demonstration in support of Sullivan occurs on UVA’s historic Lawn during a specially convened Board of Visitors meeting to discuss the resignation.
  • At this time activities around the library began to coalesce. Several groups across the libraries began to discuss how to work together to save the historic record related to events. The University Archives had already decided to attend the first rally in order to collect signs and had been in touch with the Faculty Senate. A cross-departmental group in the library, including the Archives, our Digital Humanities unit, and part of our IT department, formed to coordinate activities and outreach. In addition, the University Records Manager, buried in FOIA requests, consulted with the group about her activities.
  • So it was that around 9 a.m. on June 19th I finally thought… Wait a minute, did the president of the university get fired or something? I’m over-emphasizing here to be funny, but the events really did catch me a bit by surprise. In addition, we hadn’t really gotten to the point of discussing things like the capture of current social media or web-based information in relation to my newly formed position at UVA. It was something that I knew would have to be addressed, but it hadn’t been yet.
  • So I’m going to spend the rest of this morning talking about the work that I did over the weeks of the UVA leadership crisis last summer to try to capture some of the online content created by the university community in response. I think that the issues I had to figure out are really emblematic of the kinds of work that libraries, and particularly those that gather archives of unique materials, are going to have to face. Hopefully they won’t face them in the midst of a crisis, but they will have to face them eventually.
  • And they will have to face them because the internet is no longer “ephemera.” It is a publishing platform, it is a space for interaction and creation, it is a storage medium; it is all these things. The public events here at UVA, the emergency Faculty Senate meeting, the protests, even the content gathered for articles in the newspapers were all based in one way or another on web-based media. Because of the use of these emerging technologies, I felt that capturing material related to campus events would be important not just because it documented undoubtedly historic events, but also because of the novel use of these technologies to communicate. In some sense, the medium was at least partly the message here.
  • So, on the 19th I set to work trying to document these various online sources. I’m going to spend the rest of my time this morning talking with you about attempts to harvest content from these sources: Twitter, blogs and other web objects, Facebook, news sources, and video.
  • Twitter was the first and most important source to figure out. It was also really difficult. Twitter has an application programming interface (API), which is basically a set of open protocols that allow people to build tools that work with Twitter’s data. This means that third parties can build tools that allow you to download tweets in different data formats like XML or spreadsheets. The API limits you to no more than 1,500 tweets at a time. That is a lot, but when a topic is really popular, 1,500 tweets can go by in no time at all. So time was of the essence… My goals for creating a Twitter archive were, first, to find a good tool for finding and saving tweets. Second, I needed to figure out how to get the oldest tweets related to the subject that I could. Third, I wanted to use the tweets to figure out what people were talking about, posting links to, and organizing.
  • I ended up using two different tools. The first, shown here, is called The Archivist. It does a search, just as you would on Twitter, for a hashtag, keyword, or user profile (which will get both that user’s tweets and those that reference them). It can save the output as XML or as tab-delimited text that can be imported into Excel. The biggest drawback was that The Archivist wasn’t set up to run simultaneous searches and had to be actually open and running to capture them. This meant that, sometimes once an hour if it was a busy day, I had to open the tool and do a dozen or so searches to get what had been posted in the last hour. This was obviously time-consuming, and I did want to occasionally eat, sleep, or leave my desk. If it was really busy and I waited too long between captures, I would exceed the API’s limit of 1,500 tweets and be unable to capture some of them. I continued to use The Archivist, but also kept searching for another option.
  • The other tool I used was a script created for Google Spreadsheets. You just open this customized Google spreadsheet, tweak a couple of lines in the associated script, and let it do its thing. This ended up being the best option. It would capture tweets even when I didn’t have the spreadsheet open. The data was saved in a spreadsheet, so it was exportable, and it captured the most complete data of all the tools. It would crash if it became too full, which it did about once a day early on. But I quickly figured out that I could export the data that had actually been saved, then delete it from the live version and start again. So instead of having to go through a series of procedures every hour, I only had to do it once a day, if that. I did still continue to use The Archivist as a backup; it was actually also better at searching individual accounts. But once the initial uproar died down, I didn’t need to back up with The Archivist as frequently. And I obviously felt much better having a backup of the data. Overall, I estimate that we probably collected around 80,000 unique tweets. I have no analysis right now, though, of how much of that was retweeted content or irrelevant.
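The export-then-clear routine for keeping the live spreadsheet from overfilling can be sketched as a small helper. This is an illustrative sketch only; the function name, the column names, and the CSV layout are hypothetical, not the actual format of the spreadsheet script.

```python
import csv
from datetime import date
from pathlib import Path

def rotate_capture(live_rows, archive_dir):
    """Write the accumulated tweet rows to a dated CSV in archive_dir,
    then hand back an empty buffer so the live capture can start fresh."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    out_path = archive_dir / f"tweets-{date.today().isoformat()}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "user", "created_at", "text"])  # assumed columns
        writer.writerows(live_rows)
    return out_path, []  # archived file, plus a fresh empty live buffer
```

Running this once a day, as described above, turns an hourly chore into a single export-and-reset step.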
  • All files contained retweets, overlapping capture times at the beginning and end, and tweets from accounts that have since been deleted.
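The overlap and retweet duplication can be reduced after the fact by keying every capture on the tweet id. A minimal sketch, assuming each capture has already been parsed into dicts with an `"id"` field (a hypothetical layout, not the actual export format):

```python
def dedupe_tweets(captures):
    """Merge several capture files' rows into one set of unique tweets,
    keyed on tweet id. The first capture of an id wins; later duplicates
    from overlapping capture windows are dropped."""
    seen = {}
    for rows in captures:
        for tweet in rows:
            seen.setdefault(tweet["id"], tweet)
    return list(seen.values())
```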
  • Twitter has since updated their API to provide access to older tweets. This means that we could re-harvest these tweets by using their Twitter id numbers and then have a set with a uniform data format. So far we’ve re-harvested about 53,000 of them based on the Twitter ids that were taken from the XML files. Another attempt to retrieve them from the spreadsheets hit a snag because the spreadsheets were formatted differently, but that should be easy to solve. Twitter has also released a basic tool for exploring an archive, called Grailbird, that could be used. One problem is that if an account has been deleted since it was originally gathered, those tweets are no longer available. Once an account is deleted, all of its content is removed from the stream. So parody accounts (of which there were several) that were taken down would be effectively lost. In our initial re-harvest this number was under 5%. We are considering how to map the initial capture of those tweets to match the data format of the re-harvest.
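That mapping problem, falling back to the original capture for tweets the re-harvest can no longer return, can be sketched as follows. The function and field names are hypothetical; the point is just that the merge is keyed on tweet id and each record keeps a note of where it came from:

```python
def fill_deleted(reharvested, original):
    """Combine a re-harvest with the original capture. Where a tweet id is
    missing from the re-harvest (e.g. the account was deleted), fall back
    to the originally captured row, tagging each record with its source."""
    by_id = {t["id"]: dict(t, source="reharvest") for t in reharvested}
    for t in original:
        if t["id"] not in by_id:
            by_id[t["id"]] = dict(t, source="original_capture")
    return by_id
```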
  • I realized that the Twitter content could be a great lead into other web-based content related to the events. These were the websites, videos, pictures, and articles that people were really talking about and which formed their conception of events. The issue was that I didn’t readily have on hand a tool that would extract these links for me for referral. The API would make it possible to build a tool that did that, and I have used a tool like that in the past, but it is no longer available because it was bought by a third party who discontinued the service. I didn’t have a lot of time to exhaustively look for tools that performed this activity, so I spent a lot of time clicking on links. The two main issues with this were that many people shorten their URLs, so there was no way to tell what they led to without clicking on them, and that many people retweeted links, so I saw the same sources again and again when I clicked on them.
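The link-extraction step can be sketched with a simple pattern match over the tweet texts. This is a minimal sketch of my own, not the discontinued tool: it pulls out every URL and counts repeats, so a heavily retweeted link surfaces once with a count instead of having to be clicked again and again. (Expanding shortened URLs would still require following redirects over the network, which is omitted here.)

```python
import re
from collections import Counter

# Deliberately loose pattern: good enough to pull links out of tweet text.
URL_RE = re.compile(r"https?://\S+")

def tally_links(tweet_texts):
    """Extract every URL from a batch of tweet texts and count how often
    each one appears across the batch."""
    counts = Counter()
    for text in tweet_texts:
        counts.update(URL_RE.findall(text))
    return counts
```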
  • This is an example (this is just from my current twitter feed, by the way) that shows what it looks like when someone posts a link. The one item in the square there is a link to a picture. Twitter allows you to post pictures natively, so instead of posting them somewhere like instagram or facebook and linking to them, they really only exist within your twitter account. The only way to ensure these weren’t lost was to try and grab them when I saw them.
  • This is an example of a twitter picture. As you can see the picture just shows up within your twitter wallpaper, so that adds somewhat to the context of how these were presented.
  • Some of these pictures were really great and showed a view of the events that didn’t really make it into official sources and captured some things, like the slogans on the beta bridge, that didn’t really last very long
  • So, I began collecting links to sources, but they needed to be captured very soon in order to ensure that they were grabbed before they possibly disappeared. I knew that tools existed to set up web crawls, and that these tools were the basis of efforts at other institutions to capture their online presence. However, these tools are somewhat difficult to implement, requiring some sophisticated configuration, and the output they produce is somewhat limited. The type of capture that I wanted to do (one post on a blog, for example, not the entire thing) was also not exactly the same as web crawling. In the end, I realized that, since I was looking at each of these sources anyway to decide if they should be included, I could just use the “Save As” command in Firefox to save them as an HTML document with a folder of associated content. In addition, I used a Firefox plugin called Screengrab to make a screenshot of the entire page. After Twitter and Facebook (which I’ll discuss in a minute), this content is some of the best that was captured. It was completely unmediated and therefore was sometimes interesting, sometimes completely uninformed and biased, occasionally hilarious, and it really captures the essence of reactions to the situation.
  • Overall, they add a really human element and they give the face of the everyday public embroiled in the event. This kind of narrative of the average person is something that archivists and historians really prize. The public statements of great figures tend to be valued and kept, while those of the common person can slip through the cracks. It also highlights how much of the message of these events was about personalizing them: “I AM UVA” and how much people self-identify with the university.
  • On the other hand, there were some really serious and well-read analyses of events on blogs and other unvetted sources
  • Some, like this particular blog post by a UVA alum really galvanized people to discuss what was going on (and share conspiracy theories) in the comments
  • Other things were interesting in how they took advantage of the media
  • And how they tried to use social media as a tool to effect change in a grass roots way (this is a petition on Change.org)
  • At the time I had no easy way to create web-archive (WARC) files for individual pages. Tools exist to create these files, most notably Heritrix, which is the tool used for the Internet Archive. But these do web crawls… you supply a seed page and the tool follows links to a depth that you determine, gathering content as it goes along. I didn’t want to crawl, but in some cases I did want to capture page two of an article or something like that. I ended up using the “Save As” command in Firefox, which saves an HTML file and a folder of associated files like stylesheets, scripts, and images. I also used a Firefox plugin called “Screengrab” to create a screenshot. Since then I’ve learned of a tool called WAIL, which stands for Web Archive Integrated Layer, as well as another called WARCreate, which is a Chrome extension. Both will do exactly what I needed, which is create a WARC file of a single page. However, as both are side projects by a computer science grad student with a lot of competing demands, as yet they have not proven viable (i.e., I’ve downloaded them and had mixed results). The plan, however, is to try to re-harvest all of this content (I saved a spreadsheet with the URL, source, and date of every web page captured). An interesting question will be how much has been removed since then…
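To give a sense of what those tools produce: a WARC record is just a block of headers followed by the payload. The hand-rolled sketch below builds a single bare-bones “resource” record for one saved page; real tools like WARCreate or Heritrix handle many more details (payload digests, captured HTTP headers, warcinfo records), so this is an illustration of the format, not a substitute for them.

```python
import uuid
from datetime import datetime, timezone

def warc_resource_record(url, payload: bytes) -> bytes:
    """Build one minimal WARC 1.0 'resource' record for a captured page:
    named headers, a blank line, the payload, and the record separator."""
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {warc_date}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode("utf-8") + payload + b"\r\n\r\n"
```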
  • A subset of this kind of content with some particularly unique characteristics is facebook. The main rallies that took place on grounds were largely organized through facebook “Groups”…anyone on facebook can start a group and it’s just another wall where the members can post content as a way of discussing it with each other. Group administration allows the administrators and members to make some of their content visible to their members only. Collecting this would seem to be a violation of privacy since membership had to be requested and granted by an administrator. However, none of the content at all was visible to someone without a facebook account.
  • So this is what the group for Students, Families, & Friends United to Reinstate Teresa Sullivan looks like if you aren’t logged in to facebook.
  • But if I just sign in as a facebook-user, but not as a member of this group, I can see all of these posts as well as who is a member
  • So there are a lot of questions here, facebook accounts are free and anyone can get one, so the default of having groups invisible to the public but visible to facebook users seems somewhat contradictory (and a way for facebook to get more people to join). We were advised by our University Legal Counsel that we should go ahead and capture the content and keep a preservation copy, but not to make it accessible without first discussing the issue with facebook.
  • This group evolved over time, and I tried to capture it as it changed its look and message. By this point things were far more organized and focused on planning events rather than just joining together to share outrage. I also changed my Facebook profile picture. An interesting issue I didn’t think about until it was too late was that I needed to log in as myself in order to see these at all, since we didn’t have a departmental account, and even if we did I doubt we would have set it up for this reason (although we may do so in the future). This is kind of embarrassing for me as, at the time, my Facebook profile picture was me and my dad in 1979 after I got a bath… (removing this from these pages is one of the next steps I want to work on). In retrospect, setting up a Facebook account for the department in order to capture pages could be a better solution. The profile could be public or private depending on what else we wanted to do with it.
  • The other big source of online content was from news sources: papers, radio, and TV. In general this content isn’t much different from the other web content, but it did get to be really overwhelming in volume. The nice thing about it, though, was that since these were more established sources with significant resources, I was less worried about the content disappearing. So much of the material I’ve collected from these sources has been gathered after the fact. The question of why to capture this content is an intriguing one. We have also collected the paper versions of the local paper and some of the local weeklies, and a lot of the content is redundant between those two sources. In addition, the content of many of the papers is aggregated into databases like LexisNexis. Some things do appear only in one source or the other, so gathering both web and paper for things that are not preserved elsewhere makes sense. The paper also captured a lot of intangible factors. For example, seeing the huge bold REINSTATED headline on the top of the Washington Post there (this is a scan of the front page grabbed from the Newseum) carried a different message than the online version. The other question, of why save it locally when it is saved elsewhere, has two answers. First, we are creating an easier access point for researchers. So in this case the capture really has to do more with access than preservation. But the databases only grab content, not comments. The commentary is really interesting, and this is one of those places where the medium is really shaping the message. That sort of content would not have typically been captured in the past unless someone wrote a letter to the editor and it was published. Even then it would only be one side of the story, not the ongoing dialogue that is found in some articles (and, to be honest, a lot of trolling, spam, and other nonsense as well).
We decided that capturing this was most important for the local papers, and so have tried to be exhaustive with those. We also are trying to capture them at least twice in case details are updated over time.
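The twice-over capture raises a practical question: how do you tell whether the second pass actually picked up an update? One simple approach, sketched here (my own illustration, not part of the original workflow), is to compare content hashes of the two captures:

```python
import hashlib

def capture_changed(first: bytes, second: bytes) -> bool:
    """Compare two captures of the same article by SHA-256 hash, so a
    second pass can flag stories that were updated after the first capture."""
    return hashlib.sha256(first).hexdigest() != hashlib.sha256(second).hexdigest()
```

A changed hash would mean the second capture is worth keeping alongside the first; an identical hash means the later copy is redundant.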
  • The only issue we’ve found is with newspapers that require a subscription to view their “premium” content. Even when the article is downloaded during a free trial, the script which triggers this authentication is still saved, so it pops up and requires authentication before you can read the content in the HTML view. The HTML content is still saved, however, and could be read in the code directly. In the end I decided that capturing this content wasn’t worth this particular barrier to access, so we collected only a limited number of articles from this source (the local newspaper).
  • A number of online sources involved some type of audio and video, and these presented some particular difficulties for capture. The most prevalent source was YouTube, which seems pretty obvious. While these are publicly posted and there isn’t any privacy violation in capturing them, there is not an easy way to download them. YouTube’s basic license states that the owner is placing them on YouTube for access and basically says that they are not there to be downloaded. Users can opt for a Creative Commons license instead, which doesn’t have this restriction, but the issue of how to download is still present. I found another Firefox plugin called “Download Helper” that enabled me to download the ones we felt we could. We again consulted our legal counsel, who advised us to do the same as for Facebook: download for preservation, talk with YouTube before doing anything else. News sources also do not encourage downloading, so we kept a list of these videos and could therefore ask for them at a later time. Several events were actually streamed online, which, by default, means there is no download. There are tools that will allow you to hack a stream and download a copy in real time, but I did not feel that was worth it. We’ve established a relationship with the Public Affairs office on campus, who want to deposit their original video with us as well.
  • Finally, although I didn’t mention this at the top, I want to also note the creation of the online contribution site created by the Scholars’ Lab. Everyone involved agreed that this would be a great way to capture what the public thought was important, especially since we couldn’t have eyes everywhere to see everything ourselves. We realized that there would be issues, though, if we didn’t protect ourselves from the possibility of people posting content that infringed on someone else’s copyright. To protect against this we had a member of the University Legal Counsel approve a disclaimer regarding rights, and we created an option in the contribution form to allow people to indicate whether or not they wanted an item to be public. We also required that one of us on the staff approve each item before it was made public. So far, we’ve had over 100 contributions. There have been pictures, video, copies of emails, and links to online sources.
  • A lot of this content is not duplicated among the images I’ve collected from other sources, and in some cases it provides a better source than anything else I’ve collected. This picture, for example, documents one of the signs that we really liked but haven’t yet gotten as a donation.
  • So the final tally of what we have in a digital format so far: tweets: 80,000; news articles: 571; blog posts: 147; other web content: 196; Twitter pictures: 243; videos: 69; documents: 21; user-contributed items: 118. These numbers will continue to grow, I’m sure. For example, this does not take into account pictures and video from Public Affairs. And this is in addition to the 100 or so rally signs and a couple dozen newspapers. Some of the rally signs are simply too large or fragile for us to properly store, so we will probably scan them for access and dispose of them, thereby growing the collection more.
  • As for what happened with the governance crisis: the President was reinstated by the Board of Visitors 18 days after resigning; however, the Rector of the University was also re-appointed by the Governor and still holds her office. Relations between the Board and particularly the faculty remain very strained, and information is still coming out about the events of last summer. We have decided that our archive will pertain only to the events of those two weeks, and we are no longer actively collecting social media.
  • As far as the work that has been done so far: a preliminary finding aid has been created, a small group of technical folks have helped to think through the data issues, and we’ve begun to plan for the Twitter and web re-harvest. I’ve provided access to the Twitter data set in a couple of cases for students and faculty.
  • But that still leaves a lot to actually figure out. When we are ready to provide more routine use in the reading room, questions of how users will navigate and search the collection will need to be answered. The objects also need to be prepared for ingest into our repository, which primarily means figuring out their metadata needs: technical, descriptive, and preservation. And finally, we may decide to do further appraisal and de-accessioning before that ingest.
  • So, with that, I would be happy to take any questions and I thank you for being here today.

Collecting in the Moment: Presentation Transcript

  • Collecting in the Moment Gretchen Gueguen University of Virginia RBMS Pre-conference June 24, 2013
  • June 10, 2012 Teresa A. Sullivan, President of the University of Virginia, announces resignation… …Gretchen M. Gueguen, Digital Archivist at UVA, prepares to attend Rare Book School the next day
  • June 11-16, 2012
  • June 18, 2012
  • June 18, 2012 • Decision is made to form a cross- departmental group within the library to discuss saving the historic record related to these events
  • At 9:00 a.m. on June 19th … me
  • What’s the Big Deal? • Digital is THE publishing platform • Event was important for both the historic nature of the events (message) but also HOW it was communicated (medium)
  • Springing into action • Twitter • Blogs and Web • Facebook • News • Video
  • Twitter API • Allows you to download tweets as data for a given hashtag, user, or keyword search (#woo-hoo!) • Has many tools available for doing all kinds of neat stuff (#woo-hoo!) • Limits you to just the last 1500 tweets for any given search (#d’oh!)
  • Info at: http://mashe.hawksey.info/2011/11/twitter-how-to-archive-event-hashtags-and-visualize-conversation/
  • Final Collection • 47 XML files – #BOV, #UVA, #rally4honor, #dragasmustgo – @cavalierdaily, @LarrySabato, @Rector Drago, @strategydynamo • 47 spreadsheets – Hashtags only (#UVA, #sullivan, #BOV, #fillthelawn, #strine, #united4honor)
  • Twitter API update Re-harvest has returned ~53,000 tweets – Data issues – Deleted accounts
  • Posted content • Links, pictures, video related to the story • Could not find a tool to just extract these to look through later • Many shortened links that had to be clicked on to find out what they held • Many links were retweeted
  • Blogs and other web content • How to capture everything else • Tools for web capture – Difficult to implement – Don’t do exactly what is needed – I’m running out of time! • Solution: – I have to look at it anyway to select, so • Firefox “Save As” • Screengrab plugin for screenshot
  • Web sites • No way to create web-archive standard (WARC) files at the time – ~1,000 HTML +archive – Screenshot Investigation of WAIL (Web Archive Integrated Layer) to create WARC files – Will require a re-harvest of URLs to ensure proper header metadata – But has automated way of doing this
  • Facebook & “Privacy” • Rallies on grounds were organized through Facebook “groups.” • Some posts are visible only to members of the group. All others are only visible to those with a Facebook account.
  • Facebook & Privacy • Facebook accounts are free • But this still means the content wasn’t “public” as per the TOS
  • News • Relatively easy to capture • Overwhelming in volume • Why capture the online version? – Some things only appear online, some only in print – Online version, for many sources, allows commenting • Why capture this when it will be saved elsewhere? – Reference collection – Databases may capture content but not commentary
  • Subscriptions
  • Audio/Video • YouTube • News • WINA podcast • WUVA streaming • Streaming Board Meetings • Public Affairs
  • User Contributions • Capture what the public thought was important • Possible violations of privacy or intellectual property
  • Final Tally • Tweets: 80,000 ? • News articles: 572 • Blog posts: 147 • Other web content: 196 • Twitter pictures: 243 • Video: 69 • Documents: 21 • User-Contributed Items: 118
  • What’s Been Done • Preliminary collection finding aid • Working with small group on twitter and web data issues • Twitter and web re- harvest • Access provided in a few cases
  • What Needs to Be Done? • Access – Searching – Use • Metadata • Further appraisal decisions/de- accessioning
  • What About Next Time? • Need to establish a web/social media collection plan – If we are routinely capturing certain things we won’t have to worry about them during a crisis – Tools change rapidly; working on collecting routinely will better position us to adapt