Social Feed Manager
Laura Wrubel
@liblaura @SocialFeedMgr
http://go.gwu.edu/sfm
Web Archives and Digital Libraries workshop, JCDL 2016
Social Feed Manager is supported by the National Historical Publications & Records Commission
Allows users to create collections of data
from social media platforms
Open source software, not a black box
Research documentation (for researchers)
≈ provenance metadata (for archivists)
(and it’s really important for both)
Creation
Authoring of the social media
● Creation metadata is provided by Twitter as JSON via API.
● Social media user metadata:
○ Screen name
○ Date account created
○ Location
● Tweet metadata:
○ Date
○ Tweet text
○ Mentions
○ Hashtags
○ URLs
○ Source (how posted)
● SFM records it in WARC files.
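As a rough illustration of the creation metadata listed above, the sketch below pulls those fields out of a tweet object. It assumes the field names of Twitter's REST API v1.1 JSON; the sample tweet is hand-written for illustration, and `creation_metadata` is a hypothetical helper, not part of SFM.

```python
# Hand-written sample in the shape of a Twitter REST API v1.1 tweet object.
# All values are made up for illustration.
SAMPLE_TWEET = {
    "created_at": "Wed Jun 22 14:05:00 +0000 2016",
    "text": "Talking provenance with @SocialFeedMgr #wadl2016 http://t.co/xyz",
    "source": '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    "user": {
        "screen_name": "example_user",
        "created_at": "Mon Mar 02 09:00:00 +0000 2009",
        "location": "Washington, DC",
    },
    "entities": {
        "user_mentions": [{"screen_name": "SocialFeedMgr"}],
        "hashtags": [{"text": "wadl2016"}],
        "urls": [{"url": "http://t.co/xyz", "expanded_url": "http://go.gwu.edu/sfm"}],
    },
}


def creation_metadata(tweet):
    """Collect the user and tweet metadata named on the slide into one dict."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        # Social media user metadata
        "screen_name": user.get("screen_name"),
        "account_created": user.get("created_at"),
        "location": user.get("location"),
        # Tweet metadata
        "date": tweet.get("created_at"),
        "text": tweet.get("text"),
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "source": tweet.get("source"),
    }


if __name__ == "__main__":
    for key, value in creation_metadata(SAMPLE_TWEET).items():
        print(f"{key}: {value}")
```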
Selection
Decisions by the SFM user that lead SFM to harvest the tweet
Recorded in the SFM database
● Collection information
○ Harvest type
○ Harvest options (e.g., incremental, harvest web resources)
○ Credentials (API keys)
○ Description of collection
● Seeds for the collection (which vary by platform)
○ Screen name
○ UID
○ Keywords to filter on
● Change log
○ Change note
○ Fields changed
○ User who made change
○ Date of change
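One way to picture the selection metadata above is as a collection record that logs every edit. This is a minimal sketch, not SFM's actual database schema: the class names, field names, and the `update` method are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeLogEntry:
    note: str             # change note
    fields_changed: dict  # field name -> (old value, new value)
    user: str             # user who made the change
    date: datetime        # date of change


@dataclass
class Collection:
    harvest_type: str
    description: str
    harvest_options: dict = field(default_factory=dict)
    seeds: list = field(default_factory=list)  # screen names, UIDs, or keywords
    change_log: list = field(default_factory=list)

    def update(self, user, note, **changes):
        """Apply field changes and record what changed, by whom, and when."""
        changed = {}
        for name, new_value in changes.items():
            old_value = getattr(self, name)
            if old_value != new_value:
                changed[name] = (old_value, new_value)
                setattr(self, name, new_value)
        if changed:  # only log edits that actually altered something
            entry = ChangeLogEntry(note, changed, user, datetime.now(timezone.utc))
            self.change_log.append(entry)
        return changed


if __name__ == "__main__":
    # Hypothetical values throughout.
    coll = Collection(harvest_type="user_timeline", description="Workshop accounts")
    coll.update("archivist1", "Added project account", seeds=["SocialFeedMgr"])
    for entry in coll.change_log:
        print(entry.date.isoformat(), entry.user, entry.note, entry.fields_changed)
```

Keeping the old and new values in each log entry lets a later reader reconstruct what the selection criteria were at the time of any given harvest.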
Collection
How SFM retrieved the tweet from Twitter’s API
● Collection metadata is received by SFM’s Twitter harvester & recorded
within WARCs.
● WARCs include the exact HTTP request/response
○ URL with params such as user account id or keywords
○ HTTP headers
● WARC record headers also include:
○ Date WARC record created
○ Server information
○ Fixities
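To make the record-level fixity concrete, here is a small sketch that parses a WARC record's header block and checks its `WARC-Block-Digest`. It assumes the common `sha1:` plus Base32 digest form; the WARC standard allows other algorithms, and the sample record, URL, and helper names are made up for illustration rather than taken from SFM.

```python
import base64
import hashlib


def warc_digest(data: bytes) -> str:
    """Return a digest in the widely used 'sha1:<Base32>' form."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def parse_warc_headers(header_block: str) -> dict:
    """Split a WARC record header block into its version line and named fields."""
    lines = header_block.strip().splitlines()
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return {"version": lines[0], "fields": fields}


def block_is_intact(headers: dict, block: bytes) -> bool:
    """Compare the recorded block digest with one recomputed from the block."""
    return headers["fields"].get("WARC-Block-Digest") == warc_digest(block)


if __name__ == "__main__":
    # Hypothetical response block and header values.
    block = b'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{"id": 1}'
    header_block = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Date: 2016-06-22T14:05:00Z",
        "WARC-Target-URI: https://api.example.org/timeline?screen_name=example_user",
        "WARC-Block-Digest: " + warc_digest(block),
        "Content-Length: " + str(len(block)),
    ])
    headers = parse_warc_headers(header_block)
    print(headers["fields"]["WARC-Date"], block_is_intact(headers, block))
```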
Collection (cont)
● WARC file metadata, recorded in the SFM database:
○ File location
○ File size
○ Fixity
○ Creation date
● Harvest metadata:
○ Date
○ Collection
○ Date harvest started
○ Date harvest ended
○ Messages (informational, warning, or error)
○ Token/seed updates
○ Basic stats on number of items collected
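The file-level metadata above (location, size, fixity, date) could be gathered along the following lines. This is a sketch under stated assumptions: SHA-1 is an arbitrary choice of digest, the file's modification time stands in for a creation date, and `warc_file_metadata` is a hypothetical helper rather than SFM code.

```python
import hashlib
import os
from datetime import datetime, timezone


def warc_file_metadata(path, chunk_size=1 << 20):
    """Return location, size, fixity, and timestamp for one file."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large WARC files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    stat = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "size": stat.st_size,
        "sha1": digest.hexdigest(),
        # Assumption: modification time is used as a proxy for creation date.
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    import tempfile

    # Write a small stand-in file; a real run would point at a WARC on disk.
    with tempfile.NamedTemporaryFile(suffix=".warc", delete=False) as tmp:
        tmp.write(b"WARC/1.0\r\n")
    print(warc_file_metadata(tmp.name))
    os.remove(tmp.name)
```

Recomputing the digest later and comparing it with the stored value shows whether the file has changed since the harvest.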
Working paper: http://bit.ly/tweet-prov
Comments welcome!
How is this useful? http://bit.ly/tweet-prov
● Which of this provenance metadata do you (researcher,
archivist, librarian, etc.) want access to?
● How do you want access to this metadata? In SFM’s UI? In
reports when exports are created? Exposed via SFM’s
software libraries? A REST API? Machine-readable?
Human-readable?
● What metadata have we missed?
● Do the answers to the previous questions vary by discipline
(e.g., humanities, social science, etc.)?
● Are there other relevant specifications or standards that we
should consider? Is there value in a mapping to or providing
output in accordance with metadata standards such as
PREMIS or PROV?
