At the Internet Archive (IA) we collect static and dynamic lists of seeds from various sources (such as Save Page Now, Wikipedia EventStream, and Cloudflare) for archiving. Some of these seeds are web pages that contain videos. These URLs are curated against a set of criteria to identify videos that should be archived or excluded. Candidate video page URLs are placed in a queue (currently Kafka) to be consumed by a separate process. We maintain a persistent database of videos we have already archived, which is used both for status tracking and as a seen-check system to avoid duplicate downloads of large media files that usually do not change. We use youtube-dl (or one of its forks) to download videos and their metadata. We archive the container HTML page, the associated video metadata, any transcriptions, thumbnails, and at least one of the many video files available in different resolutions and formats. These pieces are stored in separate WARC records (some of type “response” and others of type “metadata”). Some popular video streaming services do not have static links to embedded video files, which makes it difficult to identify and serve the video files corresponding to their container HTML pages on archival replay. To glue related pieces together for replay we currently use a key-value store, but we are exploring ways to do away with this additional index. We use a custom video player and perform the necessary rewriting in the container HTML page for a more reliable playback experience. We create a daily summary of the metadata of videos that we have archived and load it into a custom-built Video Archiving Insights dashboard to identify any issues or biases; these findings serve as a feedback loop for quality assurance and for improving our curation criteria and archiving strategies. We are always looking for ways to improve the system so that it works at scale, as well as means to interoperate.
Recording: youtube.com/watch?v=6MiYKOq_DKo
Video Archiving and Playback in the Wayback Machine
1. Video Archiving and Playback in the Wayback Machine
Sawood Alam, Bill O'Connor, Corentin Barreau, Kenji Nagahashi, Vangelis Banos, Karim Ratib, Owen Lampe, Mark Graham
Wayback Machine, Internet Archive
sawood@archive.org
@ibnesayeed @waybackmachine @internetarchive
IIPC Web Archiving Conference (WAC), May 24, 2022, Online
2. Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Static Video Files vs. Dynamic Video Streams
https://web.archive.org/web/20220504160213/https://developer.mozilla.org/en-US/docs/Web/HTML/Element/video
https://web.archive.org/web/20220330074939/https://www.youtube.com/watch?v=dCBy9z3f9Mw
Archiving static video files and playing them back is easy using traditional methods
Archiving videos with dynamic streams at a large scale and playing them back reliably requires specialized solutions
3. Video Archiving Pipeline
Step 1: Identify Candidates
Select web page URLs that are likely to have embedded videos
Step 2: Check Status
Check if the video is already archived to avoid duplicates
Step 3: Fetch Metadata
Collect metadata of videos on selected pages
Step 4: Filter/Curate
Apply curation filters to exclude certain videos
Step 5: Archive
Create WARC records for web page, metadata, video, etc.
4. Step 1: Identify Candidate Pages With Videos
Derive based, End of Term, Wide/Survey Crawls
Match certain patterns in the URI feeds from various sources (except those that contribute WARC files directly)
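A minimal sketch of the pattern matching described above, assuming candidate pages can be recognized from the shape of their URLs alone. The patterns and the `is_candidate` and `select_candidates` helpers are illustrative assumptions; the actual patterns are not listed in the slides.

```python
import re

# Hypothetical patterns for pages likely to embed a video; the real
# rules used in production are not given in the slides.
CANDIDATE_PATTERNS = [
    re.compile(r"^https?://(www\.)?youtube\.com/watch\?(.*&)?v=[\w-]{11}"),
    re.compile(r"^https?://youtu\.be/[\w-]{11}"),
    re.compile(r"^https?://(www\.)?vimeo\.com/\d+"),
]


def is_candidate(url: str) -> bool:
    """Return True if the URL matches any candidate video page pattern."""
    return any(p.search(url) for p in CANDIDATE_PATTERNS)


def select_candidates(feed):
    """Yield only the URLs from a feed that look like video pages."""
    for url in feed:
        if is_candidate(url):
            yield url
```

Because `select_candidates` is a generator, it can sit in front of a queue producer and pass URLs along as they arrive from a feed.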
5. Step 2: Check Video Status
● Query a custom HTTP video status API
○ Maintains a database of previously archived/attempted videos
○ Serves as a seen-check service for videos
○ Provides other useful information such as datetime and source
● Ignore the video URI if it is already archived
● Update the video status database after archiving the video
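The seen-check behavior in the bullets above can be sketched with a small local store. This is only a stand-in for the custom HTTP video status API: the `VideoStatusStore` class, its table schema, and the status and source values are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


class VideoStatusStore:
    """Toy stand-in for the video status database (schema is assumed)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS video_status ("
            "video_uri TEXT PRIMARY KEY, status TEXT, source TEXT, updated TEXT)"
        )

    def lookup(self, video_uri):
        """Return (status, source, updated) or None if the URI was never seen."""
        return self.db.execute(
            "SELECT status, source, updated FROM video_status WHERE video_uri = ?",
            (video_uri,),
        ).fetchone()

    def is_archived(self, video_uri):
        """Seen-check: True only if the URI is recorded as archived."""
        row = self.lookup(video_uri)
        return row is not None and row[0] == "archived"

    def update(self, video_uri, status, source):
        """Record the latest status, source, and UTC datetime for a URI."""
        now = datetime.now(timezone.utc).isoformat()
        self.db.execute(
            "INSERT OR REPLACE INTO video_status VALUES (?, ?, ?, ?)",
            (video_uri, status, source, now),
        )
        self.db.commit()
```

A consumer would call `is_archived` before downloading and `update` after the archiving attempt finishes, so that failed attempts are tracked as well.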
6. Step 3: Fetch Video Metadata
{
"title": "How to use the Internet Archive",
"description": "The Internet Archive (archive.org) is a nonprofit library…",
"duration": 1371,
"id": "dCBy9z3f9Mw",
"original_url": "https://www.youtube.com/watch?v=dCBy9z3f9Mw",
"upload_date": "20210105",
"uploader": "Internet Archive",
"uploader_id": "UCFa_X02QhJnP0FNpFAKyRRg",
"channel": "Internet Archive",
"channel_id": "UCFa_X02QhJnP0FNpFAKyRRg",
"automatic_captions": {},
"availability": "public",
"categories": [],
"extractor": "youtube",
"formats": [],
"requested_formats": [],
"subtitles": {},
"tags": [],
"thumbnails": [],
"language": null,
"...": "..."
}
Fetched using a custom HTTP video metadata API wrapper around “youtube-dl”
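A minimal sketch of the metadata lookup that such a wrapper might perform, assuming the `youtube_dl` Python package is installed. The HTTP layer is omitted, and the `fetch_metadata` function and its `extract` parameter are hypothetical names introduced here so the logic can be exercised without network access.

```python
def fetch_metadata(url, extract=None):
    """Return the metadata dict for a video page without downloading media.

    `extract` is an injectable callable (url -> dict); when it is omitted,
    youtube-dl is used to extract the metadata.
    """
    if extract is None:
        # Imported lazily so the function can be tested without youtube-dl.
        import youtube_dl

        def extract(page_url):
            with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
                # download=False returns metadata only, no media files.
                return ydl.extract_info(page_url, download=False)

    return extract(url)
```

The returned dictionary has the shape shown in the JSON above, so later steps can read fields such as `id`, `duration`, and `channel_id` from it.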
7. Step 4: Apply Curation Filters
● Exclusions & Ignore List
● Channels & Uploaders
● Video Categories
● Keywords & Tags
● Filesize, Duration, etc.
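The filter families listed above could be applied to a metadata record roughly as follows. The `CurationRules` fields, the default duration limit, and the `should_archive` function are assumptions for illustration; the actual criteria and thresholds are not given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class CurationRules:
    """Hypothetical rule set; keywords are expected in lowercase."""
    ignored_ids: set = field(default_factory=set)
    blocked_channel_ids: set = field(default_factory=set)
    blocked_categories: set = field(default_factory=set)
    blocked_keywords: set = field(default_factory=set)
    max_duration: int = 4 * 60 * 60  # seconds; an assumed limit


def should_archive(meta: dict, rules: CurationRules) -> bool:
    """Return False as soon as any exclusion rule matches the metadata."""
    if meta.get("id") in rules.ignored_ids:
        return False
    if meta.get("channel_id") in rules.blocked_channel_ids:
        return False
    if rules.blocked_categories & set(meta.get("categories") or []):
        return False
    # Keywords are matched against the title and tags, case-insensitively.
    text = " ".join([meta.get("title") or ""] + list(meta.get("tags") or [])).lower()
    if any(word in text for word in rules.blocked_keywords):
        return False
    duration = meta.get("duration")
    if duration is not None and duration > rules.max_duration:
        return False
    return True
```

Returning early on the first matching rule keeps the cheap checks (ignore list, channel) ahead of the text matching.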
8. Step 5: Archive and Create WARC Records
● Container HTML web page
● Video metadata API response
● At least one video file
● Thumbnails and CSS sprites
● Captions in many languages
● Page requisites (CSS/JS/images)
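To illustrate how one of these pieces becomes a WARC record, the sketch below serializes video metadata as an uncompressed WARC/1.0 “metadata” record using only the standard library. It is a simplified illustration, not our archiving code: the `build_metadata_record` function is hypothetical, and in practice a WARC library would handle writing, compression, and digests.

```python
import json
import uuid
from datetime import datetime, timezone


def build_metadata_record(target_uri: str, meta: dict) -> bytes:
    """Serialize video metadata as an uncompressed WARC/1.0 metadata record."""
    payload = json.dumps(meta).encode("utf-8")
    headers = [
        ("WARC-Type", "metadata"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        # Content type matching the one filtered on in the aggregation code.
        ("Content-Type", "application/json;generator-youtube-dl"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

The container page and the video file would be stored as “response” records instead, each in its own record.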
9. Video Playback in Wayback Machine
● Replace the video player with a custom JWPlayer instance
● Query a key-value database to find the video file associated with the video ID
● The one-capture-per-video policy may cause temporal violations
● Old Flash videos may require emulation or migration
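The key-value lookup in the bullets above can be sketched as follows. The dictionary stands in for the key-value database, its single entry is made up, and the assumption that the archived video file is addressed with the usual `/web/{timestamp}/{url}` replay form is ours for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Stand-in for the key-value database: video ID -> (capture timestamp, file URL).
# The entry below is made up for illustration.
VIDEO_INDEX = {
    "dCBy9z3f9Mw": ("20220330074939", "https://example.com/videos/dCBy9z3f9Mw.mp4"),
}


def video_id_from_page(page_url: str):
    """Extract the `v` query parameter from a YouTube watch URL, if present."""
    values = parse_qs(urlparse(page_url).query).get("v")
    return values[0] if values else None


def replay_video_url(page_url: str, index=VIDEO_INDEX):
    """Return a Wayback-style replay URL for the archived video file, or None."""
    entry = index.get(video_id_from_page(page_url))
    if entry is None:
        return None
    timestamp, file_url = entry
    return f"https://web.archive.org/web/{timestamp}/{file_url}"
```

Since the index holds one capture per video, the timestamp returned here can differ from the timestamp of the container page being replayed, which is the temporal violation noted above.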
10. Metadata Amendment and Aggregation
Patch missing language metadata using title with fallback to description for best-effort results
# Detect the language from the title, then retry with the description added
lang = detect_lang(title)
if lang == "en":
    lang = detect_lang(title + description)
Print only a subset of metadata fields (necessary for statistical analysis) in JSONL (i.e., one record per line) format
import json
from warcio.archiveiterator import ArchiveIterator

FIELDS = ["id", "upload_date", "duration", "..."]
# `stream` is an open WARC file; keep only youtube-dl metadata records
for record in ArchiveIterator(stream):
    if (record.rec_type == "metadata" and
            record.content_type == "application/json;generator-youtube-dl"):
        j = json.load(record.content_stream())
        print(json.dumps({k: j[k] for k in FIELDS}))
Create hourly JSONL files and save as daily Petabox items
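The language patch above can be wrapped in a function that only touches records whose language is missing. This is a sketch under assumptions: `patch_language` is a hypothetical name, `detect_lang` is passed in by the caller because the slides do not say which detector is used, and the title and description are joined with a space here.

```python
def patch_language(meta: dict, detect_lang) -> dict:
    """Fill in a missing `language` field, mirroring the logic on the slide.

    `detect_lang` is a caller-supplied function (text -> language code).
    """
    if meta.get("language"):
        return meta  # nothing to patch
    title = meta.get("title") or ""
    description = meta.get("description") or ""
    lang = detect_lang(title)
    if lang == "en":
        # Same fallback as on the slide: re-detect with the description added.
        lang = detect_lang(title + " " + description)
    # Return a copy so the original record is left unchanged.
    return {**meta, "language": lang}
```

Applying this before printing each JSONL line gives the daily summary a language value for every record, on a best-effort basis.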
11. What Are We Archiving? Lexical Insights
Statistics from Feb 24, 2022, clearly show our increased activity in archiving videos related to the Russian invasion of Ukraine
12. What Are We Archiving? Temporal Insights
The longest videos (less than 20% of all videos) account for more than 80% of the total daily duration
The longest videos can be 24 hours long
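A concentration figure like the one above can be computed from the `duration` values in a daily JSONL summary. The sketch below shows one way to do it; the `duration_share_of_longest` function is a hypothetical helper, not the dashboard's code.

```python
def duration_share_of_longest(durations, fraction=0.2):
    """Share of total duration contributed by the longest `fraction` of videos."""
    if not durations:
        return 0.0
    ordered = sorted(durations, reverse=True)
    # Always count at least one video so small samples still give a result.
    top_n = max(1, int(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:top_n]) / total if total else 0.0
```

For example, a day with one 24-hour stream and nine one-minute clips yields a share well above 0.8 for the longest 20% of videos.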
13. Summary
● Archiving videos at a large scale
● Preserving rich metadata & provenance for research & insights
● Learning and improving our systems & practices as we go
● Exploring opportunities for interoperability & standardization
● Planning to open-source our tools as they become generalizable