Video Archiving and Playback in the Wayback Machine

Sawood Alam, Bill O'Connor, Corentin Barreau, Kenji Nagahashi, Vangelis Banos, Karim Ratib, Owen Lampe, Mark Graham
Wayback Machine, Internet Archive
sawood@archive.org
@ibnesayeed @waybackmachine @internetarchive
IIPC Web Archiving Conference (WAC), May 24, 2022, Online

Static Video Files vs. Dynamic Video Streams
https://web.archive.org/web/20220504160213/https://developer.mozilla.org/en-US/docs/Web/HTML/Element/video
https://web.archive.org/web/20220330074939/https://www.youtube.com/watch?v=dCBy9z3f9Mw
● Archiving static video files and playing them back is easy using traditional methods
● Archiving dynamically streamed videos at a large scale and playing them back reliably requires specialized solutions
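
To make the static/dynamic distinction concrete, here is a minimal sketch that classifies a media URL by its file extension. The extension sets and the function name `classify_video_url` are illustrative assumptions, not part of the Wayback Machine tooling; `.m3u8` and `.mpd` are the usual manifest extensions for HLS and MPEG-DASH streams.

```python
import posixpath
from urllib.parse import urlparse

# Assumed, non-exhaustive extension sets for illustration only.
STATIC_EXTENSIONS = {".mp4", ".webm", ".ogv"}   # single downloadable files
MANIFEST_EXTENSIONS = {".m3u8", ".mpd"}         # HLS and MPEG-DASH manifests


def classify_video_url(url):
    """Guess whether a URL points at a static file or a stream manifest."""
    path = urlparse(url).path.lower()           # ignore query string and fragment
    extension = posixpath.splitext(path)[1]
    if extension in STATIC_EXTENSIONS:
        return "static-file"
    if extension in MANIFEST_EXTENSIONS:
        return "stream-manifest"
    return "unknown"                            # e.g., a watch page that embeds a player
```

A watch page such as the YouTube URL above falls into the "unknown" bucket, which is exactly the case that needs the specialized pipeline.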

Video Archiving Pipeline
1. Identify Candidates: Select web page URLs that are likely to have embedded videos
2. Check Status: Check if the video is already archived to avoid duplicates
3. Fetch Metadata: Collect metadata of videos on selected pages
4. Filter/Curate: Apply curation filters to exclude certain videos
5. Archive: Create WARC records for web page, metadata, video, etc.
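
The five stages can be sketched as a simple loop. This is only an outline of the control flow under the assumption that each stage is available as a callable; the function and parameter names are hypothetical.

```python
def run_pipeline(candidate_urls, is_archived, fetch_metadata, passes_filters, archive):
    """Run candidate page URLs through the remaining four stages.

    All four callables are hypothetical stand-ins for the real services.
    """
    archived = []
    for url in candidate_urls:              # stage 1 output: candidate pages
        if is_archived(url):                # stage 2: seen-check
            continue
        metadata = fetch_metadata(url)      # stage 3: collect metadata
        if not passes_filters(metadata):    # stage 4: curation filters
            continue
        archive(url, metadata)              # stage 5: write WARC records
        archived.append(url)
    return archived
```

Checking the status before fetching metadata keeps the cheaper test first, so already-archived videos never reach the more expensive stages.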

1: Identify Candidate Pages With Videos
Derive-based, End of Term, and Wide/Survey Crawls
Match certain patterns in the URI feeds from various sources (except those that contribute WARC files directly)
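
A minimal sketch of pattern matching over a URI feed follows. The two regular expressions are assumptions chosen to match the YouTube example used in this deck; the actual patterns and feed format are not given in the slides.

```python
import re

# Hypothetical patterns; a real deployment would cover many more platforms.
VIDEO_PAGE_PATTERNS = [
    re.compile(r"^https?://(www\.)?youtube\.com/watch\?(.*&)?v=[\w-]{11}"),
    re.compile(r"^https?://youtu\.be/[\w-]{11}"),
]


def candidate_pages(uri_feed):
    """Yield URIs from a feed (one URI per entry) that look like video pages."""
    for uri in uri_feed:
        uri = uri.strip()                   # feeds often carry trailing newlines
        if any(pattern.search(uri) for pattern in VIDEO_PAGE_PATTERNS):
            yield uri
```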

2: Check Video Status
● Query a custom HTTP video status API
○ Maintains a database of previously archived/attempted videos
○ Serves as a seen-check service for videos
○ Provides other useful information such as datetime and source
● Ignore the video URI if it is already archived
● Update the video status database after archiving the video
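
The seen-check logic above can be sketched with a local stand-in for the status service. The real service is an HTTP API whose interface is not described in the slides, so the class, table layout, and state names below are assumptions; an in-memory SQLite database is used only to keep the example self-contained.

```python
import sqlite3
from datetime import datetime, timezone


class VideoStatusStore:
    """Local stand-in for the video status database (hypothetical schema)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE status ("
            "video_uri TEXT PRIMARY KEY, state TEXT, source TEXT, checked_at TEXT)"
        )

    def lookup(self, video_uri):
        """Return the stored status as a dict, or None if the video is unseen."""
        row = self.db.execute(
            "SELECT state, source, checked_at FROM status WHERE video_uri = ?",
            (video_uri,),
        ).fetchone()
        if row is None:
            return None
        return dict(zip(("state", "source", "checked_at"), row))

    def record(self, video_uri, state, source):
        """Insert or overwrite the status after an archiving attempt."""
        now = datetime.now(timezone.utc).isoformat()
        self.db.execute(
            "INSERT OR REPLACE INTO status VALUES (?, ?, ?, ?)",
            (video_uri, state, source, now),
        )
        self.db.commit()


def should_archive(store, video_uri):
    """Skip videos already archived; retry unseen or previously failed ones."""
    status = store.lookup(video_uri)
    return status is None or status["state"] != "archived"
```

Whether a previously attempted but failed video should be retried is a policy choice; this sketch retries it.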

3: Fetch Video Metadata
{
  "title": "How to use the Internet Archive",
  "description": "The Internet Archive (archive.org) is a nonprofit library…",
  "duration": 1371,
  "id": "dCBy9z3f9Mw",
  "original_url": "https://www.youtube.com/watch?v=dCBy9z3f9Mw",
  "upload_date": "20210105",
  "uploader": "Internet Archive",
  "uploader_id": "UCFa_X02QhJnP0FNpFAKyRRg",
  "channel": "Internet Archive",
  "channel_id": "UCFa_X02QhJnP0FNpFAKyRRg",
  "automatic_captions": {},
  "availability": "public",
  "categories": [],
  "extractor": "youtube",
  "formats": [],
  "requested_formats": [],
  "subtitles": {},
  "tags": [],
  "thumbnails": [],
  "language": null,
  "...": "..."
}
Fetched using a custom HTTP video metadata API wrapper around “youtube-dl”
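
The custom HTTP wrapper itself is not shown in the slides. As a rough illustration of what such a wrapper might do internally, the sketch below calls the `youtube-dl` command-line tool with its `--dump-json` option, which prints the metadata as JSON without downloading the media. The function name and the injectable `run` parameter are assumptions made for testability.

```python
import json
import subprocess


def fetch_metadata(url, run=subprocess.run):
    """Return the youtube-dl metadata dictionary for a single video page.

    Assumes the `youtube-dl` executable is installed and on the PATH.
    """
    result = run(
        ["youtube-dl", "--dump-json", "--no-playlist", url],
        capture_output=True,   # collect stdout instead of printing it
        text=True,             # decode stdout as text
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

For example, `fetch_metadata("https://www.youtube.com/watch?v=dCBy9z3f9Mw")["duration"]` would return the duration in seconds when the tool and network are available.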

4: Apply Curation Filters
● Exclusions & Ignore List
● Channels & Uploaders
● Video Categories
● Keywords & Tags
● Filesize, Duration, etc.
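
A minimal sketch of how such filters might be applied to a metadata dictionary is shown below. The policy fields, the default threshold, and the function name are hypothetical; only the metadata keys (`channel_id`, `categories`, `tags`, `title`, `duration`) come from the youtube-dl output shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class CurationPolicy:
    """Hypothetical filter settings; real values are curatorial decisions."""
    ignored_channel_ids: set = field(default_factory=set)
    excluded_categories: set = field(default_factory=set)
    blocked_keywords: set = field(default_factory=set)
    max_duration_seconds: int = 4 * 60 * 60


def passes_filters(meta, policy):
    """Return True if the video should be kept for archiving."""
    if meta.get("channel_id") in policy.ignored_channel_ids:
        return False
    if policy.excluded_categories & set(meta.get("categories") or []):
        return False
    # Compare blocked keywords against lowercased tags and title words.
    words = {tag.lower() for tag in meta.get("tags") or []}
    words |= set((meta.get("title") or "").lower().split())
    if policy.blocked_keywords & words:
        return False
    return (meta.get("duration") or 0) <= policy.max_duration_seconds
```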

5: Archive and Create WARC Records
● Container HTML web page
● Video metadata API response
● At least one video file
● Thumbnails and CSS sprites
● Captions in many languages
● Page requisites (CSS/JS/images)
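
To illustrate one of these record types, the sketch below serializes a WARC 1.0 `metadata` record that carries a video metadata API response. It is written by hand with the standard library purely for illustration; in practice a WARC library would normally handle this, and the function name is an assumption. The content type matches the one used for youtube-dl metadata records elsewhere in this deck.

```python
import json
import uuid
from datetime import datetime, timezone


def build_metadata_record(target_uri, metadata):
    """Serialize a metadata dictionary as one uncompressed WARC/1.0 record."""
    payload = json.dumps(metadata).encode("utf-8")
    headers = [
        ("WARC-Type", "metadata"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/json;generator-youtube-dl"),
        ("Content-Length", str(len(payload))),   # length of the block in bytes
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```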

Video Playback in Wayback Machine
● Replace the video player with a custom JWPlayer instance
● Query a key-value database to find the video file associated with the video ID
● One-capture-per-video policy may cause temporal violations
● Old Flash videos may require emulation or migration
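
The lookup and the temporal-violation concern can be sketched as follows. A plain dictionary stands in for the key-value database, and the stored file name is hypothetical. Timestamps use the 14-digit Wayback Machine format seen in the URLs earlier in this deck.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"   # 14-digit Wayback Machine timestamp

# Hypothetical index: video ID -> (capture timestamp, archived file name).
VIDEO_INDEX = {"dCBy9z3f9Mw": ("20220330074939", "dCBy9z3f9Mw.mp4")}


def resolve_video(video_id, page_timestamp, index=VIDEO_INDEX):
    """Find the archived file for a video ID and report the temporal gap.

    With one capture per video, the video may have been captured at a very
    different time than the page being replayed; the gap makes that visible.
    """
    entry = index.get(video_id)
    if entry is None:
        return None
    video_timestamp, video_file = entry
    gap = abs(
        datetime.strptime(page_timestamp, TS_FORMAT)
        - datetime.strptime(video_timestamp, TS_FORMAT)
    )
    return {"file": video_file, "video_timestamp": video_timestamp, "temporal_gap": gap}
```

A replay system could surface a large `temporal_gap` to the user, or use it to decide whether the substitution is acceptable.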

Metadata Amendment and Aggregation
Patch missing language metadata using title with fallback to description for best-effort results:

    lang = detect_lang(title)
    if lang == "en":
        lang = detect_lang(title + description)

Print only a subset of metadata fields (necessary for statistical analysis) in JSONL (i.e., one record per line) format:

    import json
    from warcio.archiveiterator import ArchiveIterator

    FIELDS = ["id", "upload_date", "duration", "..."]
    # `stream` is an open binary file object for a WARC file
    for record in ArchiveIterator(stream):
        # Keep only the metadata records generated by youtube-dl
        if (record.rec_type == "metadata" and
                record.content_type == "application/json;generator-youtube-dl"):
            j = json.load(record.content_stream())
            print(json.dumps({k: j[k] for k in FIELDS}))

Create hourly JSONL files and save them as daily Petabox items
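
The hourly/daily grouping can be sketched as below. The item and file naming schemes are assumptions for illustration, and the result is kept in memory instead of being uploaded anywhere.

```python
import json
from collections import defaultdict
from datetime import datetime


def bucket_by_hour(records):
    """Group (capture_datetime, metadata) pairs into daily items of hourly files.

    Returns {item_name: {file_name: [jsonl_line, ...]}} with hypothetical names.
    """
    items = defaultdict(lambda: defaultdict(list))
    for captured_at, meta in records:
        item_name = captured_at.strftime("video-metadata-%Y%m%d")   # one item per day
        file_name = captured_at.strftime("%Y%m%d%H.jsonl")          # one file per hour
        items[item_name][file_name].append(json.dumps(meta, sort_keys=True))
    return items
```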

What Are We Archiving? Lexical Insights
Statistics from Feb 24, 2022, clearly show our increased activity in archiving videos related to the Russian invasion of Ukraine
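
One simple way to obtain this kind of lexical insight is to count words across video titles in the aggregated JSONL metadata. The sketch below is an assumption about the approach, not the analysis actually used for the slide; the stopword list is deliberately tiny.

```python
import json
import re
from collections import Counter

# Minimal illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "from", "and", "for"}


def title_word_counts(jsonl_lines):
    """Count lowercased title words across JSONL metadata lines."""
    counts = Counter()
    for line in jsonl_lines:
        title = json.loads(line).get("title") or ""
        words = re.findall(r"\w+", title.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts
```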

What Are We Archiving? Temporal Insights
● The longest videos, fewer than 20% of the total, account for more than 80% of the total daily duration
● The longest videos can be 24 hours long
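
The concentration of duration among the longest videos can be measured with a short calculation. This is a minimal sketch; the function name and the sample numbers in the usage note are made up for illustration.

```python
def top_duration_share(durations, top_percent=20):
    """Fraction of total duration contributed by the longest `top_percent`% of videos."""
    if not durations:
        return 0.0
    ordered = sorted(durations, reverse=True)
    # Integer arithmetic avoids floating-point surprises; keep at least one video.
    top_count = max(1, len(ordered) * top_percent // 100)
    return sum(ordered[:top_count]) / sum(ordered)
```

For instance, with one 24-hour video (86400 seconds) among four short clips of 600, 300, 300, and 200 seconds, the single longest video alone contributes 86400 of 87800 seconds.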

Summary
● Archiving videos at a large scale
● Preserving rich metadata & provenance for research & insights
● Learning and improving our systems & practices as we go
● Exploring opportunities of interoperability & standardization
● Planning to open-source our tools as they become generalizable
