Video Archiving and Playback
in the Wayback Machine
Sawood Alam, Bill O'Connor, Corentin Barreau, Kenji Nagahashi, Vangelis Banos, Karim Ratib, Owen Lampe, Mark Graham
Wayback Machine, Internet Archive
sawood@archive.org
@ibnesayeed @waybackmachine @internetarchive
IIPC Web Archiving Conference (WAC), May 24, 2022, Online
Sawood Alam <@ibnesayeed> | Wayback Machine <@waybackmachine> | Internet Archive <@internetarchive>
Static Video Files vs. Dynamic Video Streams
https://web.archive.org/web/20220504160213/https://developer.mozilla.org/en-US/docs/Web/HTML/Element/video
https://web.archive.org/web/20220330074939/https://www.youtube.com/watch?v=dCBy9z3f9Mw
Archiving static video files and playing them back is easy using traditional methods.
Archiving videos delivered as dynamic streams at a large scale and playing them back reliably requires specialized solutions.
Video Archiving Pipeline
1. Identify Candidates: Select web page URLs that are likely to have embedded videos
2. Check Status: Check if the video is already archived to avoid duplicates
3. Fetch Metadata: Collect metadata of videos on selected pages
4. Filter/Curate: Apply curation filters to exclude certain videos
5. Archive: Create WARC records for web page, metadata, video, etc.
1: Identify Candidate Pages With Videos
Derive based, End of Term, Wide/Survey Crawls
Match certain patterns in the URI feeds from various sources (except those that contribute WARC files directly)
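
The pattern matching step could look roughly like the sketch below. The slide does not list the actual patterns or feed formats, so the two regular expressions, the function names, and the sample feed are assumptions made purely for illustration.

```python
import re

# Hypothetical patterns; the slide does not say which patterns are used.
VIDEO_PAGE_PATTERNS = [
    re.compile(r"^https?://(www\.)?youtube\.com/watch\?(.*&)?v=[\w-]+"),
    re.compile(r"^https?://youtu\.be/[\w-]+"),
]


def is_candidate(uri):
    """Return True if the URI matches any of the video-page patterns."""
    return any(p.search(uri) for p in VIDEO_PAGE_PATTERNS)


def select_candidates(uri_feed):
    """Yield URIs from a feed (any iterable of strings) that look like video pages."""
    for uri in uri_feed:
        uri = uri.strip()
        if uri and is_candidate(uri):
            yield uri


# Made-up feed with one non-video page in the middle.
feed = [
    "https://www.youtube.com/watch?v=dCBy9z3f9Mw",
    "https://example.org/about",
    "https://youtu.be/dCBy9z3f9Mw",
]
candidates = list(select_candidates(feed))
```

Writing the selector as a generator means a large feed can be processed line by line without loading it into memory first.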
2: Check Video Status
● Query a custom HTTP video status API
○ Maintains a database of previously archived/attempted videos
○ Serves as a seen-check service for videos
○ Provides other useful information such as datetime and source
● Ignore the video URI if it is already archived
● Update the video status database after archiving the video
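
The seen-check flow described in these bullets can be sketched with an in-memory stand-in for the status database. The real service is a custom HTTP API whose endpoints and response format are not given on the slide, so the class, method names, and record fields below are hypothetical.

```python
from datetime import datetime, timezone


class VideoStatusStore:
    """In-memory stand-in for the video status database (hypothetical)."""

    def __init__(self):
        self._seen = {}

    def lookup(self, video_uri):
        """Return the stored status record, or None if the video was never seen."""
        return self._seen.get(video_uri)

    def mark_archived(self, video_uri, source):
        """Record that the video was archived, with datetime and source."""
        self._seen[video_uri] = {
            "status": "archived",
            "datetime": datetime.now(timezone.utc).isoformat(),
            "source": source,
        }


def should_archive(store, video_uri):
    """Skip videos that are already archived; archive everything else."""
    record = store.lookup(video_uri)
    return record is None or record["status"] != "archived"


store = VideoStatusStore()
uri = "https://www.youtube.com/watch?v=dCBy9z3f9Mw"
first = should_archive(store, uri)    # not seen yet
store.mark_archived(uri, source="example-feed")
second = should_archive(store, uri)   # already archived
```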
3: Fetch Video Metadata
{
"title": "How to use the Internet Archive",
"description": "The Internet Archive (archive.org) is a nonprofit library…",
"duration": 1371,
"id": "dCBy9z3f9Mw",
"original_url": "https://www.youtube.com/watch?v=dCBy9z3f9Mw",
"upload_date": "20210105",
"uploader": "Internet Archive",
"uploader_id": "UCFa_X02QhJnP0FNpFAKyRRg",
"channel": "Internet Archive",
"channel_id": "UCFa_X02QhJnP0FNpFAKyRRg",
"automatic_captions": {},
"availability": "public",
"categories": [],
"extractor": "youtube",
"formats": [],
"requested_formats": [],
"subtitles": {},
"tags": [],
"thumbnails": [],
"language": null,
"...": "..."
}
Fetched using a custom HTTP video metadata API wrapper around “youtube-dl”
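
The custom HTTP wrapper itself is not shown in the deck. As a rough sketch, metadata like the record above can be obtained by calling youtube-dl's Python interface directly, which is an assumption about what such a wrapper might do internally. `fetch_metadata` needs the third-party `youtube_dl` package and network access; `summarize` is a hypothetical helper that works offline.

```python
def fetch_metadata(url):
    """Extract metadata for a video page without downloading the video.

    Requires the third-party youtube-dl package and network access.
    """
    import youtube_dl  # imported lazily so summarize() works without it

    with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
        return ydl.extract_info(url, download=False)


def summarize(meta, fields=("id", "title", "duration", "upload_date", "uploader")):
    """Keep only a few fields from a metadata dictionary (hypothetical helper)."""
    return {k: meta.get(k) for k in fields}


# Offline example using values from the record shown above.
sample = {
    "id": "dCBy9z3f9Mw",
    "title": "How to use the Internet Archive",
    "duration": 1371,
    "upload_date": "20210105",
    "uploader": "Internet Archive",
    "tags": [],
}
summary = summarize(sample)
```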
4: Apply Curation Filters
● Exclusions & Ignore List
● Channels & Uploaders
● Video Categories
● Keywords & Tags
● Filesize, Duration, etc.
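
A curation filter over the fetched metadata might be sketched as follows. The slide names only the filter categories, so every rule value, threshold, and function name here is a made-up placeholder rather than the actual policy.

```python
# Hypothetical rule values; the slide names the filter categories only.
RULES = {
    "ignored_ids": {"ignored-video-id"},
    "blocked_channel_ids": {"blocked-channel-id"},
    "blocked_categories": {"Example Category"},
    "blocked_keywords": {"giveaway"},
    "max_duration": 6 * 60 * 60,  # seconds
}


def passes_filters(meta, rules=RULES):
    """Return True if a video's metadata passes every curation rule."""
    if meta.get("id") in rules["ignored_ids"]:
        return False
    if meta.get("channel_id") in rules["blocked_channel_ids"]:
        return False
    if set(meta.get("categories") or []) & rules["blocked_categories"]:
        return False
    # Match keywords case-insensitively against the title and tags.
    text = " ".join([meta.get("title") or ""] + list(meta.get("tags") or [])).lower()
    if any(word in text for word in rules["blocked_keywords"]):
        return False
    duration = meta.get("duration")
    if duration is not None and duration > rules["max_duration"]:
        return False
    return True


keep = passes_filters({
    "id": "dCBy9z3f9Mw",
    "channel_id": "UCFa_X02QhJnP0FNpFAKyRRg",
    "title": "How to use the Internet Archive",
    "duration": 1371,
    "categories": [],
    "tags": [],
})
drop = passes_filters({"id": "abc", "title": "Huge GIVEAWAY today", "duration": 60})
too_long = passes_filters({"id": "def", "title": "Live stream", "duration": 86400})
```

Missing fields are treated as "no objection", so a sparse metadata record is not rejected by accident.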
5: Archive and Create WARC Records
● Container HTML web page
● Video metadata API response
● At least one video file
● Thumbnails and CSS sprites
● Captions in many languages
● Page requisites (CSS/JS/images)
Video Playback in Wayback Machine
● Replace the video player with a custom JWPlayer instance
● Query a key-value database to find the video file associated with the video ID
● One-capture-per-video policy may cause temporal violations
● Old Flash videos may require emulation or migration
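
The lookup step, and why a single capture per video can lead to temporal violations, can be illustrated with a small sketch. The dictionary stands in for the key-value database, and the file URI, function names, and returned fields are hypothetical. The two timestamps are taken from the example URLs earlier in the deck and are used only to show the arithmetic.

```python
from datetime import datetime

# Hypothetical key-value store: video ID -> (capture timestamp, archived file URI).
VIDEO_INDEX = {
    "dCBy9z3f9Mw": ("20220330074939", "https://example.org/videos/dCBy9z3f9Mw.mp4"),
}


def parse_ts(ts):
    """Parse a 14-digit Wayback-style timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def resolve_video(video_id, page_ts):
    """Return the archived file URI and its time gap from the page capture."""
    entry = VIDEO_INDEX.get(video_id)
    if entry is None:
        return None
    video_ts, file_uri = entry
    # With one capture per video, the page and the video file may have been
    # captured at different times; report the gap in whole days.
    drift_days = abs(parse_ts(page_ts) - parse_ts(video_ts)).days
    return {"file": file_uri, "drift_days": drift_days}


hit = resolve_video("dCBy9z3f9Mw", "20220504160213")
miss = resolve_video("unknown", "20220504160213")
```

A replay system could surface the gap to the user whenever it exceeds some threshold, so that a page capture is not silently paired with a video file captured at a very different time.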
Metadata Amendment and Aggregation
Patch missing language metadata using the title, with fallback to the description for best-effort results:

    # detect_lang, title, and description are not defined on the slide
    lang = detect_lang(title)
    if lang == "en":
        lang = detect_lang(title + description)

Print only a subset of metadata fields (those necessary for statistical analysis) in JSONL (i.e., one record per line) format:

    import json
    # ArchiveIterator is assumed to be the one provided by the warcio library
    from warcio.archiveiterator import ArchiveIterator

    FIELDS = ["id", "upload_date", "duration", "..."]
    # stream is an open binary WARC file; it is not defined on the slide
    for record in ArchiveIterator(stream):
        if (record.rec_type == "metadata" and
                record.content_type == "application/json;generator-youtube-dl"):
            j = json.load(record.content_stream())
            print(json.dumps({k: j.get(k) for k in FIELDS}))

Create hourly JSONL files and save them as daily Petabox items.
What Are We Archiving? Lexical Insights
Statistics from Feb 24, 2022, clearly show our increased activity in archiving videos related to the Russian invasion of Ukraine.
What Are We Archiving? Temporal Insights
The longest videos, fewer than 20% of all videos, account for more than 80% of the total daily duration.
The longest videos can be 24 hours long.
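
The kind of calculation behind this observation can be sketched as below, assuming a list of per-video durations such as the `duration` field in the aggregated metadata. The function name and the sample numbers are made up for illustration and are not the deck's data.

```python
def top_share(durations, fraction=0.2):
    """Share of total duration contributed by the longest `fraction` of videos."""
    if not durations:
        return 0.0
    ordered = sorted(durations, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Always include at least one video, even for very small inputs.
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


# Made-up durations in seconds: two very long videos and eight short ones.
durations = [86400, 43200, 600, 300, 300, 240, 180, 120, 60, 60]
share = top_share(durations)
```

In this made-up sample, the two longest videos (the top 20%) contribute well over 80% of the total duration, which mirrors the skew described above.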
Summary
● Archiving videos at a large scale
● Preserving rich metadata & provenance for research & insights
● Learning and improving our systems & practices as we go
● Exploring opportunities of interoperability & standardization
● Planning to open-source our tools as they become generalizable
