Webarchiv
A memorial to the Czech internet, and more
OpenAlt 2016
Between Dream and Reality.
Open Data of the Czech Web Archive.
http://www.slideshare.net/webarchivCZ/presentations
Why do we archive the web?
Who archives the web, and how?
Metadata
Rudolf.Kreibich@nkp.cz
Head of Application Support, National Library of the Czech Republic (NK ČR)
Why do we archive the web?
“… more than 70% of the URLs in the Harvard Law
Review and 50% of the URLs in opinions of the Supreme
Court of the United States link
to a web resource that no longer exists.”
Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Jonathan Zittrain,
Kendra Albert and Lawrence Lessig. Legal Information Management / Volume 14 / Issue 02 / June 2014, pp 88-99,
DOI: http://dx.doi.org/10.1017/S1472669614000255, Published online: 12 June 2014
404 Not Found
The 404 (Not Found) status code indicates that the origin server did
not find a current representation for the target resource or is not
willing to disclose that one exists. A 404 status code does not
indicate whether this lack of representation is temporary or
permanent; the 410 (Gone) status code is preferred over 404 if the
origin server knows, presumably through some configurable means, that
the condition is likely to be permanent.
A 404 response is cacheable by default; i.e., unless otherwise
indicated by the method definition or explicit cache controls (see
Section 4.2.2 of [RFC7234]).
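The distinction the RFC text draws between 404 and 410 can be sketched in a few lines of Python. This is only an illustration for a link-rot survey: the function name and the category labels are made up for this example and do not come from the talk or from any library.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse link-rot category (labels are hypothetical)."""
    if status_code == 410:
        # 410 Gone: the server signals the condition is likely permanent.
        return "gone"
    if status_code == 404:
        # 404 Not Found: missing, but temporary vs. permanent is unknown.
        return "not-found"
    if 200 <= status_code < 300:
        return "alive"
    return "other"


for code in (200, 404, 410, 503):
    print(code, classify_status(code))
```

A survey like the one quoted above would probably also need to detect "soft 404s" (a 200 response carrying an error page), which a status code alone cannot reveal.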
✝
uri
“It is easier to find an example of a film from
1924 than a website from 1994.”
M.S. Ankerson. “Writing web histories with an eye on the analog past.” 2012. 

http://nms.sagepub.com/content/14/3/384.full.pdf+html
“Will it be possible to study our century without
web archives?”
Ian Milligan, Professor in the Department of History at the University of Waterloo.
Who archives the web, and how?
“Universal access to all knowledge.”
Brewster Kahle
IIPC | International Internet Preservation Consortium
Membership composition
2x regional libraries
32x national libraries (including the Czech Republic)
3x non-profit organizations
9x research organizations or universities
http://netpreserve.org/about-us/members
Heritrix / OpenWayback
harvesting / access
Open-source software
International community
https://github.com/iipc/openwayback
https://github.com/internetarchive/heritrix3
The dark age of JavaScript
“Brozzler is a distributed web crawler
(爬⾍) that uses a real browser (chrome
or chromium) to fetch pages and
embedded urls and to extract links.”
https://github.com/internetarchive/brozzler
Heritrix harvests 2065 URLs/s
PhantomJS harvests 172 URLs/s
=>
scale out the JS interpreters
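The arithmetic behind that conclusion can be checked quickly. The sketch below uses the two throughput figures from the slide and assumes, optimistically, that browser-based harvesting scales linearly with the number of parallel instances; the function name is hypothetical.

```python
import math

HERITRIX_URLS_PER_S = 2065   # figure from the slide
PHANTOMJS_URLS_PER_S = 172   # figure from the slide


def workers_to_match(target_rate: float, worker_rate: float) -> int:
    """Number of parallel workers needed to reach target_rate, assuming linear scaling."""
    return math.ceil(target_rate / worker_rate)


slowdown = HERITRIX_URLS_PER_S / PHANTOMJS_URLS_PER_S
workers = workers_to_match(HERITRIX_URLS_PER_S, PHANTOMJS_URLS_PER_S)
print(f"PhantomJS is about {slowdown:.1f}x slower; {workers} instances would be needed to match Heritrix")
```

In practice the scaling is unlikely to be perfectly linear, so the worker count should be read as a lower bound.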
Monthly selective harvests
Occasional topical harvests
Semi-annual harvests of the .cz domain
(in cooperation with nic.cz)
… since 2001
~ 221 TB
~ 6 billion digital objects / URLs
~ 1.2 million .cz domains
less than 1% is freely accessible
=
~ 4738 websites out of 1.2 million
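As a quick check of the "less than 1%" claim, the share can be computed from the two rounded figures on the slide. The variable names are illustrative only.

```python
ACCESSIBLE_SITES = 4738      # freely accessible websites (slide figure)
TOTAL_SITES = 1_200_000      # approximate number of .cz domains (slide figure)

share_percent = ACCESSIBLE_SITES / TOTAL_SITES * 100
print(f"{share_percent:.2f}% of harvested websites are freely accessible")
```

With these inputs the share comes out at roughly 0.4%, so the slide's "less than 1%" is a conservative rounding.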
Operations | gradual move to Infrastructure as Code
The light side of the Force
Ansible
Vagrant
Packer
Docker?
…
The dark and seductive side
VMware vCenter
IBM GPFS
http://arquivo.pt/search.jsp?l=en&query=prase
“The Common Crawl corpus contains petabytes of data collected
over the last 7 years.
It contains raw web page data, extracted metadata and text
extractions.
The Common Crawl dataset lives on Amazon S3 as part of the
Amazon Public Datasets program.
From Public Data Sets, you can download the files entirely free using
HTTP or S3.
As the Common Crawl Foundation has evolved over the years, so has
the format and metadata that accompany the crawls themselves.”
http://commoncrawl.org/the-data/get-started/
“In my view, Google does not archive;
it caches.”
me, on several occasions
metadata
WARC | ISO 28500:2009 | Under revision
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID:
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID:
WARC-Concurrent-To:
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest:
sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest:
sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length
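The header block above is a version line followed by `Name: value` pairs, so it can be read with very little code. The following is a minimal sketch, not a full WARC reader: the function name is hypothetical, the sample omits the headers whose values are blank on the slide, and folded (multi-line) header values are not handled.

```python
def parse_warc_headers(block: str) -> tuple[str, dict[str, str]]:
    """Split a WARC record header block into its version line and a header dict."""
    lines = block.strip().splitlines()
    version = lines[0].strip()
    headers: dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


SAMPLE = """\
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Truncated: length
"""

version, headers = parse_warc_headers(SAMPLE)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```

For real archive files, which are usually gzip-compressed and contain many records, an existing library such as warcio is probably a better starting point than a hand-written parser.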
Wayback CDX Server API
plain text or JSON array of the CDX data
urlkey: org,archive
timestamp: 19970126045828
original: http://www.archive.org:80
mimetype: text/html
statuscode: 200
digest: Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY
length: 1415
https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
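A short sketch of how the JSON output of a CDX server might be consumed: the first row of the returned array holds the field names, and the remaining rows are captures. The endpoint shown is the Internet Archive's public one; the helper names are hypothetical, and the sample payload is built from the field values on the slide rather than fetched over the network.

```python
import json
from urllib.parse import urlencode

# Public Internet Archive endpoint; another archive would expose its own URL.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_query(url: str, limit: int = 1) -> str:
    """Build a CDX query URL that asks for JSON output."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, "output": "json", "limit": limit})


def rows_to_dicts(payload: str) -> list[dict[str, str]]:
    """Turn the JSON array-of-arrays (header row first) into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Sample payload assembled from the values shown on the slide.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,archive", "19970126045828", "http://www.archive.org:80", "text/html",
     "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
])

print(build_query("archive.org"))
for capture in rows_to_dicts(SAMPLE):
    print(capture["timestamp"], capture["mimetype"], capture["statuscode"])
```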
WAT | Metadata for archived objects | JSON
WARC-Header-Metadata:
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
…
Payload-Metadata:
HTTP-Response-Metadata:
Headers:
Content-Language:
Content-Encoding:
...
HTML-Metadata:
Head:
Title: BBC NEWS | Africa | Namibia braces for Nujoma exit
…
Metas:
name: keywords
content: BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service
…
Links:
href: /css/screen/shared/styles.css
path: STYLE/#text
…
http://commoncrawl.org/the-data/get-started/

https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
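To show how such a record might be used, the sketch below pulls the target URI, page title, and link targets out of one WAT record. The sample dictionary is reconstructed from the flattened outline on the slide, so its exact nesting is an assumption; in particular the outer "Envelope" key does not appear on the slide and is added here on the understanding that WAT records wrap their content in it. The helper names are hypothetical.

```python
def dig(obj, *path, default=None):
    """Follow a chain of dict keys, returning default if any step is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Sample record; nesting is an assumption reconstructed from the slide.
SAMPLE_WAT = {
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
            "WARC-Type": "response",
            "WARC-Date": "2014-08-02T09:52:13Z",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "Headers": {"Content-Type": "text/html"},
                "HTML-Metadata": {
                    "Head": {
                        "Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit",
                        "Metas": [{"name": "keywords", "content": "BBC, News, BBC News"}],
                    },
                    "Links": [
                        {"path": "STYLE/#text", "href": "/css/screen/shared/styles.css"},
                    ],
                },
            },
        },
    },
}


def summarize(record: dict) -> dict:
    """Extract the target URI, title, and link targets from one WAT record."""
    html = dig(record, "Envelope", "Payload-Metadata",
               "HTTP-Response-Metadata", "HTML-Metadata", default={})
    return {
        "uri": dig(record, "Envelope", "WARC-Header-Metadata", "WARC-Target-URI"),
        "title": dig(html, "Head", "Title"),
        "links": [link["href"] for link in html.get("Links", []) if "href" in link],
    }


print(summarize(SAMPLE_WAT))
```

The defensive `dig` helper is a deliberate choice: records for non-HTML payloads would presumably lack the HTML-Metadata branch entirely.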
WAT | Metadata for archived objects | JSON
Server response
"Headers" : {
"Date" : "Sat, 02 Aug 2014 09:52:13 GMT",
"Cache-Control" : "max-age=0",
"Connection" : "close",
"Expires" : "Sat, 02 Aug 2014 09:52:13 GMT",
"Content-Type" : "text/html",
"Server" : "Apache",
"Vary" : "X-CDN",
"Set-Cookie" :
"BBC UID=15730d9c1b741c0d3942e2aca1317fbf39e57b90be68a329d375ba9d5
a8964080CCBot%2f2%2e0%20%28http%3a%2f%2fcommoncrawl%2eorg%2ffaq
%2f%29; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;"
http://commoncrawl.org/the-data/get-started/

https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
WET | Extracted full text
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:007d632a-ab5a-4c4e-afc2-c455066a82de>
WARC-Refers-To: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724
BBC NEWS | Africa | Namibia braces for Nujoma exit
[an error occurred while processing this directive]
…
Your news when you want it
News Front Page
Africa
…
HausaPortuguese Africa More Last Updated: Thursday, 22 January, 2004, 00:48 GMT
E-mail this to a friend
Printable version
…
Swapo has been careful to secure the Ovambo vote by ploughing a large slice of development funding into the region,
and the people there get more than their fair share of government positions.
For the moment, Mr Nujoma's biggest headache is land reform. Huge tracks of land are still owned by a few white
farmers and black Namibians are impatient at the slow pace of reform. White farmers say they are falling over
backwards to please the government, but Mr Pahamba says that they are only handing over poor quality land.
Meanwhile, the militant black farmer's union is threatening farm occupations similar to those in Zimbabwe. Guard dogs
…
http://commoncrawl.org/the-data/get-started/

https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
LGA | Metadata on relationships between URLs over time
ID-Map
url: https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en
surt_url: com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw
id: 294869

example

{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}
ID-Graph
timestamp: 20150209052911
id: 294869
outlink_ids: 31, 31366, 62596, 91594, 91595, …

example

{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599, …]}


https://webarchive.jira.com/wiki/display/ARS/LGA+Overview+and+Technical+Details
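The two LGA files are meant to be joined: ID-Map translates numeric ids to URLs, and ID-Graph lists the outgoing links of one capture by id. The sketch below resolves a graph line back to URLs. The first ID-Map line and the graph line follow the slide (with the outlink list shortened); the second ID-Map line, with example.org, is hypothetical and added only so that one outlink can be resolved. The function names are made up for this example.

```python
import json

ID_MAP_LINES = [
    '{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en",'
    '"surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}',
    # Hypothetical entry, not from the slide.
    '{"url":"http://example.org/","surt_url":"org,example)/","id":31}',
]

ID_GRAPH_LINES = [
    '{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366]}',
]


def load_id_map(lines):
    """Build an id -> url lookup table from ID-Map JSON lines."""
    return {rec["id"]: rec["url"] for rec in map(json.loads, lines)}


def resolve_outlinks(graph_lines, id_to_url):
    """Yield (timestamp, source_url, [target_urls]) for each ID-Graph line."""
    for rec in map(json.loads, graph_lines):
        source = id_to_url.get(rec["id"])
        targets = [id_to_url.get(i, f"<unknown id {i}>") for i in rec["outlink_ids"]]
        yield rec["timestamp"], source, targets


id_to_url = load_id_map(ID_MAP_LINES)
for timestamp, source, targets in resolve_outlinks(ID_GRAPH_LINES, id_to_url):
    print(timestamp, source, "->", targets)
```

For a full domain crawl the id map would likely not fit comfortably in a plain dict, so a key-value store or a sorted-file join would be a more realistic design.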
WANE | Extracted named entities
url: http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93
timestamp: 20141019212346
named_entities:
locations: North County, America, St. Louis County St. Louis County
Police St. Louis County, WordPress.com, Middle East, …
organizations: Twitter Facebook Google, Google, Facebook, Wal-Mart,
CNN, Bearcats, …
persons: Stell, Tom Jackson, Smith, Pamela Fillingim, Darren Wilson
Eric Fowler Eric Vickers Ferguson Ferguson, Ferguson, …
digest: sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX
Extracted with the Stanford Named Entity Recognizer (NER)
http://nlp.stanford.edu/software/CRF-NER.shtml
https://webarchive.jira.com/wiki/display/ARS/WANE+Overview+and+Technical+Details
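One simple use of WANE records is counting how often each entity is mentioned across captures. The sketch below assumes each record is a JSON object with a `named_entities` object holding lists under `locations`, `organizations`, and `persons`, as the slide's outline suggests. The first sample record uses organization strings from the slide; the second record is hypothetical.

```python
from collections import Counter

RECORDS = [
    {
        "url": "http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/",
        "timestamp": "20141019212346",
        "named_entities": {
            "organizations": ["Twitter Facebook Google", "Google", "Facebook",
                              "Wal-Mart", "CNN", "Bearcats"],
        },
    },
    {   # Hypothetical second record, not from the slide.
        "url": "http://example.org/",
        "timestamp": "20141020000000",
        "named_entities": {"organizations": ["Google", "CNN"]},
    },
]


def count_entities(records, kind: str) -> Counter:
    """Count mentions of one entity kind (e.g. 'organizations') across records."""
    counts: Counter = Counter()
    for rec in records:
        counts.update(rec.get("named_entities", {}).get(kind, []))
    return counts


org_counts = count_entities(RECORDS, "organizations")
print(org_counts.most_common(3))
```

Note that the recognizer output is noisy (for example "Twitter Facebook Google" as a single entity), so any real analysis would probably need a normalization step first.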
NameTag / CNEC 2.0 | WANE?
http://ufal.mff.cuni.cz/nametag

https://ufal.mff.cuni.cz/cnec/cnec2.0
Open nsfw model
“This repo contains code for running Not Suitable for Work
(NSFW) classification deep neural network Caffe models. “
https://github.com/yahoo/open_nsfw/blob/master/
audio2text
How to make the metadata available?
bulk data
bulk data in S3
API
web service
What to do with the metadata?
evolution of formats on the web
evolution of linking between websites
evolution of NSFW websites in the domain
evolution of the graphics-to-text ratio on the web
evolution of web technologies
…
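As an illustration of the first idea, the evolution of formats can be approximated from CDX-style data alone by counting MIME types per harvest year. The sketch assumes input rows of (timestamp, mimetype) with 14-digit timestamps as in the CDX example; apart from the first row, the sample data is hypothetical, and the function name is made up.

```python
from collections import Counter, defaultdict


def mimetypes_by_year(cdx_rows):
    """Group (timestamp, mimetype) pairs into per-year MIME type counts."""
    by_year = defaultdict(Counter)
    for timestamp, mimetype in cdx_rows:
        year = timestamp[:4]  # CDX timestamps start with the four-digit year
        by_year[year][mimetype] += 1
    return by_year


ROWS = [
    ("19970126045828", "text/html"),          # from the CDX example
    ("19970301120000", "image/gif"),          # hypothetical
    ("20150209052911", "text/html"),          # hypothetical
    ("20150209052912", "application/json"),   # hypothetical
    ("20150310101010", "text/html"),          # hypothetical
]

for year, counts in sorted(mimetypes_by_year(ROWS).items()):
    print(year, dict(counts))
```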
Web Archiving Department | ODIF | NK ČR
Head: Jaroslav Kvasnica
Curators: Marie Haškovcová, Monika Holoubková, Markéta Hrdličková
IT Operation: Rudolf.Kreibich@nkp.cz
webarchiv.cz
facebook.com/webarchivcz
slideshare.net/webarchivCZ
github.com/webarchivcz

Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 

Recently uploaded (20)

Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 

Between Dream and Reality: Open Data of the Czech Web Archive.

  • 1. Webarchiv: a memorial of the Czech internet, and more. OpenAlt 2016. Between Dream and Reality: Open Data of the Czech Web Archive. http://www.slideshare.net/webarchivCZ/presentations
  • 2. Why do we archive the web? Who archives the web, and how? Metadata. Rudolf.Kreibich@nkp.cz, head of application support, National Library of the Czech Republic (NK ČR)
  • 4.
  • 5. “… more than 70% of the URLs in the Harvard Law Review and 50% of the URLs in United States Supreme Court opinions point to a web resource that no longer exists.” Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Jonathan Zittrain, Kendra Albert and Lawrence Lessig. Legal Information Management, Volume 14, Issue 02, June 2014, pp. 88-99, DOI: http://dx.doi.org/10.1017/S1472669614000255, published online 12 June 2014
  • 6.
  • 7. 404 Not Found The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent; the 410 (Gone) status code is preferred over 404 if the origin server knows, presumably through some configurable means, that the condition is likely to be permanent. A 404 response is cacheable by default; i.e., unless otherwise indicated by the method definition or explicit cache controls (see Section 4.2.2 of [RFC7234]).
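The distinction the quoted RFC text draws between 404 and 410 matters when auditing citations for link rot: only 410 signals that the loss is likely permanent. A minimal sketch, assuming the status codes have already been collected by some link checker; the function name and the labels are hypothetical, not part of any tool mentioned in the talk.

```python
from http import HTTPStatus


def classify_link(status: int) -> str:
    """Map an HTTP status code to a coarse link-rot label (labels are illustrative)."""
    if status == HTTPStatus.GONE:
        # 410: the server says the resource is likely gone for good.
        return "gone-permanently"
    if status == HTTPStatus.NOT_FOUND:
        # 404: missing, but the server does not say for how long.
        return "missing-unknown-duration"
    if 200 <= status < 300:
        return "available"
    return "other"


for code in (200, 404, 410, 503):
    print(code, classify_link(code))
```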
  • 9. “It is easier to find a copy of a film from 1924 than a website from 1994.” M.S. Ankerson. “Writing web histories with an eye on the analog past.” 2012. http://nms.sagepub.com/content/14/3/384.full.pdf+html
  • 10. “Will it be possible to study our century without web archives?” Ian Milligan, Professor in the Department of History at the University of Waterloo.
  • 11. Who archives the web, and how?
  • 12.
  • 13. “Universal access to all knowledge.” Brewster Kahle
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. IIPC | International Internet Preservation Consortium. Membership: 2 regional libraries, 32 national libraries (including the Czech Republic), 3 non-profit organizations, 9 research organizations or universities. http://netpreserve.org/about-us/members
  • 19. Heritrix / OpenWayback: harvesting / access. Open-source software. International community. https://github.com/iipc/openwayback https://github.com/internetarchive/heritrix3
  • 20.
  • 21. The Dark Age of JavaScript. “Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome or chromium) to fetch pages and embedded urls and to extract links.” https://github.com/internetarchive/brozzler
  • 22. Heritrix harvests 2065 URL/s; PhantomJS harvests 172 URL/s => the JS interpreters have to be scaled out
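The two throughput figures on slide 22 imply a rough sizing for a browser-based crawl. The sketch below only does that arithmetic; it assumes both figures describe comparable setups and that PhantomJS workers scale linearly, neither of which the slide states. The helper name is hypothetical.

```python
import math

HERITRIX_URLS_PER_SEC = 2065   # figure from the slide
PHANTOMJS_URLS_PER_SEC = 172   # figure from the slide


def workers_needed(target_rate: float, per_worker_rate: float) -> int:
    """Smallest number of identical workers whose combined rate reaches target_rate."""
    return math.ceil(target_rate / per_worker_rate)


slowdown = HERITRIX_URLS_PER_SEC / PHANTOMJS_URLS_PER_SEC
print(f"slowdown factor: {slowdown:.1f}x")
print("workers needed:", workers_needed(HERITRIX_URLS_PER_SEC, PHANTOMJS_URLS_PER_SEC))
```

Under these assumptions, rendering JavaScript costs roughly a twelvefold drop in throughput, so matching Heritrix would take about a dozen parallel interpreters.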
  • 23.
  • 24.
  • 25. Monthly selective harvests. Occasional topical harvests. Semi-annual harvests of the .cz domain (in cooperation with nic.cz)
  • 26. … since 2001: ~221 TB, ~6 billion digital objects / URLs, ~1.2 million .cz domains
  • 27.
  • 28. less than 1% is freely accessible = ~4738 sites out of 1.2 million (about 0.4%)
  • 29.
  • 30.
  • 31. Operations | gradual move to Infrastructure as Code. The good side of the Force: Ansible, Vagrant, Packer, Docker? … The dark and seductive side: VMware vCenter, IBM GPFS
  • 33.
  • 34. “The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions. The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3. As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.” http://commoncrawl.org/the-data/get-started/
  • 35.
  • 36. “In my view, Google does not archive; it caches.” Me, on several occasions
  • 38. WARC | ISO 28500:2009 | Under revision
       WARC/1.0
       WARC-Type: response
       WARC-Date: 2014-08-02T09:52:13Z
       WARC-Record-ID:
       Content-Length: 43428
       Content-Type: application/http; msgtype=response
       WARC-Warcinfo-ID:
       WARC-Concurrent-To:
       WARC-IP-Address: 212.58.244.61
       WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
       WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
       WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
       WARC-Truncated: length
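A WARC record header is a version line followed by `Name: value` pairs, much like HTTP headers. The following is a minimal sketch of reading such a block into a dictionary using only the standard library; it is a simplification (real WARC files use CRLF line endings and are usually gzip-compressed per record) and the function name is hypothetical. The sample reuses the fields from slide 38 that carry values.

```python
def parse_warc_headers(block: str) -> tuple[str, dict[str, str]]:
    """Split a WARC header block into its version line and a name -> value dict."""
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    version = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


SAMPLE = """\
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
"""

version, headers = parse_warc_headers(SAMPLE)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```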
  • 39. Wayback CDX Server API | plain text or JSON array of the CDX data
       urlkey: org,archive
       timestamp: 19970126045828
       original: http://www.archive.org:80
       mimetype: text/html
       statuscode: 200
       digest: Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY
       length: 1415
       https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
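In its plain-text form, a CDX result is one space-separated line per capture, with the seven fields listed on slide 39. A minimal sketch of turning such a line into a typed record follows; the class and function names are hypothetical, the sample line is assembled from the slide's values, and real CDX data may contain placeholders such as `-` that this sketch does not handle.

```python
from dataclasses import dataclass
from datetime import datetime

CDX_FIELDS = ("urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length")


@dataclass
class CdxRecord:
    urlkey: str
    timestamp: datetime
    original: str
    mimetype: str
    statuscode: int
    digest: str
    length: int


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one space-separated CDX line with the seven fields above."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    f = dict(zip(CDX_FIELDS, parts))
    return CdxRecord(
        urlkey=f["urlkey"],
        # Capture time is a 14-digit YYYYMMDDhhmmss stamp.
        timestamp=datetime.strptime(f["timestamp"], "%Y%m%d%H%M%S"),
        original=f["original"],
        mimetype=f["mimetype"],
        statuscode=int(f["statuscode"]),
        digest=f["digest"],
        length=int(f["length"]),
    )


rec = parse_cdx_line(
    "org,archive 19970126045828 http://www.archive.org:80 text/html 200 "
    "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
)
print(rec.timestamp.isoformat(), rec.statuscode, rec.length)
```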
  • 40. WAT | Metadata about archived objects | JSON
       WARC-Header-Metadata:
         WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
         WARC-Type: response
         WARC-Date: 2014-08-02T09:52:13Z
         …
       Payload-Metadata:
         HTTP-Response-Metadata:
           Headers:
             Content-Language:
             Content-Encoding:
             ...
           HTML-Metadata:
             Head:
               Title: BBC NEWS | Africa | Namibia braces for Nujoma exit
               …
               Metas:
                 name: keywords
                 content: BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service
                 …
             Links:
               href: /css/screen/shared/styles.css
               path: STYLE/#text
               …
       http://commoncrawl.org/the-data/get-started/
       https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
       https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
  • 41. WAT | Metadata about archived objects | JSON | Server response
       "Headers" : {
         "Date" : "Sat, 02 Aug 2014 09:52:13 GMT",
         "Cache-Control" : "max-age=0",
         "Connection" : "close",
         "Expires" : "Sat, 02 Aug 2014 09:52:13 GMT",
         "Content-Type" : "text/html",
         "Server" : "Apache",
         "Vary" : "X-CDN",
         "Set-Cookie" : "BBC UID=15730d9c1b741c0d3942e2aca1317fbf39e57b90be68a329d375ba9d5a8964080CCBot%2f2%2e0%20%28http%3a%2f%2fcommoncrawl%2eorg%2ffaq%2f%29; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;"
       http://commoncrawl.org/the-data/get-started/
       https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
       https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
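Because WAT metadata is nested JSON, most analyses start by walking a fixed key path and tolerating records where a level is missing (non-HTML responses have no HTML metadata, for example). The sketch below shows one way to do that with the standard library. The nesting of the sample follows the two WAT slides, with HTML-Metadata placed under HTTP-Response-Metadata as an assumption; actual WAT files may wrap this in further outer objects. The helper `dig` is hypothetical.

```python
import json

SAMPLE_WAT = json.loads("""
{
  "WARC-Header-Metadata": {
    "WARC-Target-URI": "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
    "WARC-Type": "response",
    "WARC-Date": "2014-08-02T09:52:13Z"
  },
  "Payload-Metadata": {
    "HTTP-Response-Metadata": {
      "Headers": {"Content-Type": "text/html", "Server": "Apache"},
      "HTML-Metadata": {
        "Head": {"Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit"},
        "Links": [{"href": "/css/screen/shared/styles.css", "path": "STYLE/#text"}]
      }
    }
  }
}
""")


def dig(obj, *keys, default=None):
    """Follow a path of dict keys, returning default if any level is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


http_meta = ("Payload-Metadata", "HTTP-Response-Metadata")
title = dig(SAMPLE_WAT, *http_meta, "HTML-Metadata", "Head", "Title")
server = dig(SAMPLE_WAT, *http_meta, "Headers", "Server")
links = dig(SAMPLE_WAT, *http_meta, "HTML-Metadata", "Links", default=[])
print(title, server, len(links))
```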
  • 42. WET | Extracted full text
       WARC/1.0
       WARC-Type: conversion
       WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
       WARC-Date: 2014-08-02T09:52:13Z
       WARC-Record-ID: <urn:uuid:007d632a-ab5a-4c4e-afc2-c455066a82de>
       WARC-Refers-To: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
       WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
       Content-Type: text/plain
       Content-Length: 6724

       BBC NEWS | Africa | Namibia braces for Nujoma exit [an error occurred while processing this directive] … Your news when you want it News Front Page Africa … HausaPortuguese Africa More Last Updated: Thursday, 22 January, 2004, 00:48 GMT E-mail this to a friend Printable version … Swapo has been careful to secure the Ovambo vote by ploughing a large slice of development funding into the region, and the people there get more than their fair share of government positions. For the moment, Mr Nujoma's biggest headache is land reform. Huge tracks of land are still owned by a few white farmers and black Namibians are impatient at the slow pace of reform. White farmers say they are falling over backwards to please the government, but Mr Pahamba says that they are only handing over poor quality land. Meanwhile, the militant black farmer's union is threatening farm occupations similar to those in Zimbabwe. Guard dogs …
       http://commoncrawl.org/the-data/get-started/
       https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
  • 43. LGA | Metadata for relationships between URLs over time
       ID-Map
         url: https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en
         surt_url: com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw
         id: 294869
         example: {"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}
       ID-Graph
         timestamp: 20150209052911
         id: 294869
         outlink_ids: 31, 31366, 62596, 91594, 91595, …
         example: {"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599, …]}
       https://webarchive.jira.com/wiki/display/ARS/LGA+Overview+and+Technical+Details
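The LGA data comes as two line-delimited JSON files: the ID-Map assigns an integer id to each URL, and the ID-Graph lists, per capture timestamp, the ids a page links to. A minimal sketch of joining the two into (timestamp, source, target) edges follows. The records are the examples from slide 43, with the outlink list cut at the ids the slide shows; the function names are hypothetical, and ids absent from the map are reported rather than guessed.

```python
import json

ID_MAP_LINES = [
    '{"url": "https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en", '
    '"surt_url": "com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw", "id": 294869}',
]
ID_GRAPH_LINES = [
    # The slide truncates the outlink list; only the ids it shows are used here.
    '{"timestamp": "20150209052911", "id": 294869, '
    '"outlink_ids": [31, 31366, 62596, 91594, 91595, 129599]}',
]


def load_id_map(lines):
    """Build an id -> URL lookup from ID-Map records."""
    return {rec["id"]: rec["url"] for rec in map(json.loads, lines)}


def describe_edges(id_map, graph_lines):
    """Yield (timestamp, source, target) for every outlink in the ID-Graph."""
    for rec in map(json.loads, graph_lines):
        source = id_map.get(rec["id"], f"<unknown id {rec['id']}>")
        for target_id in rec["outlink_ids"]:
            target = id_map.get(target_id, f"<unknown id {target_id}>")
            yield rec["timestamp"], source, target


edges = list(describe_edges(load_id_map(ID_MAP_LINES), ID_GRAPH_LINES))
for timestamp, source, target in edges:
    print(timestamp, source, "->", target)
```

Keeping ids as integers until the last step keeps the graph compact; URLs are looked up only when a human-readable edge is needed.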
  • 44. WANE | Extracted named entities
       url: http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93
       timestamp: 20141019212346
       named_entities:
         locations: North County, America, St. Louis County St. Louis County Police St. Louis County, WordPress.com, Middle East, …
         organizations: Twitter Facebook Google, Google, Facebook, Wal-Mart, CNN, Bearcats, …
         persons: Stell, Tom Jackson, Smith, Pamela Fillingim, Darren Wilson Eric Fowler Eric Vickers Ferguson Ferguson, Ferguson, …
       digest: sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX
       Extracted with the Stanford Named Entity Recognizer (NER)
       http://nlp.stanford.edu/software/CRF-NER.shtml
       https://webarchive.jira.com/wiki/display/ARS/WANE+Overview+and+Technical+Details
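A typical first use of WANE data is counting how often each entity appears across captures. The sketch below tallies entities per type from line-delimited JSON records. It assumes each record carries a `named_entities` object with lists of names, as the slide's field layout suggests; the sample keeps only a few entity names from slide 44, and the function name is hypothetical.

```python
import json
from collections import Counter

SAMPLE_LINES = [
    json.dumps({
        "url": "http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/"
               "?like_comment=79&_wpnonce=0fc57aa499&replytocom=93",
        "timestamp": "20141019212346",
        "named_entities": {
            "locations": ["North County", "America", "Middle East"],
            "organizations": ["Google", "Facebook", "Wal-Mart", "CNN", "Bearcats"],
            "persons": ["Stell", "Tom Jackson", "Smith", "Pamela Fillingim"],
        },
        "digest": "sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX",
    }),
]


def tally_entities(lines):
    """Count entity mentions per entity type across WANE records."""
    totals = {}
    for line in lines:
        record = json.loads(line)
        for kind, names in record.get("named_entities", {}).items():
            totals.setdefault(kind, Counter()).update(names)
    return totals


totals = tally_entities(SAMPLE_LINES)
for kind, counter in sorted(totals.items()):
    print(kind, counter.most_common(3))
```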
  • 45. NameTag / CNEC 2.0 | WANE? http://ufal.mff.cuni.cz/nametag https://ufal.mff.cuni.cz/cnec/cnec2.0
  • 46.
  • 47. Open NSFW model: “This repo contains code for running Not Suitable for Work (NSFW) classification deep neural network Caffe models.” https://github.com/yahoo/open_nsfw/blob/master/
  • 48.
  • 50. NameTag / CNEC 2.0 | WANE? http://ufal.mff.cuni.cz/nametag https://ufal.mff.cuni.cz/cnec/cnec2.0
  • 51.
  • 52.
  • 53. How should the metadata be made accessible? Bulk data; bulk data in S3; API; web service
  • 54. What can be done with the metadata? Track the evolution of formats on the web; of interlinking between websites; of NSFW sites within the domain; of the graphics-to-text ratio on the web; of web technologies …
  • 55. Web Archiving Department | ODIF | NK ČR. Head: Jaroslav Kvasnica. Curators: Marie Haškovcová, Monika Holoubková, Markéta Hrdličková. IT Operations: Rudolf.Kreibich@nkp.cz. webarchiv.cz, facebook.com/webarchivcz, slideshare.net/webarchivCZ, github.com/webarchivcz