4. Monitrix
https://github.com/ukwa/monitrix
Prototype 1
Monitoring / front end for Heritrix 3
Analytics for a running crawl / probably aggregates only a single machine
Prototype 2
ELK: Elasticsearch / Logstash / Kibana
25 million log lines / 26 GB on disk / 4 vCPU / 20 GB RAM; open question: how to scale to broad (whole-domain) crawls
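To get a feel for the scaling question, the Prototype 2 figures can be extrapolated linearly. This is only a rough sketch: it assumes decimal gigabytes, assumes that index size grows in proportion to the number of log lines, and uses a purely hypothetical count of one billion log lines for a larger crawl (that number is not from the slides).

```python
# Figures from the slide: 25 million log lines took 26 GB on disk.
OBSERVED_LINES = 25_000_000
OBSERVED_BYTES = 26 * 10**9  # assumption: "GB" means decimal gigabytes

# Average on-disk footprint of one indexed log line.
BYTES_PER_LINE = OBSERVED_BYTES / OBSERVED_LINES


def estimate_disk_gb(log_lines):
    """Linearly extrapolate the on-disk size (in GB) for a number of log lines."""
    return log_lines * BYTES_PER_LINE / 10**9


# Hypothetical size of a larger crawl log; not a figure from the slides.
HYPOTHETICAL_LINES = 1_000_000_000

if __name__ == "__main__":
    print(f"bytes per log line: {BYTES_PER_LINE:.0f}")
    print(f"estimated disk for {HYPOTHETICAL_LINES} lines: "
          f"{estimate_disk_gb(HYPOTHETICAL_LINES):.0f} GB")
```

The estimate covers disk only; CPU and RAM needs depend on query load and cluster layout and are not modelled here.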
5. QA
a process for analysing the report of uncrawled websites and re-crawling them
a process for analysing discovered but uncrawled URLs
a process for checking crawls of special websites such as YouTube, Facebook, Twitter
9. CDX SERVER API
http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200
will return 2 capture results with non-200 status codes.
http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
will return 10 capture results with non-200 status codes and MIME types that are not text/html but which match a specific content digest.
https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp
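The two example queries above differ only in the limit and in the repeated filter parameter, so they can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name build_cdx_query is hypothetical and not part of the CDX server itself, and the code only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example queries on this slide.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit, filters, output="json"):
    """Build a CDX query URL; each entry in `filters` becomes its own filter= parameter."""
    params = [("url", url), ("output", output), ("limit", str(limit))]
    params += [("filter", f) for f in filters]
    # Keep '!', ':' and '/' unescaped so the result matches the slide's examples.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="!:/")


# Captures with a non-200 status code.
non_200 = build_cdx_query("archive.org", 2, ["!statuscode:200"])

# Non-200, non-HTML captures that match one specific content digest.
by_digest = build_cdx_query(
    "archive.org",
    10,
    [
        "!statuscode:200",
        "!mimetype:text/html",
        "digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV",
    ],
)

if __name__ == "__main__":
    print(non_200)
    print(by_digest)
```

A leading "!" negates a filter, and several filters are combined by repeating the parameter, which is why the parameters are passed as a list of pairs rather than a dictionary.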
12. Common Crawl
Amazon infrastructure can be used to run analytics over Common Crawl data
grows by more than ~100 TB per month
Common Crawl
https://commoncrawl.org/the-data/get-started/
Examples of uses of Common Crawl data
http://commoncrawl.org/the-data/examples/
CDX Server API with a GUI for browsing CDX files
http://index.commoncrawl.org
14. Portuguese full-text search prototype
http://www.arquivo.pt/resawdev
The login is: resaw/resaw.eu
https://sobre.arquivo.pt/news/a-first-attempt-to-archive-the-.eu-domain?set_language=en
https://netpreserveblog.wordpress.com/2015/06/03/a-first-attempt-to-archive-the-eu-domain/
Thesis
http://sobre.arquivo.pt/sobre/publicacoes-1/Documentos-acerca-do-Arquivo.pt/information-search-in-web-archives
Slides from IIPC GA 2015
http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_11_Gomes.pptx
A colleague's notes:
https://www.evernote.com/shard/s43/sh/e6e12603-ecb2-42ae-8532-67d2779b4a86/3b2162e0bcc710d847b6fa5e86cc70b2
15. UK WA full-text search prototype
Shine
Prototype
https://www.webarchive.org.uk/shine/search/advanced
Wiki
https://github.com/ukwa/shine/wiki/Specification
Code
https://github.com/ukwa/shine
Presentation by Helen Hockx-Yu
http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_08_Hockx.ppt
Video
https://www.youtube.com/watch?v=o4iIdZP4rg8
18. HTTP Archive
In addition to the content of web pages, it's important to record
how this digitized content is constructed and served. The HTTP
Archive provides this record. It is a permanent repository of web
performance information such as size of pages, failed requests,
and technologies utilized. This performance information allows
us to see trends in how the Web is built and provides a common
data set from which to conduct web performance research.
http://httparchive.org/trends.php?s=All&minlabel=Nov+15+2010&maxlabel=Sep+15+2015
http://httparchive.org/interesting.php