






















![Sampling URIs - DMOZ
1. DMOZ:Random
   - 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ:TLD - 2% for each TLD from DMOZ, or 100 URIs, whichever is greater
   - 52 TLDs: com 23,470; de 6,332; org 4,025; uk 3,309; net 2,073; it 1,775; jp 1,379; ru 1,244; fr 1,154; pl 1,062; au 764; ca 642; at 438; edu 390; cz 385; tr 334; info 319; cn 278; us 266; nz 265; es 238; ar 213; no 150; br 149; tw 141; za 118; fi 113; and 100 URIs each for ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw
3. DMOZ:Languages - 100 URIs for each language
   - 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian](https://image.slidesharecdn.com/iipc-2014-nelson-profiling-140521151040-phpapp01/75/Profiling-Web-Archives-24-2048.jpg)
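
The DMOZ:TLD rule on this slide (2% per TLD, or 100 URIs, whichever is greater) can be sketched in Python as below. This is only an illustration, not the authors' tooling: the helper names `tld_of`, `sample_size`, and `sample_by_tld` are hypothetical, and two details the slide does not state are assumptions here, namely that 2% is rounded up and that a TLD with fewer than 100 URIs contributes all of them.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def tld_of(uri):
    """Return the last label of the host name, e.g. 'uk' for example.co.uk."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def sample_size(population, percent=2, minimum=100):
    """2% of the population or `minimum`, whichever is greater.

    Assumptions: the percentage is rounded up (integer arithmetic), and the
    result is capped at the population size.
    """
    share = (population * percent + 99) // 100  # ceiling of population * percent / 100
    return min(population, max(minimum, share))


def sample_by_tld(uris, percent=2, minimum=100, seed=0):
    """Group URIs by TLD and draw a random sample from each group."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for uri in uris:
        tld = tld_of(uri)
        if tld:
            groups[tld].append(uri)
    return {
        tld: rng.sample(members, sample_size(len(members), percent, minimum))
        for tld, members in groups.items()
    }
```

Under these assumptions a TLD with 10,000 URIs yields a 200-URI sample, one with 3,000 URIs yields the 100-URI floor, and one with 40 URIs yields all 40.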














![{"Profile":{
"Name":"Taiwan Web Archive",
"URI":"http://webarchive.lib.ntu.edu.tw",
"TimeGate":
"http://mementoproxy.cs.odu.edu/tw/timegate/",
"Code":"TW",
"Age":"Tue, 15 Jul 1997 00:00:00 GMT",
"TLD":[{"tw":0.6},{"cn":0.08},{"hk":0.04},
{"eg":0.04},{"gov":0.04},{"my":0.04},
{"jp":0.04},{"kr":0.02}],
"Language":[{"zh-TW":0.5},{"zh-CN":0.25},
{"id":0.08},{"ar":0.08}],
"GrowthRate":[
{"199707":[4,4]},{"200202":[1,1]},
{"200607":[30,62]},{"200608":[20,80]},
{"200609":[5,9]},{"200612":[77,129]},
... // other values truncated
{"201308":[7,94]},{"201309":[2,94]}]
}
}
A Possible Serialization](https://image.slidesharecdn.com/iipc-2014-nelson-profiling-140521151040-phpapp01/75/Profiling-Web-Archives-39-2048.jpg)
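
As a rough illustration of how a consumer might read this serialization, the Python sketch below loads the profile and flattens its lists of single-key objects into plain dictionaries. It is a sketch under assumptions: the `flatten` helper is hypothetical, the `GrowthRate` list is left out because the slide truncates it and does not say what the two numbers in each entry mean, and the field values are copied from the slide as shown.

```python
import json
from email.utils import parsedate_to_datetime

# Profile fields as shown on the slide; "GrowthRate" is omitted because the
# slide truncates it and does not define its two-number entries.
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "URI": "http://webarchive.lib.ntu.edu.tw",
  "TimeGate": "http://mementoproxy.cs.odu.edu/tw/timegate/",
  "Code": "TW",
  "Age": "Tue, 15 Jul 1997 00:00:00 GMT",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}, {"eg": 0.04},
          {"gov": 0.04}, {"my": 0.04}, {"jp": 0.04}, {"kr": 0.02}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}, {"id": 0.08}, {"ar": 0.08}]
}}
"""


def flatten(pairs):
    """Merge a list of single-key objects into one dictionary."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


profile = json.loads(PROFILE_JSON)["Profile"]
tld_weights = flatten(profile["TLD"])
language_weights = flatten(profile["Language"])
oldest = parsedate_to_datetime(profile["Age"])  # "Age" is an HTTP-style date
top_tld = max(tld_weights, key=tld_weights.get)
```

Flattening makes lookups such as `tld_weights.get("tw", 0.0)` direct, at the cost of losing the ordering that the list form carries.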





The presentation covers the profiling of web archives. It presents a framework for analyzing and optimizing query routing for Memento aggregators by sampling archival holdings across several dimensions. It highlights the limitations of URI lookups and the need for scalable profiling techniques, and it reports findings on the coverage of different archives and the effectiveness of query routing strategies. The research aims to clarify who is archiving what and to improve access to archived materials.
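
One way an aggregator could use such profiles for query routing is sketched below in Python. This is a hypothetical illustration, not the routing strategy evaluated in the presentation: the `route` function, the 0.05 threshold, and the archive labelled `XX` with its weights are all made up for the example. Only the `TW` weights come from the serialization slide.

```python
from urllib.parse import urlsplit

PROFILES = {
    # TLD weights taken from the Taiwan Web Archive profile on the slide.
    "TW": {"tw": 0.6, "cn": 0.08, "hk": 0.04, "eg": 0.04,
           "gov": 0.04, "my": 0.04, "jp": 0.04, "kr": 0.02},
    # Hypothetical archive with invented weights, for illustration only.
    "XX": {"com": 0.7, "org": 0.2, "uk": 0.1},
}


def tld_of(uri):
    """Return the last label of the host name, e.g. 'tw' for www.ntu.edu.tw."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def route(uri, profiles, threshold=0.05):
    """List archive codes whose TLD weight for this URI meets the threshold.

    Archives are returned from highest to lowest weight. The threshold is an
    assumed tuning knob, not a value from the presentation.
    """
    tld = tld_of(uri)
    scored = [(weights.get(tld, 0.0), code) for code, weights in profiles.items()]
    return [code for score, code in sorted(scored, reverse=True) if score >= threshold]
```

With these profiles, a lookup for a `.tw` URI would be sent only to `TW`, a `.com` URI only to `XX`, and a `.kr` URI to neither, because the `TW` weight of 0.02 falls below the assumed threshold. A real aggregator would likely fall back to querying every archive in that last case.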








































