Web Archive Profiling
Through Fulltext Search
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the IIPC and NSF 1526700
Unorganized Collections
Organized Collections
Collection Understanding
Memento Aggregator
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one of my students wrote,
but I suspect the traffic you're seeing is b/c it is deployed in
http://oldweb.today/. Can you share the IP addr from where you're seeing
the traffic? I presume the requests are for Memento TimeMaps? It should
not actually be scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues
> on our wayback.
> Do you know who is running this scraper? It's not part of memento, is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator,
as the traffic has gotten really high, and also I was asked to remove an
archive due to the traffic it was causing temporarily.
I am thinking that the ability to remove source archives quickly is an
important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need
to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful, both clients and archives suffer
Memento Routing
Routing Pros & Cons
● Pros
○ Minimizes traffic and resource consumption
○ Improves throughput
● Cons
○ Upfront profile maintenance cost
○ May miss Mementos (false negatives)
Why Small Archives Matter?
● 400B+ web pages at IA do not cover everything
● The top three archives after IA produce a full TimeMap 52% of the time (AlSum, et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
While the Internet Archive was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
2 2002
1 2005
1 2008
6 2009
67 2010
17 2011
64 2012
108 2013
108 2014
186 2015
51 2016
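The per-year count produced by the shell pipeline above can also be sketched in Python. This is a minimal illustration, assuming CDXJ input in which every memento line starts with a 14-digit datetime and metadata lines start with "@" (as the grep in the slide implies) or "!". The function name and the sample lines are hypothetical and are not real MemGator output.

```python
from collections import Counter


def mementos_per_year(cdxj_lines):
    """Count memento lines per year, skipping CDXJ metadata lines."""
    years = Counter()
    for line in cdxj_lines:
        line = line.strip()
        # Assumption: metadata lines begin with "@" or "!".
        if not line or line[0] in "@!":
            continue
        # The first four characters of the datetime are the year.
        years[line[:4]] += 1
    return dict(sorted(years.items()))


# Made-up sample input, for illustration only.
sample = [
    '@context ["http://example.org/hypothetical-context"]',
    '20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}',
    '20021129102511 {"uri": "http://archive.example/20021129102511/http://example.org/"}',
    '20050303061500 {"uri": "http://archive.example/20050303061500/http://example.org/"}',
]

if __name__ == "__main__":
    for year, count in mementos_per_year(sample).items():
        print(count, year)
```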
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in
an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other
things
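As a rough illustration of how a profile can drive Memento query routing, the sketch below treats each profile as a set of SURT-style keys and routes a lookup only to archives whose profile covers the lookup key. The names `split_key`, `covers`, and `route`, the archive names, and the matching rule are assumptions made for this example, not the aggregator's actual logic.

```python
def split_key(key):
    """Split a SURT-style key such as 'uk,co,bbc,)/Images' into host and path segments."""
    host, _, path = key.partition(",)/")
    return host.split(","), [seg for seg in path.split("/") if seg]


def covers(profile_key, lookup_key):
    """Return True if a (possibly truncated) profile key covers the lookup key."""
    p_host, p_path = split_key(profile_key)
    l_host, l_path = split_key(lookup_key)
    if p_path:
        # A key with path segments must match the host exactly and prefix the path.
        return p_host == l_host and l_path[:len(p_path)] == p_path
    # A host-only key matches any host that extends it. This over-approximates
    # and can yield false positives, which a router tolerates better than misses.
    return l_host[:len(p_host)] == p_host


def route(lookup_key, profiles):
    """Return the archives whose profile suggests they may hold the lookup key."""
    return sorted(
        archive
        for archive, keys in profiles.items()
        if any(covers(key, lookup_key) for key in keys)
    )


# Hypothetical profiles for two archives.
profiles = {
    "archive-a": {"uk,co,bbc,)/"},
    "archive-b": {"org,example,)/"},
}

if __name__ == "__main__":
    print(route("uk,co,bbc,news,)/Images/Logo.png", profiles))
```

Archives that are not returned are simply not queried, which is where the traffic savings (and the risk of false negatives) come from.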
Profiling Strategies
● Sample URI Profiling (AlSum, et al., TPDL 2013)
● CDX Profiling (Alam, et al., TPDL 2015)
● Response Cache Profiling (Bornand, et al., JCDL 2016)
● Fulltext Search Profiling
Methodology
Top Nouns: time, year, people, way, man, day, thing, child, mr, government
Random Dict: analogies, unbolt, consonant, coils, stolidly, cigar, decrepit, rhododendron, cannibal, honeydew
Dynamic Words Discovery: the, angry, arab, news, service, a, source, وكالة, أنباء, العربي, الغاضب, on, politics, war, war, the, middle, east, arabic, poetry, art
Random Searcher Model (RSM)
[Flowchart: START; "Vocabulary seeding needed?" (Yes/No); Seed Vocabulary; NextWord(); Search(); Store search results; "Termination condition reached?" (Yes/No); Select a random link from the search results; Fetch the contents of the selected document; ExtractWords(); GenerateProfile(); STOP]
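One plausible reading of the Random Searcher Model steps named above is sketched below. It is a simplified loop that refreshes the vocabulary after every search and stops after a fixed number of searches. The `search` and `fetch` callables, the stopping rule, and the tiny in-memory corpus are hypothetical stand-ins for an archive's fulltext search interface, not the authors' implementation.

```python
import random
import re


def extract_words(text):
    """ExtractWords(): lowercase word tokens from a document's text."""
    return re.findall(r"\w+", text.lower())


def random_searcher(search, fetch, seed_words, max_searches, rng=None):
    """Discover URIs by repeatedly searching for words from a growing vocabulary."""
    rng = rng or random.Random()
    vocabulary = list(seed_words)      # Seed Vocabulary
    known = set(vocabulary)
    used = set()
    discovered = set()
    searches = 0
    while searches < max_searches:     # termination condition (assumed)
        candidates = [w for w in vocabulary if w not in used]
        if not candidates:
            break
        word = rng.choice(candidates)  # NextWord()
        used.add(word)
        results = list(search(word))   # Search()
        searches += 1
        discovered.update(results)     # store search results
        if results:
            # Select a random link from the results and fetch its contents.
            document = fetch(rng.choice(results))
            for w in extract_words(document):
                if w not in known:
                    known.add(w)
                    vocabulary.append(w)
    return discovered


# Hypothetical three-document corpus standing in for an archive.
docs = {"u1": "alpha beta", "u2": "beta gamma", "u3": "gamma delta"}


def demo_search(word):
    return sorted(uri for uri, text in docs.items() if word in text.split())


if __name__ == "__main__":
    found = random_searcher(demo_search, docs.__getitem__, ["alpha"], 10)
    print(sorted(found))
```

The set of discovered URIs would then be summarized into a profile (the GenerateProfile() step), which is outside the scope of this sketch.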
RSM Illustration
Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional
Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op
Education Green Technology You are here NC NET Teaching Resources Discipline Specific
English English Self Paced Modules Writing Across the Curriculum NC NET Western Center
Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College
Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College
All self paced modules can be accessed through the NC NET Blackboard server Log in with
the user name faculty and the password nc net Once connected you can view the courses
by topic or alphabetically by title English Webliography North Carolina Community College
System 2012
RSM Modes
● Static: Externally supplied static word list
● PopularityBiased: Refresh Vocabulary after
every search attempt and consider term
frequency for selecting next search keyword
● EqualOpportunity: Refresh Vocabulary
after every search attempt and ignore term
frequency for selecting next search keyword
● Conservative: Discover new words only
when the Vocabulary is exhausted
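The differences between the modes can be sketched as two small decisions: how the next search keyword is chosen, and when a document is fetched to discover new words. This is only an illustration under assumed names (`next_word`, `should_discover_words`) and an assumed term-count dictionary; Static and Conservative are shown with a uniform random keyword choice, which the slide does not specify.

```python
import random


def next_word(term_counts, used, mode, rng):
    """Pick the next search keyword from the unused vocabulary, or None if exhausted."""
    candidates = [w for w in term_counts if w not in used]
    if not candidates:
        return None
    if mode == "PopularityBiased":
        # Weight candidates by how often each term has been seen so far.
        weights = [term_counts[w] for w in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    # EqualOpportunity ignores term frequency; uniform choice is also assumed
    # here for Static and Conservative.
    return rng.choice(candidates)


def should_discover_words(mode, vocabulary_exhausted):
    """Decide whether to fetch a document and extract new words after a search."""
    if mode == "Static":
        return False                  # externally supplied word list only
    if mode == "Conservative":
        return vocabulary_exhausted   # discover only when out of words
    return True                       # refresh after every search attempt


if __name__ == "__main__":
    rng = random.Random()
    counts = {"news": 5, "art": 1}
    print(next_word(counts, set(), "PopularityBiased", rng))
    print(should_discover_words("Conservative", vocabulary_exhausted=False))
```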
Profiling Policies & Archive-It Dataset
Policy   # Keys       Example
URIR     30,800,406   uk,co,bbc,news,)/Images/Logo.png?height=80&width=200
HxP1     1,724,284    uk,co,bbc,news,)/Images
DDom     91,629       uk,co,bbc,)/
H1P0     212          uk,)/
Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40
For a detailed list of profiling policies please refer to:
Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
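A minimal sketch of how the four policies in the table could be derived from a URI is given below. It assumes SURT-style keys with a reversed, lowercased host, a dropped leading "www", a case-preserved path, and sorted query parameters, as the example keys suggest. The function names and the tiny `PUBLIC_SUFFIXES` set are hypothetical; a real implementation would use a full public suffix list and the canonicalization rules from the cited paper.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stand-in for a real public suffix list.
PUBLIC_SUFFIXES = {("uk",), ("uk", "co"), ("com",), ("org",)}


def segments(uri):
    """Return (reversed host labels, path segments, sorted query string)."""
    parts = urlsplit(uri)
    labels = parts.hostname.split(".")  # hostname is already lowercased
    if labels[0] == "www":
        labels = labels[1:]
    host = labels[::-1]
    path = [seg for seg in parts.path.split("/") if seg]
    query = "&".join(sorted(parts.query.split("&"))) if parts.query else ""
    return host, path, query


def make_key(host, path, query=""):
    key = ",".join(host) + ",)/" + "/".join(path)
    return key + "?" + query if query else key


def registered_domain(host):
    """Longest known public suffix plus one more label."""
    for n in range(len(host) - 1, 0, -1):
        if tuple(host[:n]) in PUBLIC_SUFFIXES:
            return host[:n + 1]
    return host[:2]


def profile_key(uri, policy):
    host, path, query = segments(uri)
    if policy == "URIR":   # complete URI-R key
        return make_key(host, path, query)
    if policy == "HxP1":   # all host segments, first path segment
        return make_key(host, path[:1])
    if policy == "DDom":   # registered domain only
        return make_key(registered_domain(host), [])
    if policy == "H1P0":   # first host segment (TLD), no path
        return make_key(host[:1], [])
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    uri = "https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40"
    for policy in ("URIR", "HxP1", "DDom", "H1P0"):
        print(policy, profile_key(uri, policy))
```

The URIR key printed here carries the query values of the sample URI as written on the slide.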
Searches vs Coverage
100% in 11K searches
100% in 27K searches
100% in 337K searches
100% in 1.9M searches
RSM Operation Mode Costs
Mode               Query Cost   HTTP Cost               Remarks
Static             C            C                       Suitable for specialized collections with known top keywords
PopularityBiased   C            2 * C                   Human-like model, but costly
EqualOpportunity   C            2 * C                   Human-like model, but costly
Conservative       C            C + ε (where ε << C)    Suitable for any collection; works without any supplementary materials and with very little overhead
Routing Confusion Matrix
Predicted \ Actual          Present in the Archive   Not in the Archive
Routed to the Archive       True Positive (TP)       False Positive (FP)
Not Routed to the Archive   False Negative (FN)      True Negative (TN)
Metrics derived from the matrix: Recall, Accuracy
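The two metrics can be computed from the four cells of the matrix as sketched below. The formulas are the standard definitions of recall and accuracy, which are assumed here to match the ones shown on the slide; the function name and the counts in the usage example are made up.

```python
def routing_metrics(tp, fp, fn, tn):
    """Return (recall, accuracy) from routing confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Recall: share of URIs actually present in the archive that were routed to it.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: share of all routing decisions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    return recall, accuracy


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    recall, accuracy = routing_metrics(tp=9, fp=10, fn=1, tn=80)
    print(f"recall={recall:.2f} accuracy={accuracy:.2f}")
```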
Accuracy, Recall, & Coverage (10-100%)
Panels: DMOZ, IA Wayback, UK Wayback, Memento Proxy
Low Accuracy (high FP) => Archives & Aggregator suffer
Low Recall (high FN) => Users suffer
Profile Policy Recommendations
● IF complete CDX is available THEN
○ Generate HxP1 profile
● ELSE IF fulltext search is available THEN
○ Generate DDom profile
● ELSE
○ Generate H1P0 or other smaller profiles using
Sample URIs
Note: It is possible to perform less detailed queries on more
specific (higher order) profiles, but not the other way around
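The recommendation above reads as a simple decision rule, sketched here for illustration. The function and parameter names are hypothetical; the returned strings are the policy names used on the slide.

```python
def recommend_profile_policy(has_complete_cdx, has_fulltext_search):
    """Map what an archive exposes to the recommended profiling policy."""
    if has_complete_cdx:
        return "HxP1"
    if has_fulltext_search:
        return "DDom"
    # Fall back to a smaller profile built from sample URIs.
    return "H1P0"


if __name__ == "__main__":
    print(recommend_profile_policy(has_complete_cdx=False, has_fulltext_search=True))
```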
RSM Mode Recommendations
● IF the collection is about a specific topic in a
specific language AND a suitable top
keywords list is available THEN
○ Use Static mode
● ELSE
○ Use Conservative mode
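The same rule can be written as a small function, shown here only as an illustration. The function and parameter names are hypothetical; the mode names come from the slide.

```python
def recommend_rsm_mode(topic_specific, single_language, has_top_keywords):
    """Pick an RSM mode from what is known about the collection."""
    # Static mode needs a focused, single-language collection and a keyword list.
    if topic_specific and single_language and has_top_keywords:
        return "Static"
    return "Conservative"


if __name__ == "__main__":
    print(recommend_rsm_mode(topic_specific=True, single_language=True, has_top_keywords=False))
```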
Who Knows Term Frequency for Estonian Nouns?
https://en.wiktionary.org/wiki/Category:Estonian_nouns
Future Work
● Evaluation of combination profiles such as
URI-Key along with Datetime
● Utilize archive profiles to generate a rank-ordered
list of archives
● Profiles for uses other than Memento
routing, such as site-classification-based
profiles (e.g., news, wiki, social media, blog,
etc.)
Conclusions
● Evaluated the search cost as a function of archive holdings’
coverage and profiling policy
● Developed the Random Searcher Model
● Correctly routed 80% of requests while maintaining 0.9 Recall
by discovering only 10% of the archive holdings and
generating a profile that costs less than 1% of the complete
knowledge profile

Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Sawood Alam
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
Sawood Alam
 

More from Sawood Alam (20)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento Routing
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web ArchivingInterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 

Recently uploaded

ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
SSR02
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
Renu Jangid
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 

Recently uploaded (20)

ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 

Web Archive Profiling Through Fulltext Search

  • 1. Web Archive Profiling Through Fulltext Search
      Sawood Alam and Michael L. Nelson, Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529
      Herbert Van de Sompel, Los Alamos National Laboratory, Los Alamos, NM
      David S. H. Rosenthal, Stanford University Libraries, Stanford, CA
      Supported in part by the IIPC and NSF 1526700
  • 11. Broadcasting is Bad
      From: Michael Nelson [mailto:mln@cs.odu.edu]
      Sent: Wednesday, December 02, 2015 12:33 PM
      To: Jones, Gina
      Cc: Rourke, Patrick; Grotke, Abigail
      Subject: Re: WebSciDL
      Hi Gina, I'll investigate. memgator is software that one my students wrote,
      but I suspect the traffic you're seeing is b/c it is deployed in
      http://oldweb.today/ can you share the IP addr from where you're seeing
      the traffic? I presume the requests are for Memento TimeMaps? It should
      not being actually scraping HTML pages.
      regards,
      Michael
      On Wed, 2 Dec 2015, Jones, Gina wrote:
      > Hi Michael, we have a slight configuration issue with the current OW
      > set up for our webarchives. I think, from looking at the logs, that
      > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
      > Do you know who is running this scraper? It's not part of memento is it?
      >
      > Gina Jones
      > Web Archiving Team
      > Library of Congress

      From: Ilya Kreymer <ikreymer@gmail.com>
      Date: Wed, 2 Dec 2015 10:33:56 -0800
      Subject: high traffic on oldweb!
      To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com>
      Hi Herbert, Sawood,
      Herbert: Perhaps you are lucky that I am not using the LANL aggregator,
      as the traffic has gotten really high, and also I was asked to remove an
      archive due to the traffic it was causing temporarily..
      I am thinking that ability to remove source archives quickly is an
      important aspect of an aggregator.
      Sawood: Hopefully yours will support something like this so I don't need
      to restart the container to change the archivelist ;)
      Ilya
  • 12. Availability and Overlap
      ● Archives are sparse
      ● Broadcasting is wasteful; both clients and archives suffer
  • 14. Routing Pros & Cons
      ● Pros
          ○ Minimizes traffic and resource consumption
          ○ Improves throughput
      ● Cons
          ○ Upfront profile maintenance cost
          ○ May miss Mementos (false negatives)
  • 15. Why Do Small Archives Matter?
  • 16. Why Do Small Archives Matter?
      ● 400B+ web pages at IA do not cover everything
      ● Top three archives after IA produce a full TimeMap 52% of the time (AlSum, et al., TPDL 2013)
      ● Targeted crawls
      ● Special focus archives
      ● Restricted resources
      ● Private archives
      ● Censorship
  • 17. While the Internet Archive was Down...
      $ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
            2 2002
            1 2005
            1 2008
            6 2009
           67 2010
           17 2011
           64 2012
          108 2013
          108 2014
          186 2015
           51 2016
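The shell pipeline on this slide tallies mementos per year from a CDXJ TimeMap. The same tally can be sketched in Python, assuming (as the pipeline does) that metadata lines start with "@" and that every other line begins with a 14-digit datetime. The function name and the sample lines below are made up for illustration; they are not MemGator output.

```python
from collections import Counter


def mementos_per_year(cdxj_text):
    """Count TimeMap entries per year, mirroring cut -c-4 | grep -v "^@" | uniq -c."""
    years = Counter()
    for line in cdxj_text.splitlines():
        line = line.strip()
        # Skip blank lines and metadata lines (assumed to start with "@").
        if not line or line.startswith("@"):
            continue
        # The first four characters of the datetime key are the year.
        years[line[:4]] += 1
    return dict(sorted(years.items()))


# Hypothetical sample input, shaped like a CDXJ TimeMap.
SAMPLE = """@keys ["memento_datetime_YYYYMMDDhhmmss"]
20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}
20021105000000 {"uri": "http://archive.example/20021105000000/http://example.org/"}
20050301120000 {"uri": "http://archive.example/20050301120000/http://example.org/"}
"""

if __name__ == "__main__":
    for year, count in mementos_per_year(SAMPLE).items():
        print(f"{count:7d} {year}")
```

Unlike `uniq -c`, the counter does not rely on the input being sorted by datetime.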
  • 18. Archive Profile
      ● High-level summary of an archive
      ● Predicts presence of mementos of a URI-R in an archive
      ● Provides various statistics about the holdings
      ● Small in size
      ● Publicly available
      ● Easy to update and partially patch
      ● Useful for Memento query routing and other things
  • 19. Profiling Strategies
      ● Sample URI Profiling (AlSum, et al., TPDL 2013)
      ● CDX Profiling (Alam, et al., TPDL 2015)
      ● Response Cache Profiling (Bornand, et al., JCDL 2016)
      ● Fulltext Search Profiling
  • 20. Methodology
      Top Nouns: time, year, people, way, man, day, thing, child, mr, government
      Random Dict: analogies, unbolt, consonant, coils, stolidly, cigar, decrepit, rhododendron, cannibal, honeydew
      Dynamic Words Discovery: the وكالة war angry أنباء the arab العربي middle news الغاضب east service on arabic a politics poetry source war art
  • 21. Random Searcher Model (RSM)
      [Flowchart] Nodes: START; Seed Vocabulary; NextWord(); Search(); Store search results; Termination condition reached? (Yes/No); Vocabulary seeding needed? (Yes/No); Select a random link from the search results; Fetch the contents of the selected document; ExtractWords(); GenerateProfile(); STOP
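One possible reading of the Random Searcher Model loop, as a minimal sketch rather than the authors' implementation. It assumes a Conservative-style policy (new words are discovered only when the vocabulary runs out) and stops early if a fetched document yields no unseen words. The callables `search` and `fetch_words`, the toy corpus, and all other names are hypothetical stand-ins for an archive's fulltext search and replay interfaces.

```python
import random


def random_searcher(search, fetch_words, seed_words, max_searches, rng=None):
    """Discover URIs via repeated keyword searches (illustrative sketch).

    search(word) -> list of URIs; fetch_words(uri) -> iterable of words.
    """
    rng = rng or random.Random()
    pending = list(dict.fromkeys(seed_words))  # Seed Vocabulary (deduplicated)
    seen_words = set(pending)
    discovered = set()
    last_results = []
    searches = 0
    while searches < max_searches:  # Termination condition (assumed: search budget)
        if not pending:  # Vocabulary seeding needed?
            if not last_results:
                break
            doc = rng.choice(last_results)  # Select a random link from the results
            fresh = [w for w in dict.fromkeys(fetch_words(doc))  # ExtractWords()
                     if w not in seen_words]
            if not fresh:
                break  # Simplification: give up instead of trying another link
            pending.extend(fresh)
            seen_words.update(fresh)
        word = pending.pop(rng.randrange(len(pending)))  # NextWord()
        results = list(search(word))  # Search()
        searches += 1
        if results:
            last_results = results
            discovered.update(results)  # Store search results
    return discovered  # Input to profile generation


# Toy in-memory "archive" so the sketch can run without a network.
DOCS = {
    "a.example/1": "web archive profile",
    "a.example/2": "archive routing memento",
    "b.example/": "memento aggregator",
}


def toy_search(word):
    return sorted(uri for uri, text in DOCS.items() if word in text.split())


def toy_fetch_words(uri):
    return DOCS[uri].split()


if __name__ == "__main__":
    found = random_searcher(toy_search, toy_fetch_words, ["web"], max_searches=20)
    print(sorted(found))
```

The discovered URIs would then be summarized under a profiling policy to produce the archive profile.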
  • 22. RSM Illustration Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op Education Green Technology You are here NC NET Teaching Resources Discipline Specific English English Self Paced Modules Writing Across the Curriculum NC NET Western Center Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College All self paced modules can be accessed through the NC NET Blackboard server Log in with the user name faculty and the password nc net Once connected you can view the courses by topic or alphabetically by title English Webliography North Carolina Community College System 2012
  • 23. RSM Modes
      ● Static: Externally supplied static word list
      ● PopularityBiased: Refresh Vocabulary after every search attempt and consider term frequency for selecting next search keyword
      ● EqualOpportunity: Refresh Vocabulary after every search attempt and ignore term frequency for selecting next search keyword
      ● Conservative: Discover new words only when the Vocabulary is exhausted
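The difference between PopularityBiased and EqualOpportunity comes down to how the next keyword is drawn from the vocabulary. A minimal sketch, assuming the vocabulary is kept as a word-to-frequency counter; the function name and data structure are illustrative assumptions, not taken from the authors' code. Static and Conservative differ in when the vocabulary is refreshed rather than in how a word is drawn, so they are not shown.

```python
import random
from collections import Counter


def next_word(vocabulary, mode, rng=None):
    """Pick the next search keyword from a Counter of word -> term frequency."""
    rng = rng or random.Random()
    words = list(vocabulary)
    if mode == "PopularityBiased":
        # Frequent terms are proportionally more likely to be chosen.
        weights = [vocabulary[w] for w in words]
        return rng.choices(words, weights=weights, k=1)[0]
    if mode == "EqualOpportunity":
        # Term frequency is ignored; every known word is equally likely.
        return rng.choice(words)
    raise ValueError(f"unsupported mode: {mode}")


if __name__ == "__main__":
    vocab = Counter({"archive": 8, "memento": 3, "rhododendron": 1})
    print(next_word(vocab, "PopularityBiased"))
    print(next_word(vocab, "EqualOpportunity"))
```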
  • 24. Profiling Policies & Archive-It Dataset
      Policy | # Keys     | Example
      URIR   | 30,800,406 | uk,co,bbc,news,)/Images/Logo.png?height=80&width=200
      HxP1   | 1,724,284  | uk,co,bbc,news,)/Images
      DDom   | 91,629     | uk,co,bbc,)/
      H1P0   | 212        | uk,)/
      Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40
      For a detailed list of profiling policies please refer to: Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
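The HxP1 and H1P0 rows can be reproduced with a small key generator: reverse the host labels, then keep a limited number of host and path segments. This is a simplified sketch, not a full SURT canonicalizer. It assumes a leading "www" label is dropped and it ignores the query string. DDom is omitted because finding the registered domain needs a public suffix list. The function name `hmpn_key` is hypothetical.

```python
from urllib.parse import urlsplit


def hmpn_key(uri, max_host=None, max_path=None):
    """Build a profile key with at most max_host host labels and max_path path segments.

    None means "keep all" (the "x" in policy names such as HxP1).
    """
    parts = urlsplit(uri)
    labels = (parts.hostname or "").lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]  # Assumed canonicalization step
    labels.reverse()  # news.bbc.co.uk -> uk, co, bbc, news
    if max_host is not None:
        labels = labels[:max_host]
    segments = [s for s in parts.path.split("/") if s]
    if max_path is not None:
        segments = segments[:max_path]
    return ",".join(labels) + ",)/" + "/".join(segments)


if __name__ == "__main__":
    sample = "https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40"
    print(hmpn_key(sample, max_host=None, max_path=1))  # HxP1
    print(hmpn_key(sample, max_host=1, max_path=0))     # H1P0
```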
  • 25. Searches vs Coverage
      [Figure] Plot annotations: 100% in 11K searches; 100% in 27K searches; 100% in 337K searches; 100% in 1.9M searches
  • 26. RSM Operation Mode Costs
      Mode             | Query Cost | HTTP Cost              | Remarks
      Static           | C          | C                      | Suitable for specialized collections with known top keywords
      PopularityBiased | C          | 2 * C                  | Human-like model, but costly
      EqualOpportunity | C          | 2 * C                  | Human-like model, but costly
      Conservative     | C          | C + ε (where ε << C)   | Suitable for any collection and works without any supplementary materials with very little overhead
  • 27. Routing Confusion Matrix
      Predicted \ Actual        | Present in the Archive | Not in the Archive
      Routed to the Archive     | True Positive (TP)     | False Positive (FP)
      Not Routed to the Archive | False Negative (FN)    | True Negative (TN)
      Metrics derived from the matrix: Recall, Accuracy
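Both metrics can be computed directly from the four cells of the matrix. The sketch below assumes the standard definitions, Recall = TP / (TP + FN) and Accuracy = (TP + TN) / (TP + FP + FN + TN). The function name and the counts in the usage example are made up for illustration, not results from the study.

```python
def routing_metrics(tp, fp, fn, tn):
    """Return (recall, accuracy) for a routing confusion matrix."""
    total = tp + fp + fn + tn
    # Recall: share of requests with mementos present that were routed to the archive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: share of all requests that were routed (or not routed) correctly.
    accuracy = (tp + tn) / total if total else 0.0
    return recall, accuracy


if __name__ == "__main__":
    # Hypothetical counts for one archive.
    recall, accuracy = routing_metrics(tp=90, fp=10, fn=10, tn=90)
    print(f"recall={recall:.2f} accuracy={accuracy:.2f}")
```

Many false positives lower accuracy and send needless traffic to archives; many false negatives lower recall and hide mementos from users.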
  • 28. Accuracy, Recall, & Coverage (10-100%)
      [Figure] Panels: DMOZ; IA Wayback; UK Wayback; Memento Proxy
      Low Accuracy (high FP) => Archives & Aggregator suffer
      Low Recall (high FN) => Users suffer
  • 29. Profile Policy Recommendations
      ● IF complete CDX is available THEN
          ○ Generate HxP1 profile
      ● ELSE IF fulltext search is available THEN
          ○ Generate DDom profile
      ● ELSE
          ○ Generate H1P0 or other smaller profiles using Sample URIs
      Note: It is possible to perform less detailed queries on more specific (higher order) profiles, but not the other way around
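The decision rules and the closing note can be sketched together: pick a policy from what the archive exposes, and roll a more specific profile up into a coarser one by truncating its keys. Function names are hypothetical, and the assumption that per-key counts simply add up when keys collapse is an illustration, not something stated on the slide.

```python
from collections import Counter


def recommend_policy(has_complete_cdx, has_fulltext_search):
    """Mirror the IF / ELSE IF / ELSE rules on this slide."""
    if has_complete_cdx:
        return "HxP1"
    if has_fulltext_search:
        return "DDom"
    return "H1P0"


def coarsen_key(key, max_host, max_path):
    """Truncate a key such as 'uk,co,bbc,news,)/Images' to fewer segments."""
    host, _, path = key.partition(",)/")
    labels = host.split(",")[:max_host]
    segments = [s for s in path.split("/") if s][:max_path]
    return ",".join(labels) + ",)/" + "/".join(segments)


def coarsen_profile(profile, max_host, max_path):
    """Roll a detailed key -> count profile up into a less detailed one."""
    rolled = Counter()
    for key, count in profile.items():
        rolled[coarsen_key(key, max_host, max_path)] += count  # Assumed: counts add
    return dict(rolled)


if __name__ == "__main__":
    detailed = {"uk,co,bbc,news,)/Images": 3, "uk,co,bbc,)/": 2}
    print(recommend_policy(has_complete_cdx=False, has_fulltext_search=True))
    print(coarsen_profile(detailed, max_host=1, max_path=0))
```

The reverse direction is not possible: a coarse key such as `uk,)/` carries no information about which hosts or paths it summarized.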
  • 30. RSM Mode Recommendations
      ● IF the collection is about a specific topic in a specific language AND a suitable top keywords list is available THEN
          ○ Use Static mode
      ● ELSE
          ○ Use Conservative mode
  • 31. Who Knows Term Frequency for Estonian Nouns?
      https://en.wiktionary.org/wiki/Category:Estonian_nouns
  • 32. Future Work
      ● Evaluation of combination profiles such as URI-Key along with Datetime
      ● Utilize archive profiles to generate a rank-ordered list of archives
      ● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog, etc.)
  • 33. Conclusions
      ● Evaluated the search cost as a function of archive holdings' coverage and profiling policy
      ● Developed the Random Searcher Model
      ● Correctly routed 80% of requests while maintaining 0.9 Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile