Standard Datasets in IR 
By: S.J Brenda 
13-11-2014 1
What is a dataset
A collection of documents
A data set includes:
Document set: the collection of documents
Query set: a set of information needs, i.e. a collection of questions put to the IR system
Relevance judgement set: used to calculate the relevance between the result set and the queries
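As a minimal sketch of the three parts listed above, a test collection can be held in a small Python structure. The class name `TestCollection`, its field names, and the toy documents and query are illustrative assumptions, not part of any standard dataset.

```python
from dataclasses import dataclass, field


@dataclass
class TestCollection:
    """Toy container for the three parts of an IR data set (hypothetical names)."""
    documents: dict = field(default_factory=dict)    # doc_id -> document text
    queries: dict = field(default_factory=dict)      # query_id -> information need
    judgements: dict = field(default_factory=dict)   # query_id -> set of relevant doc_ids

    def is_relevant(self, query_id, doc_id):
        """Return True if doc_id was judged relevant to query_id."""
        return doc_id in self.judgements.get(query_id, set())


# Made-up example data, for illustration only.
collection = TestCollection(
    documents={"d1": "wing flutter at supersonic speed", "d2": "report on markets"},
    queries={"q1": "supersonic wing flutter"},
    judgements={"q1": {"d1"}},
)
print(collection.is_relevant("q1", "d1"))
```

Keeping the judgements as a set per query makes the relevance lookup a constant-time membership test, which is all an evaluation loop needs.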
Standard dataset 
Test collections 
Consists of static set of documents 
A set of information needs/topics 
A set of known relevant documents for each of the information needs 
Preferably large in scale
Proper validation
Rapid growth
Great diversity
Why we need standard datasets
To test the system
To determine how well IR systems perform
To compare the performance of an IR system with that of other systems
To compare search algorithms with each other
To compare search strategies with each other
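Comparing two systems on the same test collection can be sketched as follows. The slides do not name a metric, so precision and recall over the top k results are assumed here as a common choice; the document ids and the two ranked lists are made up.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k results of one ranked list."""
    top_k = ranked[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Known relevant documents for one information need (hypothetical ids).
relevant = {"d1", "d3", "d7"}
# Ranked output of two hypothetical systems for the same query.
system_a = ["d1", "d2", "d3", "d4", "d5"]
system_b = ["d2", "d4", "d1", "d6", "d8"]

for name, run in (("A", system_a), ("B", system_b)):
    p, r = precision_recall(run, relevant, k=5)
    print(f"system {name}: precision@5={p:.2f} recall@5={r:.2f}")
```

Because both systems are scored against the same static documents, queries and judgements, the numbers are directly comparable.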
Standard Datasets in IR 
The Cranfield collection 
Text Retrieval Conference (TREC) 
Gov2 
NII Test Collection for IR system (NTCIR) 
Cross Language Evaluation Forum (CLEF) 
Reuters 
20Newsgroups 
The Cranfield collection 
This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness 
Created in the 1960s
The Cranfield corpus is a relatively small information retrieval corpus consisting of 1,400 abstracts on aeronautical engineering topics. The documents contain a total of 136,935 terms from a vocabulary of 4,632 terms (after stop word removal).
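Corpus statistics like the term total and vocabulary size quoted for Cranfield can be computed along these lines. This is only a sketch: the tokenizer, the tiny stop word list and the two sample abstracts are assumptions, so it will not reproduce the published figures.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "of", "a", "in", "on", "and", "is", "at"}


def corpus_statistics(documents):
    """Return (total term count, vocabulary size) after stop word removal."""
    total_terms = 0
    vocabulary = set()
    for text in documents:
        # Lowercase and split into alphanumeric tokens.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        kept = [t for t in tokens if t not in STOP_WORDS]
        total_terms += len(kept)
        vocabulary.update(kept)
    return total_terms, len(vocabulary)


# Made-up abstracts, for illustration only.
abstracts = [
    "The flow of air on a wing",
    "Heat transfer in the boundary layer of a wing",
]
print(corpus_statistics(abstracts))
```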
The Cranfield collection 
The Cranfield corpus also contains a set of 225 query strings with ground truth relevance judgements. 
Source: abstracts of scientific papers in aeronautics research, 1945-1962
The Cranfield collection 
Experimental assumptions 
Relevance = topical similarity 
Static information need 
All documents equally desirable 
Relevance judgments are complete and representative of the user population 
Drawbacks 
Lack of comparison between different systems 
The collection is too small
TREC 
The TREC corpus is a large data set consisting of articles taken from a variety of newswire and other sources.
This data set consists of 528,155 documents spanning a total of 165,363,765 terms from a vocabulary of 629,469 terms (after stop word removal).
Also provided are a number of query strings, each consisting of three parts: a title, a description and a narrative. Ground truth judgements are available on whether or not each document is relevant to each query.
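A sketch of how the three-part topics and the ground truth judgements might be represented in code. TREC judgements are commonly distributed as plain-text "qrels" files with four whitespace-separated fields per line (topic, iteration, document number, relevance); that layout is not stated in the slides, and the `Topic` class, the topic text and the document numbers below are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Topic:
    """One query with its three parts (hypothetical class name)."""
    number: str
    title: str
    description: str
    narrative: str


def parse_qrels(lines):
    """Map each topic number to the set of documents judged relevant."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, judgement = parts
        if int(judgement) > 0:
            relevant[topic].add(doc_id)
    return dict(relevant)


topic = Topic("701", "example title", "example description", "example narrative")
sample = ["701 0 DOC-0001 1", "701 0 DOC-0002 0", "702 0 DOC-0003 1"]
qrels = parse_qrels(sample)
print(qrels.get(topic.number))
```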
TREC 
Designed for large data collections
Retrieval methods:
Not tied to any application
Ad hoc query: as in a library environment
Routing query: filters the result set
Useful for IR system designers
Good for dedicated searchers, not novice searchers
TREC-Documents 
TREC-Relevance Judgement 
Three methods:
Judge all documents for all topics
Judge a random sample of documents
Pooling
•Divide each system's result set by topic
•Select the top-ranked documents for each topic
•Merge each topic's documents with the results of the other systems
•Sort by document number
•Remove duplicate documents
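The pooling steps listed above can be sketched as a short function. The function name `build_pool`, the `depth` parameter and the two toy runs are illustrative assumptions; the slides do not specify a pool depth.

```python
def build_pool(runs, depth):
    """Pool the top `depth` documents per topic across several systems.

    runs: list of dicts, each mapping a topic id to that system's ranked doc ids.
    Returns a dict mapping each topic id to a sorted list of unique doc ids.
    """
    pool = {}
    for run in runs:
        for topic, ranked in run.items():
            # Take the top-ranked documents and merge them into the topic's
            # pool; using a set removes duplicates across systems.
            pool.setdefault(topic, set()).update(ranked[:depth])
    # Sort each topic's pool by document id.
    return {topic: sorted(doc_ids) for topic, doc_ids in pool.items()}


# Ranked results of two hypothetical systems, already divided by topic.
run_a = {"t1": ["d3", "d1", "d9"], "t2": ["d4", "d2"]}
run_b = {"t1": ["d1", "d5", "d3"], "t2": ["d2", "d7"]}
print(build_pool([run_a, run_b], depth=2))
```

Only the pooled documents are then shown to the assessors, which is why pooling makes judging a large collection feasible.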
TREC datasets 
Contextual Suggestion Track : 
To investigate search techniques for complex information needs that are highly dependent on context and user interests. 
Clinical Decision Support Track : 
To investigate techniques for linking medical cases to information relevant for patient care 
Federated Web Search Track : 
To investigate techniques for the selection and combination of search results from a large number of real on-line web search services. 
Knowledge Base Acceleration Track : 
To develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams. 
Microblog Track: 
To examine the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter. 
Session Track : 
To develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session. 
Temporal Summarization Track : 
To develop systems that allow users to efficiently monitor the information associated with an event over time. 
Web Track: 
To explore information seeking behaviours common in general web search. 
Chemical Track : 
To develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists. 
Crowdsourcing Track : 
To provide a collaborative venue for exploring crowd sourcing methods both for evaluating search and for performing search tasks. 
Genomics Track: 
To study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last ran in TREC 2007.
Enterprise Track:
To study search over the data of an organization to complete some task. Last ran in TREC 2008.
Entity Track : 
To perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search. 
Cross-Language Track : 
To investigate the ability of retrieval systems to find documents topically regardless of source language. 
FedWeb Track :
To select the best resources to forward a query to, and merge the results so that the most relevant are at the top.
Filtering Track :
To make a binary decision on whether to retrieve each new incoming document, given a stable information need.
HARD Track :
To achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
Interactive Track : 
To study user interaction with text retrieval systems. 
Legal Track : 
To develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections.
Medical Records Track :
To explore methods for searching unstructured information found in patient medical records.
Novelty Track :
To investigate systems' abilities to locate new (i.e., non-redundant) information.
Question Answering Track :
To achieve more information retrieval than just document retrieval by answering factoid, list and definition-style questions.
Robust Retrieval Track :
To focus on individual topic effectiveness.
Relevance Feedback Track :
To further the deep evaluation of relevance feedback processes.
Spam Track :
To provide a standard evaluation of current and proposed spam filtering approaches.
Terabyte Track :
To investigate whether and how the IR community can scale traditional IR test-collection-based evaluation to significantly larger collections.
Video Track :
To research automatic segmentation, indexing, and content-based retrieval of digital video. In 2003, this track became its own independent evaluation named TRECVID.
Advantages
Larger collections
Better results
Usable in many tasks
Filtering
Web search
Video retrieval
CLEF
NTCIR
GOV2
Drawback: complete judgement is impossible
Pooling can overcome this
Gov2 
A TREC corpus consisting of 25 million documents from US government domains and government-related websites
TREC topics 701-850 are used for evaluation
One of the largest web collections easily available for research purposes
NTCIR - NII Test Collection for IR system
Has built various test collections of similar size to the TREC collections
Focuses on East Asian languages and cross-language information retrieval
Queries are in one language; the document collections are in more than one language
CLIR: IR/CLIR test collection 
The CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean and English (CJKE), such as
Multilingual CLIR (MLIR)
Bilingual CLIR (BLIR)
Single Language (Monolingual) IR (SLIR)
CLQA (Cross Language Q&A data Test Collection)
In the CLQA task, the following subtasks were conducted:
1. Japanese to English (J-E) subtask
Find answers to Japanese questions in English documents.
2. Chinese to English (C-E) subtask
Find answers to Chinese questions in English documents.
3. English to Japanese (E-J) subtask
Find answers to English questions in Japanese documents.
4. English to Chinese (E-C) subtask
Find answers to English questions in Chinese documents.
5. Chinese to Chinese (C-C) subtask
Find answers to Chinese questions in Chinese documents.
ACLIA(Advanced Cross-Lingual Information Retrieval and Question Answering) 
ACLIA test collection can be used for experiments of Complex Question Answering and Information Retrieval between Chinese (Simplified (CS), Traditional (CT)), Japanese (JA), English (EN) such as 
CCLQA (Complex Cross-Lingual Question Answering) 
Cross-Lingual Question Answering (EN-JA/EN-CS/EN-CT subtask) 
Monolingual Question Answering (JA-JA, CS-CS, and CT-CT subtask) 
IR4QA (Information Retrieval for Question Answering) 
Cross-Lingual Information Retrieval (EN-JA/EN-CS/EN-CT subtask)* 
Monolingual Information Retrieval (JA-JA, CS-CS, and CT-CT subtask) 
CQA (Community QA Test Collection)
This test collection can be used to evaluate the quality of answers on a CQA site. It consists of the following data:
1,500 questions extracted from Yahoo! Chiebukuro data version 1
Assessment results by four assessors
ID lists, best answer lists, category information, etc.
CROSSLINK (Cross-lingual Link Discovery)
The Crosslink test collection can be used for experiments in cross-lingual link discovery (CLLD), linking English documents to CJK (Chinese, Japanese and Korean) documents, such as
English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask
INTENT (INTENT-1)
The INTENT (INTENT-1) test collections are the following:
(a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection
(b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection
(c) NTCIR-9 INTENT Chinese Document Ranking Test Collection
(d) NTCIR-9 INTENT Japanese Document Ranking Test Collection
Subtopic Mining Subtask: given a query, return a ranked list of "subtopic strings."
Document Ranking Subtask: given a query, return a diversified list of web pages.
1CLICK
The 1CLICK (1CLICK-1) test collection was used at the NTCIR-9 1CLICK (One Click Access) task.
The NTCIR-9 1CLICK task was: given a Japanese query, return a 500-character or 140-character textual output that aims to satisfy the user as quickly as possible. Important pieces of information should be prioritized and the amount of text the user has to read should be minimized.
Math
The NTCIR Math Task aims to explore search methods tailored to mathematical content through the design of suitable search tasks and the construction of evaluation datasets.
Math Retrieval Subtask: Given a document collection, retrieve relevant mathematical formulae or documents for a given query.
Math Understanding Subtask: Extract natural language descriptions of mathematical expressions in a document for their semantic interpretation.
Relevance judgments (added Oct 14, 2014)
MuST ("summary and visualization of trend information" test collection)
The MuST corpus (untagged) consists of the 581 articles (27 topics), selected from the two years of 1999, that were used to create the task data.
The list of articles is assumed to have been obtained by information retrieval.
The article content is annotated with the important sentences extracted for the summary and the corresponding semantic processing results.
Opinion (Opinion Analysis Task Test Collection)
There are 32 topics, ranging from 1998 to 2001, each in English, Chinese, and Japanese.
The annotations assign opinion tags to sentences in the selected documents that are relevant to the topics.
The annotated documents are distributed separately in a sentence-segmented format that aligns with the sentence numbering in the CSV annotation file.
MOAT (Multilingual Opinion Analysis Test Collection)
The MOAT test collection can be used for experiments in multilingual opinion analysis in Japanese, English, and Chinese (simplified/traditional) (CstJE), such as
Opinion Judgement
Polarity (Positive/Negative/Neutral) Judgement
Opinion Holder Identification
Opinion Target Identification
Relevance Judgement
PATENT(IR Test Collection) 
•The collection consists of Document data (Japanese patent applications 1993-1997 and Patent Abstracts of Japan 1993-1997), 101 Japanese search topics (34 topics were translated into English, Simplified and Traditional Chinese, and Korean, respectively), and Relevance judgments for each search topic. 
•Japanese patent applications published in 1993-1997 are used for the Retrieval task. 
•Each search topic is a claim extracted from Japanese patent applications. 
Q&A data Test Collection 
The collection includes: 
Document data: Mainichi Newspaper articles 1998-2001 
Task data: 200 questions (in Japanese), system output, human output and sample answers
SpokenDoc-2 (IR for Spoken Documents)
Lecture speech, spoken passage, conversation detection
The test collection includes:
Spoken Term Detection (STD) task
Inexistent Spoken Term Detection (iSTD) task
Spoken Content Retrieval (SCR) task
Scoring tool for the STD and iSTD tasks
Scoring tool for the SCR task
IR and Term Extraction/Role Analysis Test Collections 
The IR test collection includes:
(1) Document data: author abstracts from the Academic Conference Paper Database (1988-1997), i.e. author abstracts of papers presented at academic conferences hosted by any of 65 academic societies in Japan; about 330,000 documents, more than half of which are English-Japanese paired
(2) 83 search topics (in Japanese)
(3) Relevance judgements
The collection can be used for experiments in Japanese text retrieval and in CLIR, searching either English documents or Japanese-English documents with Japanese topics.
The Term Extraction test collection includes a tagged corpus built from 2,000 Japanese documents selected from the above IR test collection.
SUMM (Text Summarization Test Collection)
The collection includes:
Document data: Japanese newspaper articles from the Mainichi Newspaper (1998-1999)
Model summaries, consisting of:
Single-document summaries: for each of 60 documents, 7 types of summary of different lengths, produced with different strategies, were prepared by 3 analysts
Multi-document summaries: for each of 50 document collections, summaries of 2 different lengths were prepared by 3 analysts
WEB (IR Test Collection)
The WEB test collection consists of "Document Data", a collection of text data processed from crawled documents provided mainly on web servers in Japan, and "Task Data", a collection of search topics and the relevance judgments of the documents.
"Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi2)'. "Document Data", named 'NW1000G-04', consists of approximately 100 million web documents, about 1,400 GB in size.
The CLEF Test Suite 
The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003: 
Multilingual text retrieval 
Bilingual text retrieval 
Monolingual text retrieval 
Domain-specific text retrieval 
The CLEF Test Suite 
The CLEF Test Suite is composed of:
• The multilingual document collections
• Step-by-step documentation on how to perform a system evaluation (in English)
• Tools for results computation
• Multilingual sets of topics
• Multilingual sets of relevance assessments
• Guidelines for participants (in English)
• Tables of the results obtained by the participants
• Publications
The CLEF Test Suite 
Multilingual corpora:
• English
• French
• German
• Italian
• Spanish
• Dutch
• Swedish
• Finnish
• Russian
• Portuguese
CLEF AdHoc-News Test Suites (2004-2008) 
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
CLEF Domain Specific Test Suites (2004-2008) 
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
CLEF Question Answering Test Suites (2003-2008) 
The CLEF Question Answering Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008. 
This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents. 
Reuters Corpora 
Reuters is the largest international text and television news agency. Its editorial division produces 11,000 stories a day in 23 languages. 
Stories are both distributed in real time and made available via online databases and other archival products. 
Datasets 
Reuters-21578 : used in text classification 
RCV1 
RCV2 
TRC2 
RCV1 
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. 
Known as "Reuters Corpus, Volume 1" or RCV1, it is significantly larger than the older Reuters-21578 collection.
RCV1 is an archive of over 800,000 manually categorized newswire stories.
RCV1 is drawn from one of Reuters' online databases. It was intended to consist of all and only English language stories produced by Reuters journalists between August 20, 1996, and August 19, 1997.
RCV2 
Multilingual corpus, 1996-08-20 to 1997-08-19
Contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
Thomson Reuters Text Research Collection (TRC2) 
The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 to 2009-02-28 
It was initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (which contains the results of a large blog crawl carried out at the University of Glasgow).
20 Newsgroups 
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. 
It was originally collected by Ken Lang for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
20 Newsgroups 
Class                     # train docs   # test docs   Total # docs
alt.atheism                        480           319            799
comp.graphics                      584           389            973
comp.os.ms-windows.misc            572           394            966
comp.sys.ibm.pc.hardware           590           392            982
comp.sys.mac.hardware              578           385            963
comp.windows.x                     593           392            985
misc.forsale                       585           390            975
rec.autos                          594           395            989
rec.motorcycles                    598           398            996
rec.sport.baseball                 597           397            994
rec.sport.hockey                   600           399            999
sci.crypt                          595           396            991
sci.electronics                    591           393            984
sci.med                            594           396            990
sci.space                          593           394            987
soc.religion.christian             598           398            996
talk.politics.guns                 545           364            909
talk.politics.mideast              564           376            940
talk.politics.misc                 465           310            775
talk.religion.misc                 377           251            628
Total                            11293          7528          18821
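As a quick consistency check, the per-class counts from the table above can be summed and compared with its totals row. The numbers are copied from the table; the variable names are illustrative.

```python
# (train, test) document counts per class, copied from the table above.
counts = {
    "alt.atheism": (480, 319),
    "comp.graphics": (584, 389),
    "comp.os.ms-windows.misc": (572, 394),
    "comp.sys.ibm.pc.hardware": (590, 392),
    "comp.sys.mac.hardware": (578, 385),
    "comp.windows.x": (593, 392),
    "misc.forsale": (585, 390),
    "rec.autos": (594, 395),
    "rec.motorcycles": (598, 398),
    "rec.sport.baseball": (597, 397),
    "rec.sport.hockey": (600, 399),
    "sci.crypt": (595, 396),
    "sci.electronics": (591, 393),
    "sci.med": (594, 396),
    "sci.space": (593, 394),
    "soc.religion.christian": (598, 398),
    "talk.politics.guns": (545, 364),
    "talk.politics.mideast": (564, 376),
    "talk.politics.misc": (465, 310),
    "talk.religion.misc": (377, 251),
}

train_total = sum(train for train, _ in counts.values())
test_total = sum(test for _, test in counts.values())
total = train_total + test_total

# Share of documents held out for testing, as a fraction of all documents.
test_share = test_total / total
print(train_total, test_total, total, round(test_share, 2))
# → 11293 7528 18821 0.4
```

The sums agree with the totals row, and the split works out to roughly 60% training and 40% test documents.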

Presentation of Gantt Chart (System Analysis and Design)Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)
 
Introduction.pptx
Introduction.pptxIntroduction.pptx
Introduction.pptx
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & Statistics
 
Information retrieval
Information retrievalInformation retrieval
Information retrieval
 
An Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringAn Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using Clustering
 
Comparative analysis of relative and exact search for web information retrieval
Comparative analysis of relative and exact search for web information retrievalComparative analysis of relative and exact search for web information retrieval
Comparative analysis of relative and exact search for web information retrieval
 
Ontology Based Approach for Semantic Information Retrieval System
Ontology Based Approach for Semantic Information Retrieval SystemOntology Based Approach for Semantic Information Retrieval System
Ontology Based Approach for Semantic Information Retrieval System
 
Research data life cycle
Research data life cycleResearch data life cycle
Research data life cycle
 
Text Retrieval Conferences (TREC)
Text Retrieval Conferences (TREC)Text Retrieval Conferences (TREC)
Text Retrieval Conferences (TREC)
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
Digitisation and institutional repositories 2
Digitisation and institutional repositories 2Digitisation and institutional repositories 2
Digitisation and institutional repositories 2
 
Clicking Past Google
Clicking Past GoogleClicking Past Google
Clicking Past Google
 
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
A Topic map-based ontology IR system versus Clustering-based IR System: A Com...
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Standard Datasets in Information Retrieval

  • 1. Standard Datasets in IR. By: S.J Brenda, 13-11-2014
  • 2. What is a dataset? A collection of documents. A data set includes: Document set: a collection of documents. Query set: a set of information needs, i.e. a collection of questions asking the IR system for results. Relevance judgement set: methods for calculating the relevance between result sets and queries.
  • 3. Standard dataset (test collection). Consists of: a static set of documents; a set of information needs/topics; a set of known relevant documents for each of the information needs. Preferably large in scale, with proper validation, rapid growth and great diversity.
  • 4. Why we need them: to test the system; to determine how well IR systems perform; to compare the performance of an IR system with that of other systems; to compare search algorithms with each other; to compare search strategies with each other.
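As a rough illustration of how a test collection is used to determine how well a system performs, the sketch below scores one result list against the known relevant documents for a query. Precision and recall are assumed here as the measures, and the function name `precision_recall` and the toy document ids are hypothetical; none of them come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Score one result list against the known relevant documents.

    retrieved: list of document ids returned by the system (assumed input)
    relevant:  set of document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Guard against empty inputs so the function never divides by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy relevance judgement set: query id -> set of relevant document ids.
qrels = {"q1": {"d1", "d4", "d9"}}
# Toy output of one system for the same query.
run = {"q1": ["d1", "d2", "d3", "d4"]}

p, r = precision_recall(run["q1"], qrels["q1"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Running the same function over the output of two systems on the same queries and judgements gives numbers that can be compared directly, which is the point of keeping the document set and judgements static.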
  • 5. Standard datasets in IR: the Cranfield collection; Text Retrieval Conference (TREC); Gov2; NII Test Collection for IR Systems (NTCIR); Cross Language Evaluation Forum (CLEF); Reuters; 20 Newsgroups.
  • 6. The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness. Created in late 1960. The Cranfield corpus is a relatively small information retrieval corpus consisting of 1,400 abstracts on aeronautical engineering topics. The documents contain a total of 136,935 terms from a vocabulary of size 4,632 (after stop word removal).
  • 7. The Cranfield collection. The Cranfield corpus also contains a set of 225 query strings with ground-truth relevance judgements. Source: abstracts of scientific papers from the aeronautics research field, 1945-1962.
  • 8. The Cranfield collection. Experimental assumptions: relevance = topical similarity; static information need; all documents equally desirable; relevance judgments are complete and representative of the user population. Drawbacks: lack of comparison between different systems; a small collection is not enough.
  • 9. TREC. The TREC corpus is a large data set consisting of articles taken from a variety of newswire and other sources. This data set consists of 528,155 documents spanning a total of 165,363,765 terms from a vocabulary of size 629,469 (after stop word removal). Also provided are a number of query strings, each consisting of three parts: a title, a description and a narrative. Ground-truth judgements are available concerning whether or not each of the documents is relevant to each of the queries.
  • 10. TREC. Designed for large data collections. Retrieval method: not tied to any application. Ad hoc query: in a library environment. Routing query: filter the result set. Useful for IR system designers. Good for dedicated searchers, not novice searchers.
  • 12. TREC relevance judgement. Three methods: on all documents, for all topics; on a random sample of documents; pooling. Pooling: divide each result set by topic; select the top-ranked documents; for each topic, merge with the results of the other systems; sort by document number; remove duplicate documents.
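The pooling steps listed on slide 12 can be sketched as follows. This is a minimal illustration under stated assumptions, not TREC's actual tooling: the function name `build_pools`, the shape of the `runs` dictionary and the `depth` parameter are all hypothetical.

```python
from collections import defaultdict


def build_pools(runs, depth=100):
    """Build one judging pool per topic from several systems' rankings.

    runs:  dict mapping system name -> dict mapping topic id -> ranked
           list of document ids (best first); an assumed input format
    depth: how many top-ranked documents to take from each system
    """
    pools = defaultdict(set)
    for ranked_by_topic in runs.values():
        for topic, ranking in ranked_by_topic.items():
            # Take the top-ranked documents and merge them with the other
            # systems' results; using a set removes duplicates.
            pools[topic].update(ranking[:depth])
    # Sort by document id so assessors do not see any system's ranking.
    return {topic: sorted(docs) for topic, docs in pools.items()}


runs = {
    "sysA": {"701": ["d3", "d1", "d7"]},
    "sysB": {"701": ["d1", "d9", "d2"]},
}
print(build_pools(runs, depth=2))
```

Only the documents in each pool are then shown to assessors; documents outside the pool are usually treated as not relevant, which is why complete judgement is not required.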
  • 13. TREC datasets. Contextual Suggestion Track: To investigate search techniques for complex information needs that are highly dependent on context and user interests. Clinical Decision Support Track: To investigate techniques for linking medical cases to information relevant for patient care. Federated Web Search Track: To investigate techniques for the selection and combination of search results from a large number of real on-line web search services. Knowledge Base Acceleration Track: To develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.
  • 14. Microblog Track: To examine the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter. Session Track: To develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session. Temporal Summarization Track: To develop systems that allow users to efficiently monitor the information associated with an event over time. Web Track: To explore information seeking behaviours common in general web search.
  • 15. Chemical Track: To develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists. Crowdsourcing Track: To provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks. Genomics Track: To study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last ran in TREC 2007. Enterprise Track: To study search over the data of an organization to complete some task. Last ran in TREC 2008.
  • 16. Entity Track: To perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search. Cross-Language Track: To investigate the ability of retrieval systems to find documents topically regardless of source language. FedWeb Track: To select the best resources to forward a query to, and merge the results so that the most relevant are on top. Filtering Track: To binarily decide retrieval of new incoming documents given a stable information need. HARD Track: To achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
  • 17. Interactive Track: To study user interaction with text retrieval systems. Legal Track: To develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections. Medical Records Track: To explore methods for searching unstructured information found in patient medical records. Novelty Track: To investigate systems' abilities to locate new (i.e., non-redundant) information.
  • 18. Question Answering Track: To achieve more information retrieval than just document retrieval by answering factoid, list and definition-style questions. Robust Retrieval Track: To focus on individual topic effectiveness. Relevance Feedback Track: To further deep evaluation of relevance feedback processes. Spam Track: To provide a standard evaluation of current and proposed spam filtering approaches. Terabyte Track: To investigate whether/how the IR community can scale traditional IR test-collection-based evaluation to significantly large collections. Video Track: To research automatic segmentation, indexing, and content-based retrieval of digital video. In 2003, this track became its own independent evaluation named TRECVID.
  • 19. Advantages: larger collections; better results; usable in many tasks: filtering, web search, video retrieval, CLEF, NTCIR, GOV2. Drawback: complete judgement is impossible; pooling can overcome this.
  • 20. Gov2. A TREC corpus consisting of 25 million documents from US government domains and government-related websites. TREC topics 701-850 are used for evaluation. One of the largest web collections easily available for research purposes.
  • 21. NTCIR (NII Test Collection for IR Systems). Built various test collections of similar sizes to the TREC collections. Focuses on East Asian languages and cross-language information retrieval: a query in one language is run against document collections in more than one language.
  • 23. CLIR: IR/CLIR test collection. The CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean and English (CJKE), such as: Multilingual CLIR (MLIR); Bilingual CLIR (BLIR); Single Language (Monolingual) IR (SLIR).
  • 24. CLQA (Cross Language Q&A data Test Collection). In the CLQA task, the following subtasks were conducted: 1. Japanese to English (J-E) subtask: find answers to Japanese questions in English documents. 2. Chinese to English (C-E) subtask: find answers to Chinese questions in English documents. 3. English to Japanese (E-J) subtask: find answers to English questions in Japanese documents. 4. English to Chinese (E-C) subtask: find answers to English questions in Chinese documents. 5. Chinese to Chinese (C-C) subtask: find answers to Chinese questions in Chinese documents.
  • 25. ACLIA (Advanced Cross-Lingual Information Retrieval and Question Answering). The ACLIA test collection can be used for experiments in complex question answering and information retrieval between Chinese (Simplified (CS), Traditional (CT)), Japanese (JA) and English (EN), such as: CCLQA (Complex Cross-Lingual Question Answering): Cross-Lingual Question Answering (EN-JA/EN-CS/EN-CT subtasks) and Monolingual Question Answering (JA-JA, CS-CS and CT-CT subtasks); IR4QA (Information Retrieval for Question Answering): Cross-Lingual Information Retrieval (EN-JA/EN-CS/EN-CT subtasks) and Monolingual Information Retrieval (JA-JA, CS-CS and CT-CT subtasks).
  • 26. CQA (Community QA Test Collection). This test collection can be used to evaluate the quality of answers on a CQA site. It consists of the following data: 1,500 questions extracted from Yahoo Chiebukuro data version 1; assessment results by four assessors; ID lists, best answer lists, category information, etc.
  • 27. CROSSLINK (Cross-lingual Link Discovery). The Crosslink test collection can be used for experiments in cross-lingual link discovery (CLLD) from English to CJK (Chinese, Japanese and Korean) documents, such as: English to Chinese CLLD (E2C) subtask; English to Japanese CLLD (E2J) subtask; English to Korean CLLD (E2K) subtask.
  • 28. INTENT (INTENT-1). The INTENT (INTENT-1) test collections are the following: (a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection; (b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection; (c) NTCIR-9 INTENT Chinese Document Ranking Test Collection; (d) NTCIR-9 INTENT Japanese Document Ranking Test Collection. Subtopic Mining subtask: given a query, return a ranked list of "subtopic strings". Document Ranking subtask: given a query, return a diversified list of web pages.
  • 29. 1CLICK. The 1CLICK (1CLICK-1) test collection was the test collection used at the NTCIR-9 1CLICK (One Click Access) task. The NTCIR-9 1CLICK task was: given a Japanese query, return a 500-character or 140-character textual output that aims to satisfy the user as quickly as possible. Important pieces of information should be prioritized and the amount of text the user has to read should be minimized.
  • 30. Math. The NTCIR Math Task aims to explore search methods tailored to mathematical content through the design of suitable search tasks and the construction of evaluation datasets. Math Retrieval subtask: given a document collection, retrieve relevant mathematical formulae or documents for a given query. Math Understanding subtask: extract natural language descriptions of mathematical expressions in a document for their semantic interpretation. Relevance judgments: added Oct/14/2014.
  • 31. MuST ("summary and visualization of trend information" test collection). The MuST Corpus (untagged) consists of the 581 articles (27 topics), selected from the two years of 1999, that were used to create the task data. The list of articles is assumed to be obtained by information retrieval. Annotations on the article content: extraction results of important sentences for the summary, and the corresponding semantic processing results.
  • 32. Opinion (Opinion Analysis Task Test Collection). There are 32 topics ranging from 1998 to 2001, each in English, Chinese, and Japanese. The annotations assign opinion tags to sentences in the selected documents that are relevant to the topics. The annotated documents are distributed separately in a sentence-segmented format that aligns with the sentence numbering in the CSV annotation file.
  • 33. MOAT (Multilingual Opinion Analysis Test Collection). The MOAT test collection can be used for experiments in multilingual opinion analysis in Japanese, English, and Chinese (simplified/traditional) (CstJE), such as: opinion judgement; polarity (positive/negative/neutral) judgement; opinion holder identification; opinion target identification; relevance judgement.
  • 34. PATENT (IR Test Collection). The collection consists of document data (Japanese patent applications 1993-1997 and Patent Abstracts of Japan 1993-1997), 101 Japanese search topics (34 topics were translated into English, Simplified and Traditional Chinese, and Korean, respectively), and relevance judgments for each search topic. Japanese patent applications published in 1993-1997 are used for the retrieval task. Each search topic is a claim extracted from Japanese patent applications.
  • 35. Q&A data Test Collection. The collection includes: Document data: Mainichi Newspaper articles 1998-2001. Task data: questions (200, in Japanese), system output, human output and sample answers.
  • 36. SpokenDoc-2 (IR for Spoken Documents): lecture speech, spoken passage, conversation detection. The test collection includes: Spoken Term Detection (STD) task; inexistent Spoken Term Detection (iSTD) task; Spoken Content Retrieval (SCR) task; scoring tool for the STD and iSTD tasks; scoring tool for the SCR task.
  • 37. IR and Term Extraction/Role Analysis Test Collections. The IR test collection includes (1) document data (author abstracts from the Academic Conference Paper Database (1988-1997), i.e. author abstracts of papers presented at academic conferences hosted by any of 65 academic societies in Japan; about 330,000 documents, more than half of which are English-Japanese paired), (2) 83 search topics (Japanese), and (3) relevance judgements. The collection can be used for experiments in Japanese text retrieval and in CLIR, searching either English documents or Japanese-English documents with Japanese topics. The Term Extraction test collection includes a tagged corpus built from 2000 Japanese documents selected from the above IR test collection.
  • 38. SUMM (Text Summarization Test Collection). The collection includes document data (Japanese newspaper articles, Mainichi Newspaper, 1998-1999) and model summaries. Summary data consists of single-document summaries (for each of 60 documents, 7 types of single-document summaries of different lengths, made with different strategies, were prepared by 3 analysts) and multi-document summaries (for each of 50 document collections, summaries of 2 different lengths were prepared by 3 analysts).
  • 39. WEB (IR Test Collection). The WEB test collection consists of "Document Data", which is a collection of text data processed from the crawled documents provided mainly on "Web servers of Japan", and "Task Data", which is a collection of search topics and the relevance judgments of the documents. "Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi2)'. "Document Data", named 'NW1000G-04', consists of web documents of approximately 1400 GB in size and 100 million in number.
  • 41. The CLEF Test Suite. The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003: multilingual text retrieval; bilingual text retrieval; monolingual text retrieval; domain-specific text retrieval.
  • 42. The CLEF Test Suite is composed of: the multilingual document collections; step-by-step documentation on how to perform a system evaluation (in English); tools for results computation; multilingual sets of topics; multilingual sets of relevance assessments; guidelines for participants (in English); tables of the results obtained by the participants; publications.
  • 43. The CLEF Test Suite. Multilingual corpora: English, French, German, Italian, Spanish, Dutch, Swedish, Finnish, Russian, Portuguese.
  • 44. CLEF AdHoc-News Test Suites (2004-2008). The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
  • 45. CLEF Domain Specific Test Suites (2004-2008). The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
  • 46. CLEF Question Answering Test Suites (2003-2008). The CLEF Question Answering Test Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008. This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.
  • 47. Reuters Corpora Reuters is the largest international text and television news agency. Its editorial division produces 11,000 stories a day in 23 languages. Stories are both distributed in real time and made available via online databases and other archival products. Datasets Reuters-21578 : used in text classification RCV1 RCV2 TRC2 13-11-2014 47
  • 48. RCV1 In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. Known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older Reuters-21.578 Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire RCV1 is drawn from one of those online databases. It was intended to consist of all and only English language stories produced by Reuters journalists between August 20, 1996, and August 19,1997 13-11-2014 48
  • 49. RCV2 Multilingual Corpus, 1996-08-20 to 1997-08-19 contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish) 13-11-2014 49
  • 50. Thomson Reuters Text Research Collection (TRC2) The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 to 2009-02-28 Initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (that contains results of a large blog crawl carried out at the University of Glasgow). 13-11-2014 50
  • 51. 20 Newsgroups The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, for his Newsweeder: Learning to filter netnewspaper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. 13-11-2014 51
  • 52. 20 Newsgroups
      Class                      # train docs   # test docs   Total # docs
      alt.atheism                         480           319            799
      comp.graphics                       584           389            973
      comp.os.ms-windows.misc             572           394            966
      comp.sys.ibm.pc.hardware            590           392            982
      comp.sys.mac.hardware               578           385            963
      comp.windows.x                      593           392            985
      misc.forsale                        585           390            975
      rec.autos                           594           395            989
      rec.motorcycles                     598           398            996
      rec.sport.baseball                  597           397            994
      rec.sport.hockey                    600           399            999
      sci.crypt                           595           396            991
      sci.electronics                     591           393            984
      sci.med                             594           396            990
      sci.space                           593           394            987
      soc.religion.christian              598           398            996
      talk.politics.guns                  545           364            909
      talk.politics.mideast               564           376            940
      talk.politics.misc                  465           310            775
      talk.religion.misc                  377           251            628
      Total                             11293          7528          18821
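The totals and the train/test proportion implied by the counts above can be checked with a few lines of Python. This is a minimal sketch that copies the per-class figures as (train, test) pairs; the variable names are illustrative only.

```python
# (train, test) document counts per newsgroup, copied from the slide above.
COUNTS = {
    "alt.atheism": (480, 319),
    "comp.graphics": (584, 389),
    "comp.os.ms-windows.misc": (572, 394),
    "comp.sys.ibm.pc.hardware": (590, 392),
    "comp.sys.mac.hardware": (578, 385),
    "comp.windows.x": (593, 392),
    "misc.forsale": (585, 390),
    "rec.autos": (594, 395),
    "rec.motorcycles": (598, 398),
    "rec.sport.baseball": (597, 397),
    "rec.sport.hockey": (600, 399),
    "sci.crypt": (595, 396),
    "sci.electronics": (591, 393),
    "sci.med": (594, 396),
    "sci.space": (593, 394),
    "soc.religion.christian": (598, 398),
    "talk.politics.guns": (545, 364),
    "talk.politics.mideast": (564, 376),
    "talk.politics.misc": (465, 310),
    "talk.religion.misc": (377, 251),
}

train_total = sum(train for train, _ in COUNTS.values())
test_total = sum(test for _, test in COUNTS.values())
total = train_total + test_total

# Smallest and largest classes by total document count.
smallest = min(COUNTS, key=lambda name: sum(COUNTS[name]))
largest = max(COUNTS, key=lambda name: sum(COUNTS[name]))

print(train_total, test_total, total)             # 11293 7528 18821
print(f"train share: {train_total / total:.1%}")  # train share: 60.0%
print(smallest, largest)  # talk.religion.misc rec.sport.hockey
```

The split is therefore roughly 60% training and 40% test, and the classes are only nearly balanced: the smallest class has 628 documents and the largest 999.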