2. What is a dataset?
Collection of documents
The dataset includes:
Document set: a collection of documents
Query set: a set of information needs, i.e. a collection of questions asking the IR system for results
Relevance judgement set: methods to calculate the relevance between the result set and the queries
3. Standard dataset
Test collections
Consist of a static set of documents
A set of information needs/topics
A set of known relevant documents for each of the information needs
Preferably large-scale
Proper validation
Rapid growth
Great diversity
4. Why we need standard datasets
Test the system
Determine how well IR systems perform (see the scoring sketch after this list)
Compare the performance of the IR system with that of other systems
Compare search algorithms with each other
Compare search strategies with each other
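The slides do not name concrete measures, so as a minimal illustrative sketch (the metrics below are standard IR measures, and the run lists and relevant-document ids are made up), two systems can be compared by scoring their ranked results against the relevance judgements:

```python
# Minimal sketch: comparing two hypothetical systems on one topic using
# standard IR measures (precision@k and average precision).

def precision_at_k(ranked_docs, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if d in relevant) / k

def average_precision(ranked_docs, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

relevant = {"d3", "d7", "d9"}                 # judged relevant for this topic
system_a = ["d3", "d1", "d7", "d2", "d9"]     # ranked output of system A
system_b = ["d1", "d2", "d3", "d4", "d7"]     # ranked output of system B
print(precision_at_k(system_a, relevant))     # 0.6
print(average_precision(system_a, relevant))  # ~0.76: A ranks relevant docs higher
print(average_precision(system_b, relevant))  # ~0.24
```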
5. Standard Datasets in IR
The Cranfield collection
Text Retrieval Conference (TREC)
Gov2
NII Test Collection for IR system (NTCIR)
Cross Language Evaluation Forum (CLEF)
Reuters
20Newsgroups
6. The Cranfield collection
This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness
Created in the 1960s
The Cranfield corpus is a relatively small information retrieval corpus consisting of 1,400 abstracts on aeronautical engineering topics. The documents contain a total of 136,935 terms from a vocabulary of size 4,632 (after stop-word removal).
7. The Cranfield collection
The Cranfield corpus also contains a set of 225 query strings with ground truth relevance judgements.
Source: abstracts of scientific papers from the aeronautical research field, 1945-1962
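As an illustration of how collection statistics like those quoted above (total terms and vocabulary size after stop-word removal) can be reproduced, here is a minimal sketch; the file name and its one-abstract-per-line layout are assumptions, and the exact counts depend on the tokenizer and stop list used:

```python
# Sketch: computing collection statistics for the Cranfield abstracts.
# "cranfield_abstracts.txt" (one abstract per line) is a hypothetical file;
# the counts will vary slightly with the tokenizer and stop-word list.
from sklearn.feature_extraction.text import CountVectorizer

with open("cranfield_abstracts.txt", encoding="utf-8") as f:
    abstracts = [line.strip() for line in f if line.strip()]   # ~1,400 abstracts

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

print("documents:       ", counts.shape[0])
print("vocabulary size: ", counts.shape[1])   # ~4,632 reported above
print("total terms:     ", counts.sum())      # ~136,935 reported above
```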
8. The Cranfield collection
Experimental assumptions
Relevance = topical similarity
Static information need
All documents equally desirable
Relevance judgments are complete and representative of the user population
Drawbacks
Lack of comparison between different systems
The collection is too small
9. TREC
The TREC corpus is a large data set consisting of articles taken from a variety of newswire and other sources
This data set consists of 528,155 documents spanning a total of 165,363,765 terms from a vocabulary of size 629,469 (after stop-word removal)
Also provided are a number of query strings consisting of three parts: a title, a description, and a narrative. Ground-truth judgements are available concerning whether or not each of the documents is relevant to each of the queries.
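For reference, such judgements are normally distributed in the standard TREC "qrels" format, one judgement per line (topic, iteration, document number, relevance); the file path below is a placeholder. A minimal loader might look like this:

```python
# Sketch: loading TREC-style relevance judgements ("qrels").
# Each line has the form: <topic> <iteration> <docno> <relevance>,
# e.g. "301 0 FBIS3-10082 1". "qrels.trec.txt" is a placeholder path.
from collections import defaultdict

def load_qrels(path):
    qrels = defaultdict(dict)                  # topic -> {docno: relevance}
    with open(path, encoding="utf-8") as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            qrels[topic][docno] = int(rel)
    return qrels

qrels = load_qrels("qrels.trec.txt")
relevant_301 = {d for d, r in qrels["301"].items() if r > 0}   # relevant docs for topic 301
```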
10. TREC
Designed for large data collections
Retrieval methods:
Not tied to any application
Ad hoc query: as in a library environment
Routing query: filter the result set
Useful for IR system designers
Good for dedicated searchers rather than novice searchers
12. TREC-Relevance Judgement
Three methods:
Judge all documents for all topics
Judge a random sample of documents
Pooling (see the sketch below):
Divide each run by topic
Select the top-ranked documents from each run
For each topic, merge the results with those of the other systems
Sort by document number
Remove duplicate documents
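A minimal sketch of the pooling steps above; the pool depth, run format, and document numbers are illustrative assumptions:

```python
# Sketch of pooling for one topic: take the top-k documents from each
# system's run, merge them, sort by document number, and drop duplicates
# so that each document is judged only once.

def build_pool(runs, depth=100):
    """runs: one ranked list of document numbers per system, for a single topic."""
    pooled = set()
    for ranked_docs in runs:
        pooled.update(ranked_docs[:depth])     # select top-ranked documents
    return sorted(pooled)                      # sorted by docno, duplicates removed

system_a = ["FT911-3", "FT911-1", "FT911-7"]
system_b = ["FT911-1", "FT911-9", "FT911-2"]
print(build_pool([system_a, system_b], depth=3))
# ['FT911-1', 'FT911-2', 'FT911-3', 'FT911-7', 'FT911-9']  -> sent to the assessors
```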
13. TREC datasets
Contextual Suggestion Track :
To investigate search techniques for complex information needs that are highly dependent on context and user interests.
Clinical Decision Support Track :
To investigate techniques for linking medical cases to information relevant for patient care
Federated Web Search Track :
To investigate techniques for the selection and combination of search results from a large number of real on-line web search services.
Knowledge Base Acceleration Track :
To develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.
14. Microblog Track:
To examine the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter.
Session Track :
To develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session.
Temporal Summarization Track :
To develop systems that allow users to efficiently monitor the information associated with an event over time.
Web Track:
To explore information seeking behaviours common in general web search.
15. Chemical Track :
To develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists.
Crowdsourcing Track :
To provide a collaborative venue for exploring crowd sourcing methods both for evaluating search and for performing search tasks.
Genomics Track:
To study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last run at TREC 2007.
Enterprise Track:
To study search over the data of an organization to complete some task. Last run at TREC 2008.
16. Entity Track :
To perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search.
Cross-Language Track :
To investigate the ability of retrieval systems to find documents topically regardless of source language.
FedWeb Track:
To select the best resources to forward a query to, and to merge the results so that the most relevant are at the top.
Filtering Track:
To make a binary decision on whether to retrieve each new incoming document, given a stable information need.
HARD Track:
To achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
17. Interactive Track :
To study user interaction with text retrieval systems.
Legal Track :
To develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections.
Medical Records Track :
To explore methods for searching unstructured information found in patient medical records.
Novelty Track :
To investigate systems' abilities to locate new (i.e., non-redundant) information.
18. Question Answering Track :
To achieve more than just document retrieval by answering factoid, list, and definition-style questions.
Robust Retrieval Track :
To focus on individual topic effectiveness.
Relevance Feedback Track:
To further the deep evaluation of relevance feedback processes.
Spam Track :
To provide a standard evaluation of current and proposed spam filtering approaches.
Terabyte Track:
To investigate whether/how the IR community can scale traditional IR test-collection-based evaluation to significantly larger collections.
Video Track:
To research automatic segmentation, indexing, and content-based retrieval of digital video. In 2003, this track became its own independent evaluation, named TRECVID.
19. Advantages
Larger Collections
Better results
Usable in many tasks
Filtering
Web search
Video retrieval
CLEF
NTCIR
GOV2
Drawback: Complete judgement is impossible
Pooling can overcome this
20. Gov2
A TREC corpus consisting of 25 million documents from US government domains and government-related websites
TREC topics 701-850 used for evaluation
One of the largest web collections easily available for research purposes
21. NTCIR - NII Test Collection for IR Systems
Has built various test collections of sizes similar to TREC's
Focuses on East Asian languages and cross-language information retrieval
Queries in one language against document collections in one or more other languages
23. CLIR: IR/CLIR test collection
The CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean, and English (CJKE), such as
Multilingual CLIR (MLIR)
Bilingual CLIR (BLIR)
Single Language (Monolingual) IR (SLIR).
24. CLQA (Cross-Language Q&A Test Collection)
In the CLQA task, the following subtasks were conducted:
1. Japanese to English (J-E) subtask: find answers to Japanese questions in English documents.
2. Chinese to English (C-E) subtask: find answers to Chinese questions in English documents.
3. English to Japanese (E-J) subtask: find answers to English questions in Japanese documents.
4. English to Chinese (E-C) subtask: find answers to English questions in Chinese documents.
5. Chinese to Chinese (C-C) subtask: find answers to Chinese questions in Chinese documents.
25. ACLIA (Advanced Cross-Lingual Information Retrieval and Question Answering)
The ACLIA test collection can be used for experiments in complex question answering and information retrieval between Chinese (Simplified (CS), Traditional (CT)), Japanese (JA), and English (EN), such as
CCLQA (Complex Cross-Lingual Question Answering)
Cross-Lingual Question Answering (EN-JA/EN-CS/EN-CT subtask)
Monolingual Question Answering (JA-JA, CS-CS, and CT-CT subtask)
IR4QA (Information Retrieval for Question Answering)
Cross-Lingual Information Retrieval (EN-JA/EN-CS/EN-CT subtask)*
Monolingual Information Retrieval (JA-JA, CS-CS, and CT-CT subtask)
26. CQA (Community QA Test Collection)
This test collection can be used to evaluate the quality of the answers on the CQA site. The test collection consists of the following data:
1,500 questions extracted from the Yahoo! Chiebukuro data, version 1
Assessment results by four assessors
ID lists, best answer lists, and category information, etc.
27. CROSSLINK (Cross-lingual Link Discovery)
The Crosslink test collection can be used for experiments in cross-lingual link discovery from English to CJK (Chinese, Japanese and Korean) documents, such as
English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask
28. INTENT (INTENT-1)
The INTENT (INTENT-1) test collections are the following:
(a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection
(b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection
(c) NTCIR-9 INTENT Chinese Document Ranking Test Collection
(d) NTCIR-9 INTENT Japanese Document Ranking Test Collection
Subtopic Mining Subtask: given a query, return a ranked list of "subtopic strings."
Document Ranking Subtask: given a query, return a diversified list of web pages.
29. 1CLICK
The 1CLICK (1CLICK-1) Test Collection was the test collection used at the NTCIR-9 1CLICK (One Click Access) task.
The NTCIR-9 1CLICK task was: given a Japanese query, return a 500-character or 140-character textual output that aims to satisfy the user as quickly as possible. Important pieces of information should be prioritized and the amount of text the user has to read should be minimized.
30. Math
The NTCIR Math Task aims to explore search methods tailored to mathematical content through the design of suitable search tasks and the construction of evaluation datasets.
Math Retrieval Subtask: given a document collection, retrieve relevant mathematical formulae or documents for a given query.
Math Understanding Subtask: extract natural language descriptions of mathematical expressions in a document for their semantic interpretation.
Relevance judgments (added 14 Oct 2014)
31. MuST ("Summary and Visualization of Trend Information" Test Collection)
The MuST corpus (untagged), drawn from two years of newspaper articles up to 1999, comprises the 581 articles (27 topics) that were used to create the task data.
A list of articles is assumed to have been obtained by information retrieval.
Annotations on the article content: extraction results for the important sentences of a summary, together with the corresponding semantic processing results.
32. Opinion (Opinion Analysis Task Test Collection)
There are 32 topics ranging from 1998-2001, each in English, Chinese, and Japanese.
The annotations assign opinion tags to sentences in the selected documents that are relevant to the topics.
The documents that are annotated are separately distributed in a sentence-segmented format that aligns with the sentence numbering in the CSV annotation file
33. MOAT (Multilingual Opinion Analysis Test Collection)
The MOAT test collection can be used for experiments in multilingual opinion analysis in Japanese, English, and Chinese (simplified/traditional) (CstJE), such as
Opinion Judgement
Polarity (Positive/Negative/Neutral) Judgement
Opinion Holder Identification
Opinion Target Identification
Relevance Judgement.
34. PATENT (IR Test Collection)
The collection consists of document data (Japanese patent applications 1993-1997 and Patent Abstracts of Japan 1993-1997), 101 Japanese search topics (34 of which were translated into English, Simplified and Traditional Chinese, and Korean), and relevance judgments for each search topic.
Japanese patent applications published in 1993-1997 are used for the retrieval task.
Each search topic is a claim extracted from a Japanese patent application.
35. Q&A data Test Collection
The collection includes:
Document data: Mainichi Newspaper articles 1998-2001
Task data: questions (200, in Japanese), system outputs, human outputs, and sample answers
36. SpokenDoc-2 (IR for Spoken Documents)
Covers lecture speech, spoken passages, and conversation detection.
The test collection includes:
Spoken Term Detection (STD) task
Inexistent Spoken Term Detection (iSTD) task
Spoken Content Retrieval (SCR) task
Scoring tools for the STD and iSTD tasks
Scoring tool for the SCR task
37. IR and Term Extraction/Role Analysis Test Collections
The IR test collection includes:
Document data: author abstracts from the Academic Conference Paper Database (1988-1997), i.e. abstracts of papers presented at academic conferences hosted by any of 65 academic societies in Japan; about 330,000 documents, more than half of which are English-Japanese paired
83 search topics (in Japanese)
Relevance judgements
The collection can be used for experiments in Japanese text retrieval and for CLIR, searching either English documents or Japanese-English documents with Japanese topics.
The Term Extraction test collection includes a tagged corpus of 2,000 Japanese documents selected from the above IR test collection.
38. SUMM (Text Summarization Test Collection)
The collection includes:
Document data: Japanese newspaper articles from the Mainichi Newspaper (1998-1999)
Model summaries. The summary data consists of:
Single-document summaries (for each of 60 documents, 7 types of single-document summaries of different lengths, prepared with different strategies, were produced by 3 analysts) and
Multi-document summaries (for each of 50 document collections, summaries of 2 different lengths were prepared by 3 analysts).
39. WEB (IR Test Collection)
The WEB test collection consists of "Document Data", a collection of text data processed from crawled documents provided mainly on web servers in Japan, and "Task Data", a collection of search topics and relevance judgments for the documents.
The "Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi2)'. The "Document Data", named 'NW1000G-04', consists of approximately 100 million web documents totalling about 1,400 GB.
41. The CLEF Test Suite
The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003:
Multilingual text retrieval
Bilingual text retrieval
Monolingual text retrieval
Domain-specific text retrieval
42. The CLEF Test Suite
The CLEF Test Suite is composed of:
The multilingual document collections
Step-by-step documentation on how to perform a system evaluation (in English)
Tools for results computation
Multilingual sets of topics
Multilingual sets of relevance assessments
Guidelines for participants (in English)
Tables of the results obtained by the participants
Publications
43. The CLEF Test Suite
Multilingual corpora: English, French, German, Italian, Spanish, Dutch, Swedish, Finnish, Russian, Portuguese
44. CLEF AdHoc-News Test Suites (2004-2008)
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
45. CLEF Domain Specific Test Suites (2004-2008)
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
46. CLEF Question Answering Test Suites (2003-2008)
The CLEF Question Answering Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.
47. Reuters Corpora
Reuters is the largest international text and television news agency. Its editorial division produces 11,000 stories a day in 23 languages.
Stories are both distributed in real time and made available via online databases and other archival products.
Datasets
Reuters-21578: used in text classification
RCV1
RCV2
TRC2
48. RCV1
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems.
Known as "Reuters Corpus, Volume 1" or RCV1, it is significantly larger than the older Reuters-21578.
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories.
RCV1 is drawn from one of those online databases. It was intended to consist of all and only the English-language stories produced by Reuters journalists between August 20, 1996, and August 19, 1997.
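For experiments, a pre-vectorised version of RCV1 can be fetched through scikit-learn; this is a convenience of that library (TF-IDF features only), not the original Reuters distribution of raw stories:

```python
# Sketch: loading the vectorised RCV1 benchmark via scikit-learn.
from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1()
print(rcv1.data.shape)                # (804414, 47236): documents x term features
print(rcv1.target.shape)              # (804414, 103): multi-label topic assignments
print(list(rcv1.target_names[:5]))    # a few of the 103 topic codes
```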
49. RCV2
Multilingual Corpus, 1996-08-20 to 1997-08-19
contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
50. Thomson Reuters Text Research Collection (TRC2)
The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 to 2009-02-28
Initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (which contains the results of a large blog crawl carried out at the University of Glasgow).
51. 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
It was originally collected by Ken Lang, for his paper "NewsWeeder: Learning to Filter Netnews", though he does not explicitly mention this collection.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
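A minimal sketch of such an experiment, using the copy of the collection bundled with scikit-learn (the preprocessing options and the choice of classifier are illustrative, not part of the original collection):

```python
# Sketch: text classification on 20 Newsgroups using scikit-learn's bundled copy.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = MultinomialNB().fit(X_train, train.target)
print(accuracy_score(test.target, clf.predict(X_test)))   # rough classification baseline
```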