2. What is a dataset?
Collection of documents
The dataset includes:
Document set: a collection of documents
Query set: a set of information needs, i.e. a collection of questions asking the IR system for results
Relevance judgement set: methods to calculate the relevance between the result set and the queries
3. Standard dataset
Test collections
Consist of a static set of documents
A set of information needs/topics
A set of known relevant documents for each of the information needs
Preferably large-scale
Proper validation
Rapid growth
Great diversity
4. Why we need standard datasets
Test the system
Determine how well IR systems perform (see the scoring sketch after this list)
Compare the performance of the IR system with that of other systems
Compare search algorithms with each other
Compare search strategies with each other
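The slides do not name concrete measures, so as a minimal illustrative sketch (the metrics below are standard IR measures, and the run lists and relevant-document ids are made up), two systems can be compared by scoring their ranked results against the relevance judgements:

```python
# Minimal sketch: comparing two hypothetical systems on one topic using
# standard IR measures (precision@k and average precision).

def precision_at_k(ranked_docs, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if d in relevant) / k

def average_precision(ranked_docs, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

relevant = {"d3", "d7", "d9"}                 # judged relevant for this topic
system_a = ["d3", "d1", "d7", "d2", "d9"]     # ranked output of system A
system_b = ["d1", "d2", "d3", "d4", "d7"]     # ranked output of system B
print(precision_at_k(system_a, relevant))     # 0.6
print(average_precision(system_a, relevant))  # ~0.76: A ranks relevant docs higher
print(average_precision(system_b, relevant))  # ~0.24
```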
5. Standard Datasets in IR
The Cranfield collection
Text Retrieval Conference (TREC)
Gov2
NII Test Collection for IR system (NTCIR)
Cross Language Evaluation Forum (CLEF)
Reuters
20Newsgroups
6. The Cranfield collection
This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness
Created in the 1960s
The Cranfield corpus is a relatively small information retrieval corpus consisting of 1,400 abstracts on aeronautical engineering topics. The documents contain a total of 136,935 terms from a vocabulary of size 4,632 (after stop-word removal).
7. The Cranfield collection
The Cranfield corpus also contains a set of 225 query strings with ground truth relevance judgements.
Source: abstracts of scientific papers from the aeronautical research field, 1945-1962
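As an illustration of how collection statistics like those quoted above (total terms and vocabulary size after stop-word removal) can be reproduced, here is a minimal sketch; the file name and its one-abstract-per-line layout are assumptions, and the exact counts depend on the tokenizer and stop list used:

```python
# Sketch: computing collection statistics for the Cranfield abstracts.
# "cranfield_abstracts.txt" (one abstract per line) is a hypothetical file;
# the counts will vary slightly with the tokenizer and stop-word list.
from sklearn.feature_extraction.text import CountVectorizer

with open("cranfield_abstracts.txt", encoding="utf-8") as f:
    abstracts = [line.strip() for line in f if line.strip()]   # ~1,400 abstracts

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

print("documents:       ", counts.shape[0])
print("vocabulary size: ", counts.shape[1])   # ~4,632 reported above
print("total terms:     ", counts.sum())      # ~136,935 reported above
```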
8. The Cranfield collection
Experimental assumptions
Relevance = topical similarity
Static information need
All documents equally desirable
Relevance judgments are complete and representative of the user population
Drawbacks
Lack of comparison between different systems
The collection is too small
9. TREC
The TREC corpus is a large data set consisting of articles taken from a variety of newswire and other sources
This data set consists of 528,155 documents spanning a total of 165,363,765 terms from a vocabulary of size 629,469 (after stop-word removal)
Also provided are a number of query strings consisting of three parts: a title, a description, and a narrative. Ground-truth judgements are available concerning whether or not each of the documents is relevant to each of the queries.
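For reference, such judgements are normally distributed in the standard TREC "qrels" format, one judgement per line (topic, iteration, document number, relevance); the file path below is a placeholder. A minimal loader might look like this:

```python
# Sketch: loading TREC-style relevance judgements ("qrels").
# Each line has the form: <topic> <iteration> <docno> <relevance>,
# e.g. "301 0 FBIS3-10082 1". "qrels.trec.txt" is a placeholder path.
from collections import defaultdict

def load_qrels(path):
    qrels = defaultdict(dict)                  # topic -> {docno: relevance}
    with open(path, encoding="utf-8") as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            qrels[topic][docno] = int(rel)
    return qrels

qrels = load_qrels("qrels.trec.txt")
relevant_301 = {d for d, r in qrels["301"].items() if r > 0}   # relevant docs for topic 301
```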
10. TREC
Designed for large data collections
Retrieval methods:
Not tied to any application
Ad hoc query: as in a library environment
Routing query: filter the result set
Useful for IR system designers
Good for dedicated searchers rather than novice searchers
12. TREC-Relevance Judgement
Three methods:
Judge all documents for all topics
Judge a random sample of documents
Pooling (see the sketch below):
Divide each run by topic
Select the top-ranked documents from each run
For each topic, merge the results with those of the other systems
Sort by document number
Remove duplicate documents
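A minimal sketch of the pooling steps above; the pool depth, run format, and document numbers are illustrative assumptions:

```python
# Sketch of pooling for one topic: take the top-k documents from each
# system's run, merge them, sort by document number, and drop duplicates
# so that each document is judged only once.

def build_pool(runs, depth=100):
    """runs: one ranked list of document numbers per system, for a single topic."""
    pooled = set()
    for ranked_docs in runs:
        pooled.update(ranked_docs[:depth])     # select top-ranked documents
    return sorted(pooled)                      # sorted by docno, duplicates removed

system_a = ["FT911-3", "FT911-1", "FT911-7"]
system_b = ["FT911-1", "FT911-9", "FT911-2"]
print(build_pool([system_a, system_b], depth=3))
# ['FT911-1', 'FT911-2', 'FT911-3', 'FT911-7', 'FT911-9']  -> sent to the assessors
```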
13. TREC datasets
Contextual Suggestion Track :
To investigate search techniques for complex information needs that are highly dependent on context and user interests.
Clinical Decision Support Track :
To investigate techniques for linking medical cases to information relevant for patient care
Federated Web Search Track :
To investigate techniques for the selection and combination of search results from a large number of real on-line web search services.
Knowledge Base Acceleration Track :
To develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.
14. Microblog Track:
To examine the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter.
Session Track :
To develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session.
Temporal Summarization Track :
To develop systems that allow users to efficiently monitor the information associated with an event over time.
Web Track:
To explore information seeking behaviours common in general web search.
15. Chemical Track :
To develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists.
Crowdsourcing Track :
To provide a collaborative venue for exploring crowd sourcing methods both for evaluating search and for performing search tasks.
Genomics Track:
To study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last run at TREC 2007.
Enterprise Track:
To study search over the data of an organization to complete some task. Last run at TREC 2008.
16. Entity Track :
To perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search.
Cross-Language Track :
To investigate the ability of retrieval systems to find documents topically regardless of source language.
FedWeb Track:
To select the best resources to forward a query to, and to merge the results so that the most relevant are at the top.
Filtering Track:
To make a binary decision on whether to retrieve each new incoming document, given a stable information need.
HARD Track:
To achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
17. Interactive Track :
To study user interaction with text retrieval systems.
Legal Track :
To develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections.
Medical Records Track :
To explore methods for searching unstructured information found in patient medical records.
Novelty Track :
To investigate systems' abilities to locate new (i.e., non-redundant) information.
18. Question Answering Track :
To achieve more than just document retrieval by answering factoid, list, and definition-style questions.
Robust Retrieval Track :
To focus on individual topic effectiveness.
Relevance Feedback Track:
To further the deep evaluation of relevance feedback processes.
Spam Track :
To provide a standard evaluation of current and proposed spam filtering approaches.
Terabyte Track:
To investigate whether/how the IR community can scale traditional IR test-collection-based evaluation to significantly larger collections.
Video Track:
To research automatic segmentation, indexing, and content-based retrieval of digital video. In 2003, this track became its own independent evaluation, named TRECVID.
19. Advantages
Larger Collections
Better results
Usable in many tasks
Filtering
Web search
Video retrieval
CLEF
NTCIR
GOV2
Drawback: Complete judgement is impossible
Pooling can overcome this
20. Gov2
A TREC corpus consisting of 25 million documents from US government domains and government-related websites
TREC topics 701-850 used for evaluation
One of the largest web collections easily available for research purposes
21. NTCIR - NII Test Collection for IR Systems
Has built various test collections of sizes similar to TREC's
Focuses on East Asian languages and cross-language information retrieval
Queries in one language against document collections in one or more other languages
23. CLIR: IR/CLIR test collection
The CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean, and English (CJKE), such as
Multilingual CLIR (MLIR)
Bilingual CLIR (BLIR)
Single Language (Monolingual) IR (SLIR).
24. CLQA (Cross-Language Q&A Test Collection)
In the CLQA task, the following subtasks were conducted:
1. Japanese to English (J-E) subtask: find answers to Japanese questions in English documents.
2. Chinese to English (C-E) subtask: find answers to Chinese questions in English documents.
3. English to Japanese (E-J) subtask: find answers to English questions in Japanese documents.
4. English to Chinese (E-C) subtask: find answers to English questions in Chinese documents.
5. Chinese to Chinese (C-C) subtask: find answers to Chinese questions in Chinese documents.
25. ACLIA (Advanced Cross-Lingual Information Retrieval and Question Answering)
The ACLIA test collection can be used for experiments in complex question answering and information retrieval between Chinese (Simplified (CS), Traditional (CT)), Japanese (JA), and English (EN), such as
CCLQA (Complex Cross-Lingual Question Answering)
Cross-Lingual Question Answering (EN-JA/EN-CS/EN-CT subtask)
Monolingual Question Answering (JA-JA, CS-CS, and CT-CT subtask)
IR4QA (Information Retrieval for Question Answering)
Cross-Lingual Information Retrieval (EN-JA/EN-CS/EN-CT subtask)*
Monolingual Information Retrieval (JA-JA, CS-CS, and CT-CT subtask)
26. CQA (Community QA Test Collection)
This test collection can be used to evaluate the quality of the answers on the CQA site. The test collection consists of the following data:
1,500 questions extracted from the Yahoo! Chiebukuro data, version 1
Assessment results by four assessors
ID lists, best answer lists, and category information, etc.
27. CROSSLINK (Cross-lingual Link Discovery)
The Crosslink test collection can be used for experiments in cross-lingual link discovery from English to CJK (Chinese, Japanese and Korean) documents, such as
English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask
28. INTENT (INTENT-1)
The INTENT (INTENT-1) test collections are the following:
(a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection
(b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection
(c) NTCIR-9 INTENT Chinese Document Ranking Test Collection
(d) NTCIR-9 INTENT Japanese Document Ranking Test Collection
Subtopic Mining Subtask: given a query, return a ranked list of "subtopic strings."
Document Ranking Subtask: given a query, return a diversified list of web pages.
29. 1CLICK
The 1CLICK (1CLICK-1) Test Collection was the test collection used at the NTCIR-9 1CLICK (One Click Access) task.
The NTCIR-9 1CLICK task was: given a Japanese query, return a 500-character or 140-character textual output that aims to satisfy the user as quickly as possible. Important pieces of information should be prioritized and the amount of text the user has to read should be minimized.
30. Math
The NTCIR Math Task aims to explore search methods tailored to mathematical content through the design of suitable search tasks and the construction of evaluation datasets.
Math Retrieval Subtask: given a document collection, retrieve relevant mathematical formulae or documents for a given query.
Math Understanding Subtask: extract natural language descriptions of mathematical expressions in a document for their semantic interpretation.
Relevance judgments (added 14 Oct 2014)
31. MuST ("Summary and Visualization of Trend Information" Test Collection)
The MuST corpus (untagged), drawn from two years of newspaper articles up to 1999, comprises the 581 articles (27 topics) that were used to create the task data.
A list of articles is assumed to have been obtained by information retrieval.
Annotations on the article content: extraction results for the important sentences of a summary, together with the corresponding semantic processing results.
32. Opinion (Opinion Analysis Task Test Collection)
There are 32 topics ranging from 1998-2001, each in English, Chinese, and Japanese.
The annotations assign opinion tags to sentences in the selected documents that are relevant to the topics.
The documents that are annotated are separately distributed in a sentence-segmented format that aligns with the sentence numbering in the CSV annotation file
33. MOAT (Multilingual Opinion Analysis Test Collection)
The MOAT test collection can be used for experiments in multilingual opinion analysis in Japanese, English, and Chinese (simplified/traditional) (CstJE), such as
Opinion Judgement
Polarity (Positive/Negative/Neutral) Judgement
Opinion Holder Identification
Opinion Target Identification
Relevance Judgement.
34. PATENT (IR Test Collection)
The collection consists of document data (Japanese patent applications 1993-1997 and Patent Abstracts of Japan 1993-1997), 101 Japanese search topics (34 of which were translated into English, Simplified and Traditional Chinese, and Korean), and relevance judgments for each search topic.
Japanese patent applications published in 1993-1997 are used for the retrieval task.
Each search topic is a claim extracted from a Japanese patent application.
35. Q&A data Test Collection
The collection includes:
Document data: Mainichi Newspaper articles 1998-2001
Task data: questions (200, in Japanese), system outputs, human outputs, and sample answers
36. SpokenDoc-2 (IR for Spoken Documents)
Covers lecture speech, spoken passages, and conversation detection.
The test collection includes:
Spoken Term Detection (STD) task
Inexistent Spoken Term Detection (iSTD) task
Spoken Content Retrieval (SCR) task
Scoring tools for the STD and iSTD tasks
Scoring tool for the SCR task
37. IR and Term Extraction/Role Analysis Test Collections
The IR test collection includes:
Document data: author abstracts from the Academic Conference Paper Database (1988-1997), i.e. abstracts of papers presented at academic conferences hosted by any of 65 academic societies in Japan; about 330,000 documents, more than half of which are English-Japanese paired
83 search topics (in Japanese)
Relevance judgements
The collection can be used for experiments in Japanese text retrieval and for CLIR, searching either English documents or Japanese-English documents with Japanese topics.
The Term Extraction test collection includes a tagged corpus of 2,000 Japanese documents selected from the above IR test collection.
38. SUMM (Text Summarization Test Collection)
The collection includes:
Document data: Japanese newspaper articles from the Mainichi Newspaper (1998-1999)
Model summaries. The summary data consists of:
Single-document summaries (for each of 60 documents, 7 types of single-document summaries of different lengths, prepared with different strategies, were produced by 3 analysts) and
Multi-document summaries (for each of 50 document collections, summaries of 2 different lengths were prepared by 3 analysts).
39. WEB (IR Test Collection)
The WEB test collection consists of "Document Data", a collection of text data processed from crawled documents provided mainly on web servers in Japan, and "Task Data", a collection of search topics and relevance judgments for the documents.
The "Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi2)'. The "Document Data", named 'NW1000G-04', consists of approximately 100 million web documents totalling about 1,400 GB.
41. The CLEF Test Suite
The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003:
Multilingual text retrieval
Bilingual text retrieval
Monolingual text retrieval
Domain-specific text retrieval
42. The CLEF Test Suite
The CLEF Test Suite is composed of:
The multilingual document collections
Step-by-step documentation on how to perform a system evaluation (in English)
Tools for results computation
Multilingual sets of topics
Multilingual sets of relevance assessments
Guidelines for participants (in English)
Tables of the results obtained by the participants
Publications
43. The CLEF Test Suite
Multilingual corpora: English, French, German, Italian, Spanish, Dutch, Swedish, Finnish, Russian, Portuguese
44. CLEF AdHoc-News Test Suites (2004-2008)
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
45. CLEF Domain Specific Test Suites (2004-2008)
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
46. CLEF Question Answering Test Suites (2003-2008)
The CLEF Question Answering Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.
47. Reuters Corpora
Reuters is the largest international text and television news agency. Its editorial division produces 11,000 stories a day in 23 languages.
Stories are both distributed in real time and made available via online databases and other archival products.
Datasets
Reuters-21578: used in text classification
RCV1
RCV2
TRC2
48. RCV1
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems.
Known as "Reuters Corpus, Volume 1" or RCV1, it is significantly larger than the older Reuters-21578.
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories.
RCV1 is drawn from one of those online databases. It was intended to consist of all and only the English-language stories produced by Reuters journalists between August 20, 1996, and August 19, 1997.
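For experiments, a pre-vectorised version of RCV1 can be fetched through scikit-learn; this is a convenience of that library (TF-IDF features only), not the original Reuters distribution of raw stories:

```python
# Sketch: loading the vectorised RCV1 benchmark via scikit-learn.
from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1()
print(rcv1.data.shape)                # (804414, 47236): documents x term features
print(rcv1.target.shape)              # (804414, 103): multi-label topic assignments
print(list(rcv1.target_names[:5]))    # a few of the 103 topic codes
```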
49. RCV2
Multilingual Corpus, 1996-08-20 to 1997-08-19
contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish)
50. Thomson Reuters Text Research Collection (TRC2)
The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 to 2009-02-28
Initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (which contains the results of a large blog crawl carried out at the University of Glasgow).
51. 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
It was originally collected by Ken Lang, for his paper "NewsWeeder: Learning to Filter Netnews", though he does not explicitly mention this collection.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
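A minimal sketch of such an experiment, using the copy of the collection bundled with scikit-learn (the preprocessing options and the choice of classifier are illustrative, not part of the original collection):

```python
# Sketch: text classification on 20 Newsgroups using scikit-learn's bundled copy.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = MultinomialNB().fit(X_train, train.target)
print(accuracy_score(test.target, clf.predict(X_test)))   # rough classification baseline
```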