RIA failure analysis process [Buckley+04, p.6]
1. The topic (or pair of topics) for the day was determined, with a leader being assigned the topic, on a rotating
basis among all participants
2. Each participant was assigned one of the six standard runs (or systems) to examine, either individually or as a team.
3. Each participant or team spent from 60 to 90 minutes investigating how their assigned system did on the assigned
topic, examining how the system did absolutely, how it did compared to the other systems, and how performance
could be improved for it. A template (see Figure 1) was generally filled out to guide both the investigation and
4. All participants assigned to a topic discussed the topic for 20 to 30 minutes, in separate rooms if there were two
topics. The failures of each system were discussed, along with any conclusions about the difficulty of the topic
5. The topic leader summarized the results of the discussion in a short report (a template was developed for this by
week 3 of the workshop). If there were 2 topics assigned for the day, each leader would give a short presentation
on the results to the workshop as a whole.
RIA failure analysis categorisation [Buckley+04, pp.10‐12]
1. General success ‐ present systems worked well
2. General technical failure (stemming, tokenization)
3. All systems emphasize one aspect; missing another required term
4. All systems emphasize one aspect; missing another aspect
5. Some systems emphasize one aspect; some another; need both
6. All systems emphasize one irrelevant aspect; missing point of topic
7. Need outside expansion of “general” term (Europe for example)
8. Need QA query analysis and relationships
9. Systems missed difficult aspect that would need human help
10. Need proximity relationship between two aspects
Example topics:
• Antarctic vs Antarctica
• “What disasters have occurred in tunnels used ...
• “How much sugar does Cuba export and which countries import it?”
• “What are new methods of ...
• “What countries are experiencing an increase in tourism?”
Categorisation done by
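Category 10 (a proximity relationship between two aspects) can be illustrated with a small check on how close two query terms occur in a document. This is a minimal sketch, not the method used at the RIA workshop: the function names, the example sentence and the window size are all assumptions made for illustration.

```python
def min_term_distance(tokens, term_a, term_b):
    """Smallest distance (in tokens) between any occurrence of term_a
    and any occurrence of term_b; None if either term is absent."""
    last_a = last_b = None
    best = None
    for i, tok in enumerate(tokens):
        if tok == term_a:
            last_a = i
        if tok == term_b:
            last_b = i
        if last_a is not None and last_b is not None:
            distance = abs(last_a - last_b)
            if best is None or distance < best:
                best = distance
    return best


def within_window(tokens, term_a, term_b, window):
    """True if the two terms co-occur within `window` tokens."""
    distance = min_term_distance(tokens, term_a, term_b)
    return distance is not None and distance <= window


# Hypothetical document text, invented for this example.
doc = "sugar exports from cuba rose while cuba imports oil".split()
print(min_term_distance(doc, "cuba", "sugar"))
print(within_window(doc, "cuba", "oil", window=2))
```

A bag-of-words ranker would treat both term pairs alike; a proximity feature of this kind separates documents where the two aspects are actually related from those where they merely co-occur.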
RIA failure analysis conclusions [Buckley+04, p.12]
• The first conclusion is that the root cause of poor performance on any
one topic is likely to be the same for all systems.
• The other major conclusion to be reached from these category
assignments is that if a system can realize the problem associated with a
given topic, then for well over half the topics studied (at least categories
1 through 5), current technology should be able to improve results
significantly. This suggests it may be more important for research to
discover what current techniques should be applied to which topics, than
to come up with new techniques.
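One way to read the second conclusion: if the right existing technique could be chosen for each topic, the average score would be at least as high as that of any single technique. A minimal sketch, assuming hypothetical per-topic average precision (AP) values; the technique names, topic IDs and numbers are invented for illustration and do not come from [Buckley+04].

```python
# Hypothetical AP scores: technique -> topic -> AP (illustrative values only).
ap_scores = {
    "baseline":  {"t1": 0.40, "t2": 0.10, "t3": 0.30},
    "expansion": {"t1": 0.20, "t2": 0.30, "t3": 0.35},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def map_per_technique(scores):
    """Mean average precision of each technique over all topics."""
    return {tech: mean(by_topic.values()) for tech, by_topic in scores.items()}


def oracle_map(scores):
    """MAP obtained if the best technique were picked for every topic."""
    topics = next(iter(scores.values())).keys()
    return mean(max(by_topic[t] for by_topic in scores.values()) for t in topics)


single = map_per_technique(ap_scores)
oracle = oracle_map(ap_scores)
print(single)
print(oracle)
```

The gap between the oracle score and the best single technique is an upper bound on what per-topic selection could gain; a real selector would have to predict the right technique without seeing the relevance judgments.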
Improvements that don’t add up [Armstrong+09]
Armstrong et al. analysed 106 papers from SIGIR ’98‐’08 and
CIKM ’04‐’08 that used TREC data, and reported:
• Researchers often use low baselines
• Researchers claim statistically significant improvements,
but the results are often not competitive with the best
• IR effectiveness has not really improved over a decade!
[Figure: “What we want” vs “What we’ve got?”]
“Running on the spot?” [Armstrong+09]
[Figure: each line represents an improvement over a baseline]
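The low-baseline problem can be made concrete with simple arithmetic: the same result looks like a gain against a weak baseline and like a loss against the best earlier result. A minimal sketch with hypothetical scores; the three numbers are invented for illustration and are not taken from [Armstrong+09].

```python
def relative_improvement(new_score, reference_score):
    """Relative change of new_score with respect to reference_score."""
    return (new_score - reference_score) / reference_score


# Hypothetical MAP values on one test collection (illustrative only).
paper_baseline = 0.20   # baseline chosen by the paper
paper_proposed = 0.24   # the paper's proposed method
best_prior = 0.30       # best previously reported result on the same data

gain_vs_baseline = relative_improvement(paper_proposed, paper_baseline)
gain_vs_best = relative_improvement(paper_proposed, best_prior)

print(f"vs. own baseline: {gain_vs_baseline:+.0%}")
print(f"vs. best prior:   {gain_vs_best:+.0%}")
```

Reporting both numbers side by side shows whether an improvement over the chosen baseline also advances the state of the art.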
• Armstrong, T.G., Moffat, A., Webber, W. and Zobel, J.: Improvements that
Don’t Add Up: Ad‐hoc Retrieval Results Since 1998, ACM CIKM 2009.
• Buckley, C. and Harman, D.: Reliable Information Access Final Workshop Report.
• Harman, D. and Buckley, C.: The NRRC Reliable Information Access (RIA)
Workshop, ACM SIGIR 2004, pp.528‐529, 2004.
• Sakai: A Study of Progress in Document Retrieval Technology Based on
Official NTCIR Results (in Japanese), FIT
References (2) – Available from
• Fujii, A., Iwayama, M. and Kando, N.: Overview of the Patent Retrieval
Task at the NTCIR‐6 Workshop, NTCIR‐6, pp.359‐365, 2007.
• Goto, I., Chow, K.P., Lu, B., Sumita, E. and Tsou, B.K.: Overview of the
Patent Machine Translation Task at the NTCIR‐10 Workshop, NTCIR‐10,
• Kishida, K., Chen, K.‐H., Lee, S., Kuriyama, K., Kando, N. and Chen, H.‐H.:
Overview of CLIR Task at the Sixth NTCIR Workshop, NTCIR‐6, 2007.
• Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., Song, R., Kato, M.P.
and Iwata, M.: Overview of the NTCIR‐10 INTENT‐2 Task, NTCIR‐10
Proceedings, pp.94‐123, 2013.