We describe a pilot experiment to update the program of an Information Retrieval course for Computer Science undergraduates. We engaged the students in the development of a search engine from scratch, and they also built, from scratch, a complete test collection to evaluate their systems. With this methodology they gain a complete view of the Information Retrieval process as they would find it in a real-world setting, and their direct involvement in the evaluation makes them realize the importance of such laboratory experiments in Computer Science. We show that this methodology is both reliable and feasible, so we plan to refine it and keep using it in the coming years, leading to a public repository of resources for Information Retrieval courses.
Bringing Undergraduate Students Closer to a Real-World Information Retrieval Setting: Methodology and Resources
1. Bringing Undergraduate Students
Closer to a Real-World
Information Retrieval Setting:
Methodology and Resources
Julián Urbano, Mónica Marrero,
Diego Martín and Jorge Morato
http://julian-urbano.info
Twitter: @julian_urbano
ACM ITiCSE 2011
Darmstadt, Germany · June 29th
2. 3
Information Retrieval
• Automatically search documents
▫ Effectively
▫ Efficiently
• Expected growth between 2009 and 2020 [Gantz et al., 2010]
▫ 44 times more digital information
▫ 1.4 times more staff and investment
• IR is very important for the world
• IR is clearly important for Computer Science students
▫ Usually not in the core program of CS majors
[Fernández-Luna et al., 2009]
3. 4
Experiments in Computer Science
• Decisions in the CS industry
▫ Based on mere experience
▫ Bias towards or against some technologies
• CS professionals need background
▫ Experimental methods
▫ Critical analysis of experimental studies [IEEE/ACM, 2001]
• IR is a highly experimental discipline [Voorhees, 2002]
• Focus on evaluation of search engine effectiveness
▫ Evaluation experiments not common in IR courses
[Fernández-Luna et al., 2009]
4. 5
Proposal:
Teach IR through Experimentation
• Try to resemble a real-world IR scenario
▫ Relatively large amounts of information
▫ Specific characteristics and needs
▫ Evaluate which techniques are better
• Can we reliably adapt the methodologies of
academic/industrial evaluation workshops?
▫ TREC, INEX, CLEF…
5. 6
Current IR Courses
• Develop IR systems extending frameworks/tools
▫ IR Base [Calado et al., 2007]
▫ Alkaline, Greenstone, SpidersRUs [Chau et al., 2010]
▫ Lucene, Lemur
• Students should focus on the concepts
and not struggle with the tools [Madnani et al., 2008]
6. 7
Current IR Courses (II)
• Develop IR systems from scratch
▫ Build your search engine in 90 days [Chau et al., 2003]
▫ History Places [Hendry, 2007]
• Only possible in technical majors
▫ Software development
▫ Algorithms
▫ Data structures
▫ Database management
• Much more complicated than using other tools
• Much better for understanding the IR techniques and
processes explained
7. 8
Current IR Courses (III)
• Students must learn the Scientific Method
[IEEE/ACM Computing Curricula, 2001]
• Not only get the numbers [Markey et al., 1978]
▫ Where do they come from?
▫ What do they really mean?
▫ Why are they calculated that way?
• Analyze the reliability and validity
• Not usually paid much attention in IR courses
[Fernández-Luna et al., 2009]
8. 9
Current IR Courses (and IV)
• Modern CS industry is changing
▫ Big data
• Real conditions differ widely from class [Braught et al., 2004]
▫ Interesting from the experimental viewpoint
• Finding free collections to teach IR is hard
▫ Copyright
▫ Diversity
• Usually stick to the same collections year after year
▫ Try to make them a product of the course itself
9. 10
Our IR Course
• Senior CS undergraduates
▫ So far: elective, 1 group of ~30 students
▫ From next year: mandatory, 2-3 groups of ~35 students
• Exercise: very simple Web crawler [Urbano et al., 2010]
• Search engine from scratch, in groups of 3-4 students
▫ Module 1: basic indexing and retrieval
▫ Module 2: query expansion
▫ Module 3: Named Entity Recognition
• Skeleton and evaluation framework are provided
• External libraries allowed for certain parts
▫ Stemmer, Wordnet, etc.
• Microsoft .net, C#, SQL Server Express
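For concreteness, a minimal sketch of the kind of "very simple Web crawler" used as the warm-up exercise. The course itself used C#/.NET; this sketch is Python with only the standard library, and every name in it is an illustrative assumption, not the course code.

```python
# Breadth-first crawler sketch: fetch a page, extract its links, enqueue them.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    """Crawl breadth-first from a seed URL; return {url: html}."""
    seen, pages, frontier = {seed}, {}, deque([seed])
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip unreachable or non-decodable pages
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```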
10. 12
Evaluation in Early TREC Editions
Document collection,
depending on the task, domain…
11. 13
Evaluation in Early TREC Editions
Document collection,
depending on the task, domain…
Relevance assessors,
… retired analysts
12. 14
Evaluation in Early TREC Editions
Document collection,
depending on the task, domain…
Candidate
… topics
13. 15
Evaluation in Early TREC Editions
Document collection,
depending on the task, domain…
… Topic difficulty?
14. 16
Evaluation in Early TREC Editions
Document collection,
depending on the task, domain…
…
Organizers
choose final ~50 topics
20. 22
Evaluation in Early TREC Editions
Top 100 results
Depth-100 pool
Depending on overlap,
pool sizes vary (usually ~1/3)
21. 23
Evaluation in Early TREC Editions
Top 100 results
Depth-100 pool
Depending on overlap,
pool sizes vary (usually ~1/3)
What documents
are relevant?
22. 24
Evaluation in Early TREC Editions
Top 100 results
Depth-100 pool
Depending on overlap,
pool sizes vary (usually ~1/3)
What documents
are relevant? Relevance judgments
Organizers
23. 25
Evaluation in Early TREC Editions
Top 100 results
Depth-100 pool
Depending on overlap,
pool sizes vary (usually ~1/3)
What documents
are relevant? Relevance judgments
Organizers
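The depth-100 pooling step illustrated above fits in a few lines. A sketch under assumed data shapes (runs[system][topic] is a ranked list of document ids), not TREC's actual tooling:

```python
def depth_k_pool(runs, topic, k=100):
    """Union of the top-k documents retrieved by each system for one topic."""
    pool = set()
    for by_topic in runs.values():  # one dict of topic -> ranked list per system
        pool.update(by_topic.get(topic, [])[:k])
    return pool
```

Because systems retrieve many of the same documents, the pool is far smaller than systems × k; the ~1/3 figure on the slide is exactly this overlap effect.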
25. 27
Changes for the Course
Document collection,
cannot be too large
Topics
… are restricted
• Different themes = many diverse documents
• Similar topics = fewer documents
• Start with a theme common to all topics
26. 28
Changes for the Course
Write topics
… with a common theme
How do we get relevant documents?
27. 29
Changes for the Course
Write topics
… with a common theme
Try to answer the
topic ourselves
28. 30
Changes for the Course
Write topics
… with a common theme
Try to answer the
topic ourselves
Too difficult?
Too few documents?
29. 31
Changes for the Course
Write topics
… with a common theme
Try to answer the
topic ourselves
Too difficult?
Too few documents? Complete
document collection
Final topics
Crawler
30. 32
Changes for the Course
• Relevance judging would be next
• Before students start developing their systems
▫ Cheating is very tempting: all (one's own) results relevant!
• Reliable pools without the student systems?
• Use other well-known systems: Lucene & Lemur
▫ These are the pooling systems
▫ Different techniques, similar to what we teach
32. 34
Changes for the Course
…
Lucene Lemur Lucene
… Lemur
depth-k pools lead to
very different sizes
Some students would judge many more documents than others
33. 35
Changes for the Course
Topic 001 Topic 002 Topic 003
Topic 004 Topic 005 Topic N
• Instead of depth-k, pool until size-k
▫ Size-100 in our case
▫ About 2 hours to judge, possible to do in class
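A sketch of the size-k variant, with the same assumed data shapes as the pooling sketch earlier: documents are merged one rank level at a time, across all systems, until the pool reaches k. Completing whole rank levels would also explain pools slightly over 100, as reported later.

```python
def size_k_pool(runs, topic, k=100):
    """Merge documents one rank level at a time until the pool has >= k."""
    pool = []
    ranked = [by_topic.get(topic, []) for by_topic in runs.values()]
    depth, max_depth = 0, max((len(rl) for rl in ranked), default=0)
    while len(pool) < k and depth < max_depth:
        for rl in ranked:
            if depth < len(rl) and rl[depth] not in pool:
                pool.append(rl[depth])
        depth += 1  # finish the whole rank level before checking size again
    return pool
```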
34. 36
Changes for the Course
• Pooling systems might leave relevant material out
▫ Include Google’s top 10 results in each pool
When choosing topics we already knew some of these were relevant
• Students might do the judgments carelessly
▫ Crawl 2 noise topics with unrelated queries
▫ Add 10 noise documents to each pool and check
• Now keep pooling documents until size-100
▫ The union of the pools forms the biased collection
This is the one we gave students
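The carelessness check itself is trivial to express. A sketch with assumed shapes (judgments maps each assessor to their graded documents; a grade above 0 means relevant):

```python
def careless_assessors(judgments, noise_docs):
    """Assessors who marked any planted noise document as relevant."""
    return [assessor for assessor, graded in judgments.items()
            if any(graded.get(doc, 0) > 0 for doc in noise_docs)]
```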
35. 37
2010 Test Collection
• 32 students, 5 faculty
• Theme: Computing
▫ 20 topics (plus 2 for noise)
17 judged by two people
▫ 9,769 documents (735MB) in complete collection
▫ 1,967 documents (161MB) in biased collection
• Pool sizes between 100 and 105
▫ Average pool depth 27.5, from 15 to 58
36. 38
Evaluation
• With results and judgments, rank student systems
• Compute mean NDCG scores [Järvelin et al., 2002] (see the sketch below)
▫ Are relevant documents properly ranked?
▫ Ranges from 0 (useless) to 1 (perfect)
• The group with the best system is rewarded
with an extra point (over 10) in the final grade
▫ About half the students were really motivated by this
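As referenced above, a minimal sketch of NDCG@100. The slides do not spell out the gain/discount variant used, so this assumes the common formulation (gain = relevance grade, discount = log2(rank + 1)):

```python
from math import log2

def ndcg(ranking, qrels, k=100):
    """NDCG@k for one topic; qrels maps doc id -> relevance grade.
    Unjudged documents count as non-relevant."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mean_ndcg(run, qrels_by_topic, k=100):
    """Average NDCG@k over all topics, the score used to rank systems."""
    scores = [ndcg(run[t], q, k) for t, q in qrels_by_topic.items()]
    return sum(scores) / len(scores)
```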
37. 39
Is This Evaluation Method Reliable?
• These evaluations have problems [Voorhees, 2002]
▫ Inconsistency of judgments [Voorhees, 2000]
People have different notions of «relevance»
Same document judged differently
▫ Incompleteness of judgments [Zobel, 1998]
Documents not in the pool are considered not relevant
What if some of them were actually relevant?
• TREC ad hoc was quite reliable
▫ Many topics
▫ Great man-power
38. 40
Assessor Agreement
• 17 topics were judged by 2 people
▫ Mean Cohen’s Kappa = 0.417
From 0.096 to 0.735
▫ Mean precision-recall = 0.657 / 0.638
From 0.111 to 1
• For TREC, observed 0.65 precision at 0.65 recall
[Voorhees, 2000]
• Only 1 noise document was judged relevant
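A sketch of the agreement computation: Cohen's kappa between the two assessors of a topic, over the documents both judged. Argument shapes are assumptions (each maps doc id to a relevance grade):

```python
def cohens_kappa(j1, j2):
    """Cohen's kappa between two assessors' judgments of the same topic."""
    docs = set(j1) & set(j2)
    n = len(docs)
    if n == 0:
        return float("nan")  # no common documents to compare
    labels = {j1[d] for d in docs} | {j2[d] for d in docs}
    observed = sum(j1[d] == j2[d] for d in docs) / n  # raw agreement
    expected = sum(  # agreement expected by chance, from the marginals
        (sum(j1[d] == c for d in docs) / n) *
        (sum(j2[d] == c for d in docs) / n)
        for c in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```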
39. 41
Inconsistency &
System Performance
[Figure: mean NDCG@100 of each student system, roughly between 0.2 and 0.8, averaged over 2,000 trels]
40. 42
Inconsistency &
System Performance (and II)
• Take 2,000 random samples of assessors
▫ Each would have been possible with a single assessor
▫ NDCG differences between 0.075 and 0.146
• 1.5 times larger than in TREC [Voorhees, 2000]
▫ Inexperienced assessors
▫ 3-point relevance scale
▫ Larger values to begin with! (x2)
Percentage differences are lower
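A sketch of how the 2,000 alternative judgment sets ("trels") can be drawn: for each dually judged topic, pick one of its two assessors at random. Shapes are assumptions (dual_judgments maps a topic to a pair of qrels dicts):

```python
import random

def sample_trels(dual_judgments, n_samples=2000, seed=0):
    """Each sample is a judgment set that a single-assessor setup
    could have produced: one assessor drawn at random per topic."""
    rng = random.Random(seed)
    return [{topic: rng.choice(pair)
             for topic, pair in dual_judgments.items()}
            for _ in range(n_samples)]
```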
41. 43
Inconsistency & System Ranking
• Absolute numbers do not tell much, rankings do
▫ Absolute scores are highly dependent on the collection
• Compute correlations with 2,000 random samples
▫ Mean Kendall’s tau = 0.926
From 0.811 to 1
• For TREC, observed 0.938 [Voorhees, 2000]
• No swap was significant (Wilcoxon, α=0.05)
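A sketch of the ranking comparison: a plain O(n²) Kendall's tau between the system orderings produced by two judgment sets; for a couple dozen student systems this is plenty fast. Ties count toward neither direction in this version:

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same set of systems,
    each given as a list of system ids from best to worst."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    systems = list(pos_a)
    concordant = discordant = 0
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            s, t = systems[i], systems[j]
            product = (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / pairs
```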
42. 44
Incompleteness &
System Performance
[Figure: increment in mean NDCG@100 (%) as pool size grows from 30 to 100, showing both the mean difference over 2,000 trels and the estimated difference]
43. 45
Incompleteness &
System Performance (and II)
• Start with size-20 pools and keep adding 5 more
▫ 20 to 25: NDCG variations between 14% and 37%
▫ 95 to 100: between 0.4% and 1.6%, mean 1%
• In TREC editions, between 0.5% and 3.5% [Zobel, 1998]
▫ Even some observations of 19%!
• We achieve these levels for sizes 60-65
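The pool-growth analysis, sketched: restrict the judgments to growing prefixes of the pool and re-score at each step. This reuses the ndcg function from the earlier sketch; pool_order is an assumed list of doc ids in the order they entered the pool:

```python
def truncated_qrels(qrels, pool_order, size):
    """Keep only the judgments for the first `size` pooled documents."""
    kept = set(pool_order[:size])
    return {doc: g for doc, g in qrels.items() if doc in kept}

def ndcg_variations(ranking, qrels, pool_order, k=100):
    """Relative NDCG@k change at each size-5 growth step (20, 25, ..., 100)."""
    sizes = range(20, 101, 5)
    scores = [ndcg(ranking, truncated_qrels(qrels, pool_order, s), k)
              for s in sizes]
    return [abs(b - a) / a for a, b in zip(scores, scores[1:]) if a > 0]
```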
44. 46
Incompleteness & Effort
• Extrapolate to larger pools
▫ If differences decrease a lot, judge a little more
▫ If not, judge fewer documents but more topics
• 135 documents for mean NDCG differences <0.5%
▫ Not really worth the +35% effort
45. 47
Student Perception
• Brings more work both for faculty and students
• More technical problems than previous years
▫ Survey scores dropped from 3.27 to 2.82 (out of 5)
▫ Expected, as they had to work considerably more
▫ Possible gaps in (our) CS program?
• Same satisfaction scores as previous years
▫ Very appealing for us
▫ This was our first year, lots of logistics problems
46. 48
Conclusions & Future Work
• IR deserves more attention in CS education
• Laboratory experiments deserve it too
• Focus on the experimental nature of IR
▫ But give a wider perspective useful beyond IR
• Much more interesting and instructive
• Adapt well-known methodologies in
industry/academia to our limited resources
47. 49
Conclusions & Future Work (II)
• Improve the methodology
▫ Explore quality control techniques in crowdsourcing
• Integrate with other courses
▫ Computer Science
▫ Library Science
• Make everything public, free
▫ Test collections and reports
▫ http://ir.kr.inf.uc3m.es/eirex/
48. 50
Conclusions & Future Work (and III)
• We call all this EIREX
▫ Information Retrieval Education
through Experimentation
• We would like others to participate
▫ Use it in your class
Collections
The methodology
Tools
49.
50. 52
2011 Semester
• Only 19 students
• The 2010 collection used for testing
▫ Every year there will be more variety to choose from
• 2011 collection: Crowdsourcing
▫ 23 topics, better described
Created by the students
Judgments by 1 student
▫ 13,245 documents (952MB) in complete collection
▫ 2,088 documents (96MB) in biased collection
• Overview report coming soon!
51. 53
Can we reliably adapt the large-scale methodologies
of academic/industrial IR evaluation workshops?
YES (we can)