This document provides an introduction to information retrieval systems and their main components. It discusses how IR systems aim to find relevant documents from a large collection in response to a user's information need. The key processes involved are document indexing to represent contents, query formulation, retrieval of relevant documents, and system evaluation. Indexing involves selecting important keywords from documents and assigning them weights. Various retrieval models are described for comparing document and query representations, such as vector space and probabilistic models. The document also discusses challenges in document representation, query processing, and evaluating system effectiveness.
This document discusses vector space retrieval models. It describes how documents and queries are represented as vectors in a common vector space based on terms. Terms are weighted using metrics like term frequency (TF) and inverse document frequency (IDF) to determine importance. The cosine similarity measure is used to calculate similarity between document and query vectors and rank results by relevance. While simple and effective in practice, vector space models have limitations like missing semantic and syntactic information.
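The TF-IDF weighting and cosine ranking described above can be sketched in Python; the function names and the tiny corpus below are illustrative, not taken from the summarized slides:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (dict) for each tokenized document."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency in this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

Ranking a query then amounts to building a vector for it the same way and sorting documents by cosine score.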
The document discusses algorithm complexity and data structure efficiency, explaining that algorithm complexity can be measured using asymptotic notation like O(n) or O(n^2) to represent operations scaling linearly or quadratically with input size, and that different data structures have varying time efficiency for operations like add, find, and delete.
This document summarizes key concepts in information retrieval systems and algorithms for large data sets. It discusses the differences between information retrieval and data retrieval systems. It also describes several classic models for relevance ranking in IR, including the Boolean model and vector space model. The document outlines topics like text processing, indexing, searching, and evaluation in information retrieval systems.
This document provides an overview of using latent semantic analysis (LSA) and the R programming language for language technology enhanced learning applications. It describes using LSA to create a semantic space to compare documents and evaluate student writings. It also demonstrates clustering terms based on their semantic similarity and visualizing networks in R. Evaluation results show LSA machine scores for essay quality had a Spearman's rank correlation of 0.687 with human scores, outperforming a pure vector space model.
Introduction to data structures and complexity.pptx - PJS KUMAR
The document discusses data structures and algorithms. It defines data structures as the logical organization of data and describes common linear and nonlinear structures like arrays and trees. It explains that the choice of data structure depends on accurately representing real-world relationships while allowing effective processing. Key data structure operations are also outlined like traversing, searching, inserting, deleting, sorting, and merging. The document then defines algorithms as step-by-step instructions to solve problems and analyzes the complexity of algorithms in terms of time and space. Sub-algorithms and their use are also covered.
The document discusses two main types of retrieval models: Boolean models which use set theory and vector space models which use statistical and algebraic approaches. Vector space models represent documents and queries as vectors of keywords weighted by factors like term frequency and inverse document frequency. Similarity between document and query vectors is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
This document provides an overview of the Introduction to Algorithms course, including the course modules and motivating problems. It introduces the Document Distance problem, which aims to define metrics to measure the similarity between documents based on word frequencies. It discusses an initial Python program ("docdist1.py") to calculate document distance that runs inefficiently due to quadratic time list concatenation. Profiling identifies this as the bottleneck. The solution is to use list extension, resulting in "docdist3.py". Further optimizations include using a dictionary to count word frequencies in constant time, creating "docdist4.py". The document outlines remaining opportunities like improving the word extraction and sorting algorithms.
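The list-concatenation bottleneck and the dictionary-based counting fix that the summary describes can be sketched as follows (a simplified illustration, not the actual `docdist` code):

```python
from collections import defaultdict

def words_by_concatenation(lines):
    """O(n^2): `result = result + line` copies the whole list each time."""
    result = []
    for line in lines:
        result = result + line.split()
    return result

def words_by_extension(lines):
    """O(n): extend appends in place, amortized constant time per word."""
    result = []
    for line in lines:
        result.extend(line.split())
    return result

def word_frequencies(words):
    """Dictionary counting: one O(1) average-time update per word."""
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)
```

Both word-extraction functions return the same list; only their scaling behavior differs, which is exactly what profiling exposes.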
19. Data Structures and Algorithm Complexity - Intro C# Book
In this chapter we will compare the data structures we have learned so far by the performance (execution speed) of their basic operations (addition, search, deletion, etc.). We will give specific tips on which data structures to use in which situations, and explain how to choose between structures like hash tables, arrays, dynamic arrays, and sets implemented with hash tables or balanced trees. Almost all of these structures are implemented as part of the .NET Framework, so to write efficient and reliable code we have to learn to apply the most appropriate structure in each situation.
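The trade-off described in this chapter is language-independent; a minimal Python sketch of the same idea, contrasting membership testing in a list (linear scan) with a hash-based set:

```python
def has_duplicates_list(items):
    """Linear scan per lookup: O(n) per item, O(n^2) overall."""
    seen = []
    for x in items:
        if x in seen:          # scans the whole list
            return True
        seen.append(x)
    return False

def has_duplicates_set(items):
    """Hash lookup per item: O(1) average, O(n) overall."""
    seen = set()
    for x in items:
        if x in seen:          # hashes directly to the element
            return True
        seen.add(x)
    return False
```

Both return the same answer; the set version is the one that stays fast as the input grows.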
The document discusses information retrieval (IR) models, including the Boolean, vector space, and probabilistic models. The Boolean model represents documents and queries as sets of index terms and determines relevance through binary term presence, while the vector space model represents documents and queries as weighted vectors in a multidimensional space and ranks documents by calculating similarity between document and query vectors. The probabilistic model determines relevance probabilities based on the likelihood of terms appearing in relevant vs. non-relevant documents.
The document discusses various aspects of term weighting which is important for text retrieval systems, including term frequency, document frequency, inverse document frequency, and how they are used to calculate TF-IDF weights for terms. It also covers stoplists, stemming, and the bag-of-words model which represents text as vectors of word occurrences without considering word order. Term weighting schemes play a major role in the similarity measures used by information retrieval systems to determine document relevance.
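A minimal sketch of the stoplist, bag-of-words, and TF-IDF ideas the summary mentions (the stoplist and helper names are illustrative, and the TF-IDF helper assumes the term occurs in at least one document):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}   # illustrative stoplist

def bag_of_words(text):
    """Lowercase, drop stopwords, count occurrences; word order is discarded."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def tf_idf(term, bag, all_bags):
    """TF-IDF weight of a term in one document's bag of words."""
    df = sum(1 for b in all_bags if term in b)   # document frequency
    return bag[term] * math.log(len(all_bags) / df)
```

A term appearing in every document gets IDF log(1) = 0, which is how the scheme discounts uninformative words.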
A fast-paced introduction to Deep Learning that starts with a simple yet complete neural network (no frameworks), followed by an overview of activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Next we'll create a neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. For best results, familiarity with basic vectors and matrices, inner (aka "dot") products of vectors, and rudimentary Python is definitely helpful.
This document discusses various techniques for optimizing Python code, including:
1. Using the right algorithms and data structures to minimize time complexity, such as choosing lists, sets or dictionaries based on needed functionality.
2. Leveraging Python-specific optimizations like string concatenation, lookups, loops and imports.
3. Profiling code with tools like timeit, cProfile and visualizers to identify bottlenecks before optimizing.
4. Optimizing only after validating a performance need and starting with general strategies before rewriting hotspots in Python or other languages. Premature optimization can complicate code.
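Point 3 can be illustrated with the standard-library `timeit` module; the two snippets compared here (string concatenation vs. `str.join`) are examples, and absolute timings will vary by machine and interpreter:

```python
import timeit

# Time two candidate implementations before deciding what to optimize.
setup = "words = ['token'] * 1000"
concat_stmt = "s = ''\nfor w in words:\n    s += w"
join_stmt = "s = ''.join(words)"

t_concat = timeit.timeit(concat_stmt, setup=setup, number=100)
t_join = timeit.timeit(join_stmt, setup=setup, number=100)
```

For whole programs, `python -m cProfile script.py` gives a per-function breakdown to locate the hotspot first.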
19. Java Data Structures, Algorithms and Complexity - Intro C# Book
In this chapter we will compare the data structures we have learned so far by the performance (execution speed) of their basic operations (addition, search, deletion, etc.). We will give specific tips on which data structures to use in which situations.
The document discusses the basics of information retrieval systems. It covers two main stages - indexing and retrieval. In the indexing stage, documents are preprocessed and stored in an index. In retrieval, queries are issued and the index is accessed to find relevant documents. The document then discusses several models for defining relevance between documents and queries, including the Boolean model and vector space model. It also covers techniques for representing documents and queries as vectors and calculating similarity between them.
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions, back propagation, and then a quick dive into CNNs, followed by a Keras code sample for defining a CNN. Basic knowledge of vectors, matrices, and derivatives is helpful in order to derive the maximum benefit from this session. Then we'll see a short introduction to TensorFlow 1.x and some insights into TF 2 that will be released some time this year.
ON RUN-LENGTH-CONSTRAINED BINARY SEQUENCES - ijitjournal
A class of binary sequences, constrained with respect to the length of zero runs, is considered.
For such sequences, termed (d, k)-sequences, new combinatorial and computational results
are established. Explicit expressions for enumerating (d, k)-sequences of finite length are
obtained. Efficient computational procedures for calculating the capacity of a (d, k)-code are
given. A simple method for constructing a near-optimal (d, k)-code is proposed. Illustrative
numerical examples demonstrate further the theoretical results.
An introduction to Deep Learning concepts, with a simple yet complete neural network, CNNs, followed by rudimentary concepts of Keras and TensorFlow, and some simple code fragments.
This document discusses various similarity measures that can be used to quantify the similarity between documents, queries, or a document and query in an information retrieval system. It describes classic measures like Dice coefficient, overlap coefficient, Jaccard coefficient, and cosine coefficient. It provides examples of calculating these measures and compares the relations between different measures. The document also discusses using term-document matrices and shows an example matrix.
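The four coefficients named above can be written directly against sets of terms; a hedged sketch, treating each document (or query) as its set of distinct terms:

```python
import math

def dice(a, b):
    """Dice coefficient: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard coefficient: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)

def overlap(a, b):
    """Overlap coefficient: |A∩B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

def cosine_coeff(a, b):
    """Cosine coefficient for binary term vectors: |A∩B| / sqrt(|A||B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))
```

All four score the same intersection but normalize differently, which is what the comparison of measures comes down to.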
This fast-paced session starts with an introduction to neural networks and linear regression models, along with a quick view of TensorFlow, followed by some Scala APIs for TensorFlow. You'll also see a simple dockerized image of Scala and TensorFlow code and how to execute the code in that image from the command line. No prior knowledge of NNs, Keras, or TensorFlow is required (but you must be comfortable with Scala).
1. Hash tables are good for random access of elements but not sequential access. When records need to be accessed sequentially, hashing can be problematic because elements are stored in random locations instead of consecutively.
2. To find the successor of a node in a binary search tree when the node has a right child, we move to the right child and then follow left children to the leftmost node of that subtree. This operation runs in O(h) time, where h is the height of the tree.
3. When comparing operations like insertion, deletion, and searching across data structures: sorted arrays offer fast searching (binary search) but slow insertion and deletion, linked lists allow easy insertion and deletion at a known position but require linear-time searching, and binary search trees fall between the two, with O(log n) average time for all three operations.
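A Python sketch of the in-order successor operation referenced in point 2: when a node has a right child, the successor is the leftmost node of that right subtree, so the walk costs O(h) in a tree of height h rather than constant time:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def successor_in_subtree(node):
    """In-order successor of `node` when it has a right child:
    step right once, then follow left links as far as possible."""
    cur = node.right
    if cur is None:
        return None          # successor (if any) is an ancestor instead
    while cur.left is not None:
        cur = cur.left
    return cur
```

In the tree rooted at 5 with right subtree {8, 6, 9}, the successor of 5 is 6, not the right child 8.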
The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.
Elasticsearch document summarization:
- Elasticsearch calculates relevance scores (scores queries against documents) using a complex algorithm involving term frequency, inverse document frequency, field length norm, and other factors.
- It analyzes queries and documents, finds matching documents, then scores each match based on the algorithm. Higher scores indicate more relevant matches.
- Explain API and scoring details can show how Elasticsearch calculates relevance for specific queries and documents, breaking down factors like TF, IDF, coordination, etc. This helps understand and optimize search effectiveness.
The document discusses using clustering techniques like K-means, LDA, and PAM to analyze topics in a large dataset of Wikipedia documents. It explores preprocessing steps, compares different clustering algorithms, and analyzes the results. K-means identified around 250 topics using the elbow method. LDA was able to identify coherent topics based on word co-occurrence. PAM using bigrams found some meaningful word pairs but the clusters did not separate well. The techniques revealed topics related to music, politics, war and more.
This document provides an overview of algorithm analysis and asymptotic notation. It discusses analyzing algorithms based on problem size and using Big-O notation to characterize runtime. Specifically, it introduces the concepts of best, worst, and average case analysis. It also covers properties of Big-O, like how operations combine asymptotically. Examples analyze the runtime of prefix averages algorithms and solving recursive equations using repeated substitution or telescoping. Finally, it discusses abstract data types and how to design new data types through specification, application, and implementation.
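The prefix-averages example mentioned above is a classic illustration of the gap between O(n^2) and O(n); a sketch of both versions:

```python
def prefix_averages_quadratic(xs):
    """A[i] = average of xs[0..i]; recomputes each prefix sum -> O(n^2)."""
    return [sum(xs[: i + 1]) / (i + 1) for i in range(len(xs))]

def prefix_averages_linear(xs):
    """Carry a running sum forward -> O(n)."""
    out, running = [], 0.0
    for i, x in enumerate(xs):
        running += x
        out.append(running / (i + 1))
    return out
```

Both produce identical output; only the quadratic version redoes work the linear version remembers.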
The document discusses various techniques for information retrieval and language modeling approaches to IR, including:
- Clustering documents into similar groups to aid in retrieval
- Using term frequency-inverse document frequency (TF-IDF) to measure word importance in documents
- Language models that represent documents and queries as probability distributions over words
- Smoothing language models to address data sparsity issues
- Cluster-based scoring methods that incorporate information from query-relevant document clusters
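The smoothing idea in the list above can be sketched with Jelinek-Mercer interpolation (the λ value is arbitrary, and the sketch assumes every query term occurs somewhere in the collection):

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Score a document by the smoothed probability of generating the query.
    Mixing in the collection model gives unseen terms nonzero probability."""
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / dlen        # maximum-likelihood doc model
        p_coll = coll_counts[w] / clen      # background collection model
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score
```

Documents are then ranked by this score, higher being more likely to have generated the query.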
This document provides an overview of the Python programming language. It discusses that Python is a popular, object-oriented scripting language that emphasizes code readability. The document summarizes key Python features such as rapid development, automatic memory management, object-oriented programming, and embedding/extending with C. It also outlines common uses of Python and when it may not be suitable.
This document summarizes the key issues with the traditional waterfall software development model based on analyses from the mid-1990s. It discusses that while the theory behind waterfall is sound, in practice it often led to: 1) protracted integration and late design breakages due to unforeseen issues emerging late in the process, 2) late risk resolution focusing too much on early paper artifacts, 3) requirements-driven decomposition ignoring emerging needs, 4) adversarial stakeholder relationships focusing on documents not collaboration, and 5) over-focus on documents and reviews not iterative development. Overall, less than 20% of projects succeeded under this model.
The document discusses various sorting algorithms including insertion sort, quicksort, merge sort, and their time complexities. Insertion sort has worst-case time complexity of O(n^2) but works well for small lists. Quicksort and merge sort have average time complexity of O(nlogn). Merge sort uses additional storage space while quicksort may have worst-case time of O(n^2) if the pivot choice is poor.
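A compact merge sort illustrating the O(n log n) time and O(n) auxiliary space mentioned above:

```python
def merge_sort(xs):
    """Recursively split, then merge two sorted halves; O(n log n) time,
    O(n) extra space for the merge buffer."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Unlike quicksort, the split point never depends on the data, so there is no O(n^2) worst case from a poor pivot.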
This document provides an overview of databases and database management systems (DBMS). It discusses the history of databases, from early file-based systems to hierarchical, network, and relational models. Key topics covered include the definition of a database, components of a DBMS like SQL and data dictionaries, the roles involved in database administration, and advantages/limitations of DBMS. The document concludes with an assignment asking students to review the chapter, read an appendix, and submit a group list.
The document provides information about SQL and PL/SQL. It discusses SQL, which is a standard language for database manipulation. It allows users to create, update, retrieve, and delete data from databases. The document also describes SQL history, characteristics, advantages, datatypes, and commands including DDL, DML, DCL, and TCL. It then discusses MySQL, its features, datatypes, and how to install and connect to MySQL.
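The DDL/DML distinction can be illustrated with Python's built-in `sqlite3` module (standard SQL rather than MySQL, so flavor-specific details differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema.
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

# DML: insert, update, and query rows.
cur.execute("INSERT INTO student (name) VALUES (?)", ("Alice",))
cur.execute("INSERT INTO student (name) VALUES (?)", ("Bob",))
cur.execute("UPDATE student SET name = ? WHERE name = ?", ("Bobby", "Bob"))
rows = cur.execute("SELECT name FROM student ORDER BY id").fetchall()
conn.commit()
```

DCL (GRANT/REVOKE) and TCL (COMMIT/ROLLBACK) round out the command families; the `commit()` call above is the TCL step.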
This document summarizes key concepts from Chapter 2 of the textbook "Database System Concepts" by Silberschatz, Korth and Sudarshan. It introduces the relational model, including the structure of relational databases, relational algebra operations, null values, and modification of databases. Key concepts covered include relations, tuples, relation schemas, keys, and the basic relational algebra operations of select, project, join, union, difference and rename. An example of a banking database with relations for branches, customers, accounts and loans is also provided.
The document discusses project planning and management. It covers topics like process planning, effort estimation, schedule and resource estimation, quality planning, and risk management. Effective project management is key to successfully executing projects on time and within budget. Project planning involves creating detailed schedules, estimating efforts, defining quality objectives, and identifying and mitigating risks. Estimation models like COCOMO are used to estimate effort based on parameters like project size. Milestones are determined based on effort distribution and manpower ramp-up over time.
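The COCOMO estimate mentioned above can be sketched with the Basic COCOMO formula, effort = a * KLOC^b, using the published coefficients for the three project modes:

```python
def cocomo_basic_effort(kloc, mode="organic"):
    """Basic COCOMO effort estimate in person-months: a * KLOC**b.
    Coefficients are the standard Basic COCOMO values per mode."""
    params = {"organic": (2.4, 1.05),
              "semidetached": (3.0, 1.12),
              "embedded": (3.6, 1.20)}
    a, b = params[mode]
    return a * kloc ** b
```

Because b > 1, effort grows faster than linearly with size, which is why large embedded projects are disproportionately expensive.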
The document discusses file input/output (I/O) in Java. It covers:
1) Java's I/O system of readers, writers, and streams for reading from and writing to files.
2) Exceptions in file I/O and how to handle errors using try/catch blocks.
3) Examples of reading text and binary data from files, writing data to files, and scanning files for specific data.
The document introduces information retrieval and describes how an inverted index works as the key data structure for modern IR systems. An inverted index stores for each term a list of all documents that contain the term. It allows efficient processing of Boolean queries by merging the postings lists of query terms. Query processing aims to optimize the order of processing terms based on their document frequencies to minimize the size of intermediate results.
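A minimal sketch of an inverted index and the sorted-postings merge used for a Boolean AND query (identifier names are illustrative):

```python
def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    return index   # postings stay sorted because doc IDs are visited in order

def intersect_postings(p1, p2):
    """Merge two sorted postings lists: linear in their combined length."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

Processing the rarest term first keeps intermediate results small, which is the query-optimization point the summary makes.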
This document introduces object-oriented programming (OOP) by explaining the differences between structured and OOP, defining key OOP terminology like objects, classes, methods, and attributes. It describes the four main design principles of OOP - encapsulation, abstraction, polymorphism, and inheritance. Encapsulation hides implementation details and abstraction focuses on important facts. Polymorphism allows the same word to have different meanings. Inheritance allows classes to inherit attributes and methods from superclasses to subclasses. Popular OOP languages include Java, C++, and Smalltalk.
The document discusses the Scanner class in Java, which is used to get user input from the keyboard. It describes several methods of the Scanner class like nextInt(), nextFloat(), and nextLine() that can be used to read integer, float, and string values from the user. An example program is given that uses the Scanner class to take integer, float, and string inputs from the user and display them.
Project evaluation is the process of measuring the success of a project through gathering data and using evaluation methods. It allows identifying performance improvements and keeping stakeholders updated. Project evaluation criteria consider factors like time, cost, scope, and quality. There are various project evaluation methods including pre-project, ongoing, and post-project evaluation. Project appraisal involves a detailed evaluation of a project's political, social, environmental, technical, financial, and economic feasibility to determine its viability. It helps decide whether to accept or reject a project.
This document discusses project evaluation techniques including strategic assessment, technical assessment, cost benefit analysis, cashflow forecasting, and risk evaluation. It provides details on each technique. Strategic assessment evaluates how well a project aligns with organizational goals and strategies. Technical assessment considers the functionality of a project. Cost benefit analysis compares projected costs and benefits in monetary terms. Cashflow forecasting estimates costs and benefits over time. Risk evaluation examines potential risks of a project. The document also discusses challenges in project monitoring and evaluation.
The document discusses object-oriented programming concepts including objects, classes, message passing, abstraction, encapsulation, inheritance, polymorphism, and dynamic binding. It provides examples and definitions for each concept. It also discusses how to represent real-world entities like a person or place as objects with states (attributes and values) and behaviors (methods). Classes are defined as blueprints that specify common properties and functionality for objects. The relationships between classes and objects are demonstrated.
Wireless communication technologies have evolved from Guglielmo Marconi's early radio demonstrations in 1897. In the 1960s-1970s, Bell Laboratories developed the cellular concept, which enabled wireless communication networks to serve entire populations. This led to the development of cellular mobile systems using radio frequency technology. Cellular systems use a hexagonal cell structure and frequency reuse to improve spectrum efficiency and service capacity. They employ technologies such as handoff, dynamic channel assignment, and prioritization of handoffs to manage calls as users move between cells.
The document introduces software project management. It discusses that software projects are a type of project management that faces unique challenges due to the invisible nature and complexity of software. Successful project management requires setting clear and measurable objectives, thorough planning, and active monitoring and control to adapt to inevitable changes. Communication between stakeholders is essential throughout the project life cycle.
This document discusses conventional software project management. It outlines key attributes and players in a project, typical expenditures by phase for a conventional project using the waterfall model, and the need for project management. Project management is defined as managing and controlling project activities, while operation management focuses on running ongoing business operations. The differences between project and operation management are explained. Finally, the document outlines some activities covered under software project management.
This document discusses conventional software project management. It outlines key attributes and players in a project, typical expenditures by phase for a conventional project using the waterfall model, and the need for project management. Project management is defined as managing and controlling project activities, while operation management focuses on running ongoing business operations. The differences between project and operation management are explained. Finally, the document outlines some activities covered under software project management.
The document discusses various searching and sorting algorithms. It begins by describing sequential search on an unordered file, then covers the differences between searching ordered and unordered lists. It introduces binary search as a faster search method for ordered lists. The document explains how binary search works and analyzes its logarithmic time complexity. Finally, it briefly introduces common sorting algorithms like bubble sort and insertion sort to sort data before applying faster search methods.
Gabriel Kalembo A Rising Star in the World of Football Coachinggabrielkalembous
Gabriel Kalembo is a player's coach who connects with his teams on a deep level. With a strong background in sports science and a passion for the game, Kalembo has developed a unique coaching philosophy that emphasizes player development and tactical flexibility. His ability to connect with players and create a positive team culture has led to success at every level he has coached.
According to the report, the consumption of video content related to IPL 2024 has seen significant growth, nearly 3 times more than the previous season, reflecting an increasing interest of fans.
Understanding Golf Simulator Equipment A Beginner's Guide.pdfMy Garage Golf
Dive into golf simulation with our beginner's guide, perfect for anyone new to the concept. Understand the critical components like sturdy frames, high-quality impact screens, and side netting that ensure your safety and enrich your practice sessions. Learn the benefits of proper projector mounts and compatibility with your existing setup. This guide helps you make informed choices, transforming your home into a realistic and effective golfing practice environment.
For More Information-: https://mygaragegolf.com/shop
Match By Match Detailed Schedule Of The ICC Men's T20 World Cup 2024.pdfmouthhunt5
20 Teams, One Trophy: What to Expect from the ICC Men's T20 World Cup 2024
The ICC Men's T20 World Cup 2024 is set to be an exciting event, co-hosted by the West Indies and the USA from June 1 to June 29, 2024. This edition of the tournament will feature a record 20 teams divided into four groups, competing across 55 matches for the prestigious title.
Luciano Spalletti Leads Italy's Transition at UEFA Euro 2024.docxEuro Cup 2024 Tickets
Italy are the defending European champs, but after Luciano Spalletti swapped Roberto Mancini last September, they are still taking the cautious first steps of a new era
Spain vs Croatia Euro 2024 Spain's Chance to Shine on the International Stage...Eticketing.co
Euro 2024 fans worldwide can book Spain vs Croatia Tickets from our online platform www.eticketing.co. Fans can book Euro Cup Germany Tickets on our website at discounted prices.
Netherlands vs Austria Netherlands Face Familiar Foes in Euro Cup Germany Gro...Eticketing.co
The Netherlands are in Group D in Euro Cup Germany - and, unpaid to this, they will be coming up against familiar foes. Remarkably, they have played France, who have fashioned some of the greatest players of all time, 30 times throughout history. Despite France being more effective in major competitions, including captivating the World Cup in 2018, Holland have the greater head-to-head record.
We offer Euro Cup Tickets to admirers who can get Netherlands vs Austria Tickets through our trusted online ticketing marketplace. Eticketing.co is the most reliable source for booking Euro Cup Final Tickets. Sign up for the latest Euro Cup Germany Ticket alert.
UEFA Euro 2024 Tickets | Euro 2024 Tickets | Netherlands vs Austria Tickets
However, in 2023, they played one another twice, with France endearing both matches 4-0 and 2-1 individually. Against Poland and Austria, the Netherlands also have a stout record, winning just under half the matches. They faced Austria at Euro 2020, engaging 2-0, and they haven't lost to Poland since 1979.
The lettering is on the wall for Holland to qualify for the knockouts, but nothing is failsafe. The Netherlands kickstart their Euros campaign against Poland on Sunday, June 16th. In Hamburg, they will have to go up against one of the best strikers in the world, Robert Lewandowski.
Netherlands vs Austria: Tough Challenges Await the Netherlands in Euro Cup Germany
Five days later, they travel south to face France in Leipzig, a side led by Kylian Mbappe - one of the finest players in the world currently and one of the most impressive players in his nation's history. To conclude, they face Austria in Berlin, knowing it could be the end of the road if they don't perform.
Ronald Koeman is widely considered one of the more successful Dutch managers in Premier League history, considering the nation has a reputation for struggling to replicate their talents in England. The former Everton manager went against that script and shone — and now he is back managing his nation.
UEFA Euro 2024 Tickets | Euro 2024 Tickets | Euro Cup Germany Tickets | Netherlands vs Austria Tickets
Euro fans worldwide can book Euro Cup Germany Tickets from our online platform, www.eticketing.co. Fans can book Euro Cup 2024 Tickets on our website at discounted prices.
Netherlands vs Austria: Ronald Koeman's Tactical Approach For UEFA Euro 2024
As well as being the highest-scoring defender in history, Koeman is a man with immense tactical knowledge. He returned to manage Holland at the start of 2023 after it was announced Louis van Gaal would retire. His life back in the dugout with the team wasn't easy, as he lost his first match 4-0 to France after going 3-0 down within 21 minutes.
However, he eventually helped them qualify for Euro Cup Germany. The 61-year-old likes to organize his team with a defensive mindset. Some might call it pragmatic as he defends with minimal space between the lines, but that's often needed for international football.
Italy vs Albania Soul and sacrifice' are the keys to success for Albania at E...Eticketing.co
We offer UEFA Euro 2024 Tickets to admirers who can get Italy vs Albania Tickets through our trusted online ticketing marketplace. Eticketing. co is the most reliable source for booking Euro Cup Final Tickets. Sign up for the latest Euro Cup Germany Ticket alert.
Boletin de la I Copa Panamericana de Voleibol Femenino U17 Guatemala 2024Judith Chuquipul
holaesungusto.- Boletín final de la I Copa Panamericana de Voleibol Femenino U17 - Ciudad de Guatemala 2024 que se realizó del 27 de mayo al 01 de julio, en el Domo Polideportivo Zona 13.
Fuente: norceca.net
Belgium vs Romania Ultimate Guide to Euro Cup 2024 Tactics, Ticketing, and Qu...Eticketing.co
Euro Cup 2024 fans worldwide can book Belgium vs Romania Tickets from our online platform www.eticketing.co. Fans can book Euro Cup Germany Tickets on our website at discounted prices.
Croatia vs Italy Modric's Last Dance Croatia's UEFA Euro 2024 Journey and Ita...Eticketing.co
UEFA Euro 2024 fans worldwide can book Croatia vs Italy Tickets from our online platform www.eticketing.co. Fans can book Euro Cup Germany Tickets on our website at discounted prices.
Belgium vs Slovakia Belgium Euro 2024 Golden Generation Faces Euro Cup Final ...Eticketing.co
We offer Euro Cup Tickets to admirers who can get Belgium vs Slovakia Tickets through our trusted online ticketing marketplace. Eticketing.co is the most reliable source for booking Euro Cup Final Tickets. Sign up for the latest Euro Cup Germany Ticket alert.
Hesan Soufi's Legacy: Inspiring the Next GenerationHesan Soufi
Hesan Soufi's impact on the game extends far beyond his on-field exploits. With his humility, sportsmanship, and unwavering commitment to excellence, Soufi has become a role model for aspiring footballers worldwide. His legacy lies not only in his achievements but also in the inspiration he provides to the next generation of talented players.
Georgia vs Portugal Georgia UEFA Euro 2024 Squad Khvicha Kvaratskhelia Leads ...Eticketing.co
UEFA Euro 2024 fans worldwide can book Georgia vs Portugal Tickets from our online platform www.eticketing.co. Fans can book Euro Cup Germany Tickets on our website at discounted prices.
Psaroudakis: Family and Football – The Psaroudakis Success StoryPsaroudakis
Psaroudakis, a name that resonates with football fans around the globe, is a testament to the powerful synergy between familial support and individual passion. Born on March 10, 1992, in the historic city of Heraklion, Crete, Psaroudakis’ journey to international football stardom is a compelling narrative of dedication, perseverance, and unwavering family support. His story not only highlights his athletic prowess but also underscores the crucial role his family played in shaping his career and character.
Psaroudakis’ early life in Heraklion was deeply influenced by a supportive and nurturing family environment. His father, a former semi-professional footballer, recognized Psaroudakis’ potential from an early age. Acting as his first coach, his father’s guidance was instrumental in igniting Psaroudakis’ passion for football. This paternal influence instilled in him a strong work ethic and fundamental skills that would become the foundation of his future success. His mother, a dedicated homemaker, provided a stable and nurturing environment, ensuring that Psaroudakis could pursue his dreams without any hindrances.
From a young age, Psaroudakis showed an innate talent for football. Growing up in Heraklion, he spent countless hours playing football in local parks and streets with friends and family. His natural ability was evident even in these informal settings, and his enthusiasm for the game was infectious. By the age of five, Psaroudakis had joined a local youth football club, where his skills began to flourish. His father’s role as his first coach during these formative years was crucial, as he emphasized not only technical skills but also the importance of discipline and teamwork.
The transition from playing in local parks to joining a structured football environment marked a significant step in Psaroudakis’ journey. At the age of ten, he joined the youth academy of OFI Crete, one of Greece’s most esteemed football clubs. This move marked the beginning of a more rigorous and professional approach to his training. The academy environment was demanding, focusing on honing technical abilities and instilling values of sportsmanship and dedication. Psaroudakis’ dedication to his craft was evident as he quickly rose through the ranks, becoming a standout player in the youth teams.
The support of Psaroudakis’ family was unwavering during this critical period. His father continued to be a source of guidance and mentorship, while his mother ensured that he had everything he needed to succeed. Their collective efforts created a balanced environment where Psaroudakis could focus entirely on his development as a footballer. This familial support was not just about providing the basics; it was about creating an environment where Psaroudakis felt encouraged and motivated to pursue his dreams relentlessly.
As Psaroudakis transitioned from the youth academy to professional football, the challenges became more significant.
2. 2
Outline
What is the IR problem?
How to organize an IR system? (Or the
main processes in IR)
Indexing
Retrieval
System evaluation
Some current research topics
3. 3
The problem of IR
Goal = find documents relevant to an information
need from a large document set
[Figure: an information need is formulated as a query; the IR system
searches the document collection and returns an answer list]
5. 5
IR problem
First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer
Editor: Addison-Wesley
Date: 1989
Content: <Text>
external attributes and internal attribute (content)
Search by external attributes = Search in DB
IR: search by content
6. 6
Possible approaches
1. String matching (linear search in
documents)
- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible to further improvement
8. 8
Main problems in IR
Document and query indexing
How to best represent their contents?
Query evaluation (or retrieval process)
To what extent does a document correspond
to a query?
System evaluation
How good is a system?
Are the retrieved documents relevant?
(precision)
Are all the relevant documents retrieved?
(recall)
9. 9
Document indexing
Goal = Find the important meanings and create an
internal representation
Factors to consider:
Accuracy to represent meanings (semantics)
Exhaustiveness (cover all the contents)
Facility for computer to manipulate
What is the best representation of contents?
Char. string (char trigrams): not precise enough
Word: good coverage, not precise
Phrase: poor coverage, more precise
Concept: poor coverage, precise
[Figure: spectrum from String to Word to Phrase to Concept; coverage
(recall) decreases and accuracy (precision) increases along the spectrum]
10. 10
Keyword selection and weighting
How to select important keywords?
Simple method: using middle-frequency words
[Figure: frequency and informativity plotted against term rank;
informativity is highest for middle-frequency terms, between a Max.
and a Min. frequency cutoff]
11. 11
tf*idf weighting scheme
tf = term frequency
frequency of a term/keyword in a document
The higher the tf, the higher the importance (weight) for the doc.
df = document frequency
no. of documents containing the term
distribution of the term
idf = inverse document frequency
the unevenness of term distribution in the corpus
the specificity of a term to a document
The more evenly a term is distributed, the less specific it is to any
one document
weight(t,D) = tf(t,D) * idf(t)
12. 12
Some common tf*idf schemes
tf(t, D) = freq(t,D)                 idf(t) = log(N/n)
tf(t, D) = log[freq(t,D)]            n = #docs containing t
tf(t, D) = log[freq(t,D)] + 1        N = #docs in corpus
tf(t, D) = freq(t,D)/Max[freq(t,D)]
weight(t,D) = tf(t,D) * idf(t)
Normalization: Cosine normalization, /max, …
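As a sketch, the first scheme above (raw tf with idf = log(N/n)) can be implemented directly; the two-document corpus is a made-up example.

```python
import math

# Minimal tf*idf indexer using tf(t,D) = freq(t,D) and idf(t) = log(N/n);
# the two toy documents below are illustrative assumptions.
def tfidf_index(docs):
    N = len(docs)
    df = {}                                  # n = #docs containing t
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    index = {}
    for doc_id, terms in docs.items():
        tf = {}
        for t in terms:
            tf[t] = tf.get(t, 0) + 1         # freq(t,D)
        index[doc_id] = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    return index

index = tfidf_index({"D1": ["comput", "architect", "comput"],
                     "D2": ["comput", "network"]})
```

Note that "comput" occurs in every document, so its idf (and hence its weight) is 0 — exactly the discounting effect idf is meant to have.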
13. 13
Document length normalization
Sometimes additional normalizations are applied, e.g. for document
length (pivoted normalization):
pivoted_weight(t,D) = weight(t,D) / ((1 - slope) + slope * normalized_weight(t,D) / pivot)
[Figure: probability of relevance and probability of retrieval plotted
against document length; the slope is fitted around the pivot, where
the two curves cross]
14. 14
Stopwords / Stoplist
Function words do not bear useful information for IR:
of, in, about, with, I, although, …
Stoplist: contains stopwords, not to be used as index terms
Prepositions
Articles
Pronouns
Some adverbs and adjectives
Some frequent words (e.g. document)
The removal of stopwords usually improves IR
effectiveness
A few “standard” stoplists are commonly used.
15. 15
Stemming
Reason:
Different word forms may bear similar meaning
(e.g. search, searching): create a “standard”
representation for them
Stemming:
Removing some word endings
computer, compute, computes, computing,
computed, computation → comput
16. 16
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping,
Program, 14(3) :130-137)
Step 1: plurals and past participles
SSES -> SS caresses -> caress
(*v*) ING -> motoring -> motor
Step 2: adj->n, n->v, n->adj, …
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ATIONAL -> ATE relational -> relate
Step 3:
(m>0) ICATE -> IC triplicate -> triplic
Step 4:
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
Step 5:
(m>1) E -> probate -> probat
(m > 1 and *d and *L) -> single letter controll -> control
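A small fragment of Step 1's plural rules gives the flavor of the algorithm; this is only a sketch, not the full Porter stemmer with its measure m and conditions:

```python
# Illustrative sketch of Porter's Step 1a plural rules only; the full
# algorithm adds the measure m, conditions like (*v*), and more steps.
def porter_step1a(word):
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress unchanged
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word
```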
17. 17
Lemmatization
Transform to a standard form according to syntactic
category.
E.g. verb + ing → verb
noun + s → noun
Needs POS tagging
More accurate than stemming, but needs more resources
Crucial to choose stemming/lemmatization rules:
noise vs. recognition rate
compromise between precision and recall
light/no stemming: -recall +precision
severe stemming: +recall -precision
18. 18
Result of indexing
Each document is represented by a set of weighted
keywords (terms):
D1 {(t1, w1), (t2,w2), …}
e.g. D1 {(comput, 0.2), (architect, 0.3), …}
D2 {(comput, 0.1), (network, 0.5), …}
Inverted file:
comput {(D1,0.2), (D2,0.1), …}
Inverted file is used during retrieval for higher efficiency.
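The inverted file above can be built from the per-document representations in a few lines (a sketch over the slide's D1/D2 example):

```python
# Build the inverted file: from per-document weighted keywords to
# per-term postings lists of (doc_id, weight) pairs.
def build_inverted_file(docs):
    inverted = {}
    for doc_id, weights in docs.items():
        for term, w in weights.items():
            inverted.setdefault(term, []).append((doc_id, w))
    return inverted

inv = build_inverted_file({"D1": {"comput": 0.2, "architect": 0.3},
                           "D2": {"comput": 0.1, "network": 0.5}})
# inv["comput"] → [("D1", 0.2), ("D2", 0.1)]
```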
19. 19
Retrieval
The problems underlying retrieval
Retrieval model
How is a document represented with the
selected keywords?
How are document and query representations
compared to calculate a score?
Implementation
20. 20
Cases
1-word query:
The documents to be retrieved are those that
include the word
- Retrieve the inverted list for the word
- Sort in decreasing order of the weight of the word
Multi-word query?
- Combining several lists
- How to interpret the weight?
(IR model)
21. 21
IR models
Matching score model
Document D = a set of weighted keywords
Query Q = a set of non-weighted keywords
R(D, Q) = Σi w(ti, D)
where ti is in Q.
22. 22
Boolean model
Document = Logical conjunction of keywords
Query = Boolean expression of keywords
R(D, Q) = D → Q
e.g. D = t1 ∧ t2 ∧ … ∧ tn
Q = (t1 ∧ t2) ∨ (t3 ∧ t4)
D → Q, thus R(D, Q) = 1.
Problems:
R is either 1 or 0 (unordered set of documents)
Queries return either many documents or few documents
End-users cannot manipulate Boolean operators correctly
E.g. documents about kangaroos and koalas
23. 23
Extensions to Boolean model
(for document ordering)
D = {…, (ti, wi), …}: weighted keywords
Interpretation:
D is a member of class ti to degree wi.
In terms of fuzzy sets: μti(D) = wi
A possible evaluation:
R(D, ti) = μti(D);
R(D, Q1 ∧ Q2) = min(R(D, Q1), R(D, Q2));
R(D, Q1 ∨ Q2) = max(R(D, Q1), R(D, Q2));
R(D, ¬Q1) = 1 - R(D, Q1).
24. 24
Vector space model
Vector space = all the keywords encountered
<t1, t2, t3, …, tn>
Document
D = < a1, a2, a3, …, an>
ai = weight of ti in D
Query
Q = < b1, b2, b3, …, bn>
bi = weight of ti in Q
R(D,Q) = Sim(D,Q)
26. 26
Some formulas for Sim
Dot product: Sim(D,Q) = Σi (ai * bi)
Cosine:      Sim(D,Q) = Σi (ai * bi) / √(Σi ai^2 * Σi bi^2)
Dice:        Sim(D,Q) = 2 * Σi (ai * bi) / (Σi ai^2 + Σi bi^2)
Jaccard:     Sim(D,Q) = Σi (ai * bi) / (Σi ai^2 + Σi bi^2 - Σi (ai * bi))
[Figure: D and Q drawn as vectors in the plane spanned by t1 and t2;
cosine measures the angle between them]
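The four similarity formulas can be written down directly for dense vectors D and Q of equal length (a minimal sketch):

```python
import math

# The four Sim formulas for dense vectors d and q of equal length.
def dot(d, q):     return sum(a * b for a, b in zip(d, q))
def cosine(d, q):  return dot(d, q) / math.sqrt(dot(d, d) * dot(q, q))
def dice(d, q):    return 2 * dot(d, q) / (dot(d, d) + dot(q, q))
def jaccard(d, q): return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))
```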
27. 27
Implementation (space)
Matrix is very sparse: a few 100s terms for
a document, and a few terms for a query,
while the term space is large (~100k)
Stored as:
D1 {(t1, a1), (t2,a2), …}
t1 {(D1,a1), …}
28. 28
Implementation (time)
The implementation of VSM with dot product:
Naïve implementation: O(m*n)
Implementation using inverted file:
Given a query = {(t1,b1), (t2,b2)}:
1. find the sets of related documents through inverted file for
t1 and t2
2. calculate the score of the documents to each weighted term
(t1,b1) {(D1,a1 *b1), …}
3. combine the sets and sum the weights ()
O(|Q|*n)
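Steps 1–3 above can be sketched as a term-at-a-time accumulator over the inverted file; only the postings of query terms are touched, giving the O(|Q|*n) behavior. The postings and query weights are a made-up example:

```python
# Dot-product retrieval via the inverted file: for each weighted query
# term, walk its postings list and accumulate partial scores per document.
def retrieve(inverted, query):
    scores = {}
    for term, b in query.items():                 # each weighted query term
        for doc_id, a in inverted.get(term, []):  # its postings list
            scores[doc_id] = scores.get(doc_id, 0.0) + a * b
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = retrieve({"t1": [("D1", 0.2), ("D2", 0.1)],
                    "t2": [("D2", 0.5)]},
                   {"t1": 1.0, "t2": 1.0})
# D2 accumulates 0.1 + 0.5, D1 accumulates 0.2
```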
29. 29
Other similarities
Cosine:
Sim(D,Q) = Σi [ (ai / √(Σj aj^2)) * (bi / √(Σj bj^2)) ]
- use √(Σj aj^2) and √(Σj bj^2) to normalize the
weights after indexing
- then a simple dot product at retrieval time
(Similar operations do not apply to Dice and
Jaccard)
30. 30
Probabilistic model
Given D, estimate P(R|D) and P(NR|D)
P(R|D) = P(D|R)*P(R)/P(D)   (P(D), P(R) constant)
D = {t1=x1, t2=x2, …}  where xi = 1 if ti is present in D, 0 if absent
P(D|R) = Πi P(ti=xi|R) = Πi pi^xi * (1-pi)^(1-xi)
P(D|NR) = Πi qi^xi * (1-qi)^(1-xi)
with pi = P(ti=1|R) and qi = P(ti=1|NR)
32. 32
Prob. model (cont’d)
How to estimate pi and qi?
A set of N relevant and irrelevant samples:

              Rel. doc.   Irrel. doc.        Total
with ti       ri          ni - ri            ni
without ti    Ri - ri     N - Ri - ni + ri   N - ni
Total         Ri          N - Ri             N

pi = ri / Ri
qi = (ni - ri) / (N - Ri)
33. 33
Prob. model (cont’d)
Odd(D) = Σ_{ti∈D} xi * log[ pi(1-qi) / (qi(1-pi)) ]
       = Σ_{ti∈D} xi * log[ ri(N - Ri - ni + ri) / ((Ri - ri)(ni - ri)) ]
Smoothing (Robertson-Sparck-Jones formula):
wi = log[ (ri + 0.5)(N - Ri - ni + ri + 0.5) / ((Ri - ri + 0.5)(ni - ri + 0.5)) ]
Odd(D) = Σ_{ti∈D} xi * wi
When no sample is available:
pi = 0.5,
qi = (ni + 0.5)/(N + 0.5) ≈ ni/N
May be implemented as VSM
34. 34
BM25
Score(D,Q) = Σ_{t∈Q} w * ((k1 + 1)*tf / (K + tf)) * ((k3 + 1)*qtf / (k3 + qtf))
             + k2 * |Q| * (avdl - dl) / (avdl + dl)
K = k1 * ((1 - b) + b * dl/avdl)
k1, k2, k3, b: parameters
qtf: query term frequency
dl: document length
avdl: average document length
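A sketch of BM25 scoring along these lines; the parameter defaults and the idf-style weight w (used here since no relevance samples are available) are common illustrative choices, not prescribed by the slides:

```python
import math

# BM25 scoring sketch. query_tf/doc_tf map term -> frequency, df maps
# term -> document frequency; parameter defaults are illustrative.
def bm25_score(query_tf, doc_tf, dl, avdl, df, n_docs,
               k1=1.2, k2=0.0, k3=8.0, b=0.75):
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        # idf-style term weight w (no relevance information assumed)
        w = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    # document-length correction term (vanishes with k2 = 0)
    score += k2 * len(query_tf) * (avdl - dl) / (avdl + dl)
    return score
```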
35. 35
(Classic) Presentation of results
Query evaluation result is a list of documents,
sorted by their similarity to the query.
E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…
36. 36
System evaluation
Efficiency: time, space
Effectiveness:
How is a system capable of retrieving relevant
documents?
Is a system better than another one?
Metrics often used (together):
Precision = retrieved relevant docs / retrieved docs
Recall = retrieved relevant docs / relevant docs
[Figure: Venn diagram of the relevant and retrieved document sets;
their intersection is the retrieved relevant documents]
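The two metrics are a few lines each; the document lists below are a made-up example:

```python
# Precision and recall for one query, as defined above.
def precision_recall(retrieved, relevant):
    hit = len(set(retrieved) & set(relevant))
    return hit / len(retrieved), hit / len(relevant)

p, r = precision_recall(["D1", "D2", "D3", "D4"], ["D1", "D3", "D7"])
# p = 2/4, r = 2/3
```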
37. 37
General form of precision/recall
[Figure: precision (up to 1.0) plotted against recall (up to 1.0),
generally decreasing as recall grows]
- Precision changes w.r.t. Recall (not a fixed point)
- Systems cannot be compared at a single Precision/Recall point
- Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)
38. 38
An illustration of P/R calculation
List   Rel?
Doc1   Y
Doc2
Doc3   Y
Doc4   Y
Doc5
…
Assume: 5 relevant docs in total.
(Recall, Precision) after each document in the list:
(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6), …
39. 39
MAP (Mean Average Precision)
MAP = (1/n) Σi [ (1/|Ri|) Σ_{Dj∈Ri} (j / rij) ]
rij = rank of the j-th relevant document for Qi
|Ri| = #rel. doc. for Qi
n = # test queries
E.g. two queries:
Q1: rel. docs at ranks 1, 5, 10   Q2: rel. docs at ranks 4, 8
MAP = 1/2 [ 1/3 (1/1 + 2/5 + 3/10) + 1/2 (1/4 + 2/8) ]
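The example can be checked directly; this sketch divides by the number of relevant documents found, which equals |Ri| here because every relevant document is retrieved:

```python
# Direct implementation of the MAP formula: average of j/rij per query,
# then averaged over queries.
def average_precision(rel_ranks):
    ranks = sorted(rel_ranks)        # ranks rij of the relevant docs, 1-based
    return sum(j / r for j, r in enumerate(ranks, start=1)) / len(ranks)

def mean_average_precision(per_query_ranks):
    return sum(average_precision(r) for r in per_query_ranks) / len(per_query_ranks)

# The example above: rel. docs at ranks 1, 5, 10 (Q1) and 4, 8 (Q2)
map_value = mean_average_precision([[1, 5, 10], [4, 8]])
```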
40. 40
Some other measures
Noise = retrieved irrelevant docs / retrieved docs
Silence = non-retrieved relevant docs / relevant docs
Noise = 1 – Precision; Silence = 1 – Recall
Fallout = retrieved irrel. docs / irrel. docs
Single value measures:
F-measure = 2 P * R / (P + R)
Average precision = average at 11 points of recall
Precision at n documents (often used for Web IR)
Expected search length (no. of irrelevant documents to read
before obtaining n relevant docs)
41. 41
Test corpus
Compare different IR systems on the same
test corpus
A test corpus contains:
A set of documents
A set of queries
Relevance judgment for every document-query pair
(desired answers for each query)
The results of a system are compared with the
desired answers.
42. 42
An evaluation example
(SMART)
Run number: 1 2
Num_queries: 52 52
Total number of documents over
all queries
Retrieved: 780 780
Relevant: 796 796
Rel_ret: 246 229
Recall - Precision Averages:
at 0.00 0.7695 0.7894
at 0.10 0.6618 0.6449
at 0.20 0.5019 0.5090
at 0.30 0.3745 0.3702
at 0.40 0.2249 0.3070
at 0.50 0.1797 0.2104
at 0.60 0.1143 0.1654
at 0.70 0.0891 0.1144
at 0.80 0.0891 0.1096
at 0.90 0.0699 0.0904
at 1.00 0.0699 0.0904
Average precision for all points
11-pt Avg: 0.2859 0.3092
% Change: 8.2
Recall:
Exact: 0.4139 0.4166
at 5 docs: 0.2373 0.2726
at 10 docs: 0.3254 0.3572
at 15 docs: 0.4139 0.4166
at 30 docs: 0.4139 0.4166
Precision:
Exact: 0.3154 0.2936
At 5 docs: 0.4308 0.4192
At 10 docs: 0.3538 0.3327
At 15 docs: 0.3154 0.2936
At 30 docs: 0.1577 0.1468
43. 43
The TREC experiments
Once per year
A set of documents and queries are distributed
to the participants (the standard answers are
unknown) (April)
Participants work (very hard) to construct, fine-
tune their systems, and submit the answers
(1000/query) at the deadline (July)
NIST people manually evaluate the answers
and provide correct answers (and classification
of IR systems) (July – August)
TREC conference (November)
44. 44
TREC evaluation methodology
Known document collection (>100K) and query set
(50)
Submission of 1000 documents for each query by
each participant
Merge 100 first documents of each participant ->
global pool
Human relevance judgment of the global pool
The other documents are assumed to be irrelevant
Evaluation of each system (with 1000 answers)
Partial relevance judgments
But stable for system ranking
45. 45
Tracks (tasks)
Ad Hoc track: given document collection, different
topics
Routing (filtering): stable interests (user profile),
incoming document flow
CLIR: Ad Hoc, but with queries in a different
language
Web: a large set of Web pages
Question-Answering: When did Nixon visit China?
Interactive: put users into action with system
Spoken document retrieval
Image and video retrieval
Information tracking: new topic / follow up
46. 46
CLEF and NTCIR
CLEF = Cross-Language Evaluation
Forum
for European languages
organized by Europeans
Once per year (March – Oct.)
NTCIR:
Organized by NII (Japan)
For Asian languages
cycle of 1.5 year
47. 47
Impact of TREC
Provide large collections for further
experiments
Compare different systems/techniques on
realistic data
Develop new methodology for system
evaluation
Similar experiments are organized in other
areas (NLP, Machine translation,
Summarization, …)
48. 48
Some techniques to
improve IR effectiveness
Interaction with user (relevance feedback)
- Keywords only cover part of the contents
- User can help by indicating relevant/irrelevant
document
The use of relevance feedback
To improve query expression:
Qnew = α*Qold + β*Rel_d - γ*NRel_d
where Rel_d = centroid of relevant documents
NRel_d = centroid of non-relevant documents
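A sketch of the reformulation, with documents and queries as sparse term→weight dicts; the α, β, γ defaults are common illustrative values, not taken from the slides:

```python
# Rocchio query reformulation: Qnew = alpha*Qold + beta*Rel_d - gamma*NRel_d,
# where Rel_d / NRel_d are centroids of the (non-)relevant documents.
def rocchio(q_old, rel_docs, nrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(q_old)
    for d in rel_docs + nrel_docs:
        terms |= set(d)
    q_new = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in rel_docs) / max(len(rel_docs), 1)
        nrel = sum(d.get(t, 0.0) for d in nrel_docs) / max(len(nrel_docs), 1)
        w = alpha * q_old.get(t, 0.0) + beta * rel - gamma * nrel
        if w > 0:                    # negative weights are usually dropped
            q_new[t] = w
    return q_new
```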
49. 49
Effect of RF
[Figure: relevant (*) and non-relevant (x) documents scattered around
the query Q; after feedback, the new query Qnew moves toward the
centroid R of the relevant documents and away from the non-relevant
region NR, so the 2nd retrieval covers more relevant documents than
the 1st retrieval]
50. 50
Modified relevance feedback
Users usually do not cooperate (e.g.
AltaVista in early years)
Pseudo-relevance feedback (Blind RF)
Using the top-ranked documents as if they
are relevant:
Select m terms from n top-ranked documents
One can usually obtain about 10% improvement
51. 51
Query expansion
A query contains part of the important words
Add new (related) terms into the query
Manually constructed knowledge base/thesaurus
(e.g. Wordnet)
Q = information retrieval
Q’ = (information + data + knowledge + …)
(retrieval + search + seeking + …)
Corpus analysis:
two terms that often co-occur are related (Mutual
information)
Two terms that co-occur with the same words are
related (e.g. T-shirt and coat with wear, …)
52. 52
Global vs. local context analysis
Global analysis: use the whole document
collection to calculate term relationships
Local analysis: use the query to retrieve a
subset of documents, then calculate term
relationships
Combine pseudo-relevance feedback and term co-
occurrences
More effective than global analysis
53. 53
Some current research topics:
Go beyond keywords
Keywords are not perfect representatives of concepts
Ambiguity:
table = data structure, furniture?
Lack of precision:
“operating”, “system” less precise than “operating_system”
Suggested solution
Sense disambiguation (difficult due to the lack of contextual
information)
Using compound terms (no complete dictionary of
compound terms, variation in form)
Using noun phrases (syntactic patterns + statistics)
Still a long way to go
55. 55
Logical models
How to describe the relevance
relation as a logical relation?
D => Q
What are the properties of this
relation?
How to combine uncertainty with a
logical framework?
The problem: What is relevance?
56. 56
Related applications:
Information filtering
IR: changing queries on stable document collection
IF: incoming document flow with stable interests
(queries)
yes/no decision (instead of ordering documents)
Advantage: the description of user’s interest may be
improved using relevance feedback (the user is more willing
to cooperate)
Difficulty: adjust threshold to keep/ignore document
The basic techniques used for IF are the same as those for
IR – “Two sides of the same coin”
[Figure: an incoming stream …, doc3, doc2, doc1 enters the IF system,
which keeps or ignores each document according to the user profile]
57. 57
IR for (semi-)structured
documents
Using structural information to assign weights
to keywords (Introduction, Conclusion, …)
Hierarchical indexing
Querying within some structure (search in
title, etc.)
INEX experiments
Using hyperlinks in indexing and retrieval
(e.g. Google)
…
58. 58
PageRank in Google
Assign a numeric value to each page
The more a page is referred to by important pages, the more this
page is important
PR(A) = (1 - d) + d * Σi PR(Ii)/C(Ii)
where the Ii are the pages linking to A, C(Ii) is the number of
outgoing links of Ii, and d is a damping factor (0.85)
Many other criteria: e.g. proximity of query words
“…information retrieval …” better than “… information … retrieval …”
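The PageRank formula can be iterated to a fixed point; the three-page link graph below is a made-up example:

```python
# Power-iteration sketch of PR(A) = (1-d) + d * sum(PR(I)/C(I)) over
# pages I linking to A; links maps each page to its outgoing targets.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    out_degree = {p: len(targets) for p, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / out_degree[q]
                           for q, targets in links.items() if p in targets)
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

ranks = pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]})
```

Page A has two in-links (from B and C) and so ends up with the highest rank, illustrating "the more a page is referred to, the more important it is".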
59. 59
IR on the Web
No stable document collection (spider,
crawler)
Invalid document, duplication, etc.
Huge number of documents (partial
collection)
Multimedia documents
Great variation of document quality
Multilingual problem
…
60. 60
Final remarks on IR
IR is related to many areas:
NLP, AI, database, machine learning, user
modeling…
library, Web, multimedia search, …
Relatively weak theories
Very strong tradition of experiments
Many remaining (and exciting) problems
Difficult area: Intuitive methods do not
necessarily improve effectiveness in practice
61. 61
Why is IR difficult
Vocabulary mismatch
Synonymy: e.g. car vs. automobile
Polysemy: e.g. table
Queries are ambiguous, they are partial specification
of user’s need
Content representation may be inadequate and
incomplete
The user is the ultimate judge, but we don’t know
how the judge judges…
The notion of relevance is imprecise, context- and user-
dependent
But how rewarding it is to gain a 10%
improvement!