Question Classification using Semantic, Syntactic and Lexical features (dannyijwest)
This document summarizes research on improving question classification accuracy through the use of machine learning and a combination of semantic, syntactic, and lexical features. The researchers tested various classifiers like Naive Bayes, k-Nearest Neighbors, and Support Vector Machines on the UIUC question classification dataset. Their best results were achieved using a Support Vector Machine classifier trained on features including question headwords, hypernyms from WordNet, part-of-speech tags, and word shapes, achieving 96.2% accuracy for coarse-grained and 91.1% for fine-grained classification. This outperformed previous state-of-the-art results, demonstrating that combining semantic and syntactic features with lexical features improves automated question classification.
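The feature mix described above can be sketched in code. This is an illustrative sketch, not the authors' implementation: the tiny wh-word list stands in for syntactic analysis, and the `hypernyms` argument stands in for a WordNet lookup.

```python
def word_shape(token):
    """Map characters to a coarse shape: X=upper, x=lower, d=digit."""
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in token)

WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def extract_features(question, hypernyms=None):
    """Return a sparse feature dict for one question.
    `hypernyms` is a stand-in for an external WordNet lookup."""
    tokens = question.rstrip("?").split()
    feats = {}
    for t in tokens:                      # lexical: unigrams + word shapes
        feats[f"w={t.lower()}"] = 1
        feats[f"shape={word_shape(t)}"] = 1
    wh = next((t.lower() for t in tokens if t.lower() in WH_WORDS), None)
    if wh:                                # syntactic cue: wh-word
        feats[f"wh={wh}"] = 1
    for h in (hypernyms or []):           # semantic: WordNet hypernyms
        feats[f"hyper={h}"] = 1
    return feats

feats = extract_features("Who wrote Hamlet", hypernyms=["person", "communicator"])
```

Such dictionaries would then be vectorized and fed to a linear classifier (the paper's best results use an SVM).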
The document discusses query-based summarization, including defining the task, evaluation criteria, and different approaches. Some key approaches discussed are using document graphs to relate sentences, linguistic analysis techniques, machine learning methods like using support vector machines, and domain-specific systems for tasks like medical summarization and opinion summarization. Evaluation is typically done using the ROUGE method to assess recall against a model summary.
The document discusses query-based summarization, including defining the task, evaluation criteria, and different approaches used. Some key approaches discussed are using document graphs to identify relevant sections, rhetorical structure theory to create a graph representation, linguistics techniques like Hidden Markov Models for sentence selection, and machine learning methods like using support vector machines to rank sentences. Different domains like medical and opinion summarization are also outlined.
The document discusses query-based summarization, including defining the task, evaluation criteria, and general and application-specific approaches. Query-based summarization produces a summary from documents based on a query, and can be extractive or abstractive. Evaluation uses ROUGE metrics. Approaches include document graphs, linguistics, machine learning, and application-tailored systems for domains like medicine and opinion summarization.
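ROUGE-N recall, the overlap measure these evaluations rely on, can be sketched as follows. This toy version ignores stemming and multiple reference summaries, which the official ROUGE toolkit handles.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the reference n-gram count."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())          # Counter & = clipped min counts
    return overlap / max(1, sum(ref.values()))    # recall w.r.t. the reference

score = rouge_n_recall("the cat sat on the mat", "the cat is on the mat")
```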
Apply deep learning to improve the question analysis model in the Vietnamese ... (IJECEIAES)
Question answering (QA) systems are now widely used for automated answering, and the semantic analysis of the question plays an important role, directly affecting the accuracy of the system. In this article, we propose an improvement to question-answering models by adding more specific question analysis steps, including contextual characteristic analysis, POS-tag analysis, and question-type analysis built on a deep learning network architecture. Weights of words extracted through the question analysis steps are combined with the best matching 25 (BM25) algorithm to find the most relevant paragraph of text and are incorporated into the QA model to find the best and least noisy answer. The dataset for the question analysis step consists of 19,339 labeled questions covering a variety of topics. Results of the question analysis model are combined to train the question-answering model on a dataset related to the learning regulations of the Industrial University of Ho Chi Minh City. It includes 17,405 question-answer pairs for the training set and 1,600 pairs for the test set, where the robustly optimized BERT pretraining approach (RoBERTa) model has an F1-score of 74%. The model has improved significantly: for long and complex questions, the model has extracted weights and correctly provided answers based on the question's contents.
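The BM25 ranking the article builds on can be sketched as follows. The parameter defaults k1=1.5 and b=0.75 are conventional values, not necessarily the paper's settings, and the toy documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))      # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["tuition", "fee", "regulation"],
        ["exam", "schedule"],
        ["fee", "waiver", "rules"]]
scores = bm25_scores(["fee", "regulation"], docs)
```

The paragraph with the highest score would then be passed to the QA model, with the question-analysis weights added on top.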
This paper proposes a Tamil document summarization system that utilizes statistical, semantic, and heuristic methods to generate a coherent multi-document summary based on a given query. The system performs Latent Dirichlet Allocation (LDA) topic modeling on document clusters to identify important topics and words. Sentences are then scored based on topic modeling results and redundancy is removed using Maximal Marginal Relevance. The summary is generated from the highest scoring sentences in different perspectives based on the query topic or entities. Evaluation results show the system effectively summarizes multiple documents according to the query.
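The Maximal Marginal Relevance step used above for redundancy removal can be sketched like this. Token-overlap (Jaccard) similarity and the weight lam=0.7 are illustrative assumptions, not the system's actual scoring.

```python
def jaccard(a, b):
    """Token-set overlap between two strings."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(query, sentences, k=2, lam=0.7):
    """Greedily pick sentences that are relevant to the query
    but dissimilar to sentences already selected."""
    selected, pool = [], list(sentences)
    while pool and len(selected) < k:
        best = max(pool, key=lambda s: lam * jaccard(query, s)
                   - (1 - lam) * max((jaccard(s, t) for t in selected),
                                     default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

picked = mmr_select("flood relief in chennai",
                    ["chennai flood relief efforts continue",
                     "relief efforts continue in chennai after flood",
                     "cricket scores from yesterday"], k=2)
```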
Advantages of Query Biased Summaries in Information Retrieval (Onur Yılmaz)
Presentation of the paper:
Anastasios Tombros and Mark Sanderson. Advantages of query biased summaries in information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). ACM, New York, NY, USA, 2-10, 1998. DOI: 10.1145/290941.290947
Architecture of an ontology based domain-specific natural language question a... (IJwest)
The document summarizes the architecture of an ontology-based domain-specific natural language question answering system. The proposed architecture defines four main modules: 1) question processing which analyzes and classifies questions and reformulates queries, 2) document retrieval which retrieves relevant documents, 3) document processing which processes retrieved documents, and 4) answer extraction which extracts and generates responses. Natural language processing techniques and ontologies are used to analyze questions and documents and extract relationships and answers. The system aims to generate concise, specific answers to natural language questions in a given domain and achieved 94% accuracy in testing.
The International Journal of Engineering and Science (IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Relevance feature discovery for text mining (redpel dot com)
The document discusses relevance feature discovery for text mining. It presents an innovative model that discovers both positive and negative patterns in text documents as higher-level features and uses them to classify terms into categories and update term weights based on their specificity and distribution in patterns. Experiments on standard datasets show the proposed model outperforms both term-based and pattern-based methods.
Machine learning for text document classification-efficient classification ap... (IAESIJAI)
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for that category enhances the performance of the classifier, yielding a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its categorization are also presented.
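The combination scheme described above might look like the following sketch. The blending weight `alpha`, the toy vectors, and the assumed MNB posteriors are all illustrative, not the paper's values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_scores(doc_vec, centroids, mnb_probs, alpha=0.5):
    """Blend cosine similarity to each category centroid with the
    category probability estimated by a conventional classifier."""
    return {c: alpha * cosine(doc_vec, centroids[c]) + (1 - alpha) * mnb_probs[c]
            for c in centroids}

doc = {"goal": 2, "match": 1}
centroids = {"sport": {"goal": 3, "match": 2}, "politics": {"vote": 4, "law": 1}}
mnb_probs = {"sport": 0.9, "politics": 0.1}     # assumed MNB posteriors
scores = combined_scores(doc, centroids, mnb_probs)
best = max(scores, key=scores.get)
```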
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
Automatic Text Summarization Using Natural Language Processing (1) (Don Dooley)
This document discusses automatic text summarization using natural language processing. It describes two main approaches for summarization - extractive and abstractive. Extractive summarization selects important sentences from the original text, while abstractive summarization generates new sentences to summarize the text. The document presents the objectives, methodologies, organization of the project report, and provides a literature review of papers on text summarization techniques.
This document presents a multi-stage framework for extractive summarization of scientific papers based on keyword profiling and language modeling. The framework first identifies keywords that capture a paper's most important contributions, then creates a keyword profile language model. It ranks sentences by divergence from this model and re-ranks them based on novelty to generate a 5-sentence summary. Evaluation shows the framework achieves state-of-the-art performance on summarizing individual papers and remains effective under more stringent length limits.
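The divergence-based ranking step can be sketched with a unigram KL divergence from the keyword-profile language model; sentences closest to the profile rank first. The smoothing constant and toy profile are assumptions.

```python
import math
from collections import Counter

def kl_divergence(p, q, vocab, eps=1e-6):
    """KL(p || q) over `vocab`, with a small floor for unseen terms."""
    return sum(p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps))
               for t in vocab)

def rank_by_profile(sentences, profile):
    """Order sentences by divergence of the profile from each
    sentence's unigram distribution (lowest divergence first)."""
    vocab = set(profile)
    scored = []
    for s in sentences:
        counts = Counter(s.lower().split())
        total = sum(counts.values())
        sent_dist = {t: counts[t] / total for t in counts}
        scored.append((kl_divergence(profile, sent_dist, vocab), s))
    return [s for _, s in sorted(scored)]

profile = {"summarization": 0.5, "keyword": 0.5}
ranked = rank_by_profile(["the weather is nice",
                          "keyword based summarization works"], profile)
```

A novelty re-ranking pass, as in the framework above, would then demote sentences too similar to ones already chosen.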
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is a methodology for extracting important events such as dates, places, and subjects of interest, and for presenting users with a shorter version of the text that contains all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
Summarization using ntc approach based on keyword extraction for discussion f... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
In this research work we have developed a new scoring mathematical model that works on five types of questions. The question's text features are first extracted and a score is found based on its structure with respect to its template structure; an answer score is then calculated against both the question and the paragraph text, to finally reach the index of the most probable answer with respect to the question.
The document summarizes research on multi-document summarization using EM clustering. It begins with an introduction to the topic and issues with existing techniques. It then proposes using Expectation-Maximization (EM) clustering to identify clusters, which improves over other methods by identifying latent semantic variables between sentences. The architecture involves preprocessing, EM clustering, mutual reinforcement ranking algorithms RARP and RDRP, summarization, and post-processing. Experimental results on DUC2007 data show EM clustering identifies more clusters and sentences than affinity propagation clustering. The technique aims to improve summarization accuracy by better capturing semantic relationships between sentences.
The document provides the scheme and syllabus of the M.Tech Computer Science and Engineering program for the 2013-14 session at Himachal Pradesh Technical University, Hamirpur.
It includes the course codes, subjects, lecture hours, examination schemes, and maximum marks for each subject in the first four semesters. The subjects cover areas like computer architecture, parallel processing, software engineering, optimization methods, data structures, algorithms, operating systems, computer networks, databases, and electives.
The document also lists the elective subjects, examination schedules including theory and practical sessions, and provides an overview of some of the individual subjects covering their learning objectives and topics.
This document describes a query expansion approach that clusters documents during an offline phase and uses n-grams and cluster tags to augment queries during an online phase. The offline phase clusters documents, adds synonyms, and indexes clusters. The online phase retrieves relevant clusters for queries, predicts expansion terms based on n-grams or cluster tags, augments the query, and provides expanded queries to the user. Evaluation shows the approach improves precision and recall over random query expansion.
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This process makes information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve classification accuracy. Although Arabic text categorization is a new and promising field, there has been little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a corpus of categorized Arabic documents; the weights of the tested document's words are then calculated to determine the document's keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
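The vector-evaluation idea might be sketched as follows, with simple tf-idf weights standing in for the paper's exact weighting scheme and invented toy data.

```python
import math
from collections import Counter

def top_keywords(doc_tokens, corpus, k=3):
    """Weight the document's words by tf-idf against the corpus
    and keep the k highest-weighted words as its keywords."""
    tf = Counter(doc_tokens)
    N = len(corpus)
    def weight(t):
        df = sum(1 for d in corpus if t in d)
        return tf[t] * math.log((N + 1) / (df + 1))
    return set(sorted(tf, key=weight, reverse=True)[:k])

def best_category(doc_keywords, category_keywords):
    """Assign the category whose keyword set overlaps the document's most."""
    return max(category_keywords,
               key=lambda c: len(doc_keywords & category_keywords[c]))

corpus = [["economy", "market", "trade"], ["football", "goal"],
          ["market", "stocks"]]
keywords = top_keywords(["market", "market", "stocks", "today"], corpus)
category = best_category(keywords, {"finance": {"market", "stocks", "economy"},
                                    "sport": {"goal", "football"}})
```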
A Comparison Of The Rule And Case-Based Reasoning Approaches For The Automati... (Darian Pruitt)
This dissertation explores using rule-based reasoning (RBR) and case-based reasoning (CBR) for automating help desk operations at the tier-two level. The author builds RBR and CBR prototypes to solve a set of PC hardware and software problems. Trained help desk operators use the systems and provide feedback. The author finds that the CBR system provides more flexible and precise solutions, is easier to maintain, and enables problems to be solved more quickly, thereby reducing costs. The study supports using CBR over RBR for tier-two help desk automation.
This course teaches technical writing skills needed for a career in technology. It covers writing processes, styles, and formats for various technical documents. Key topics include organizing information, writing for different audiences, using graphics and visual elements, writing instructions, descriptions, reports, and other common technical document types. Students learn grammar and mechanics rules specific to technical writing. The goal is for students to gain proficiency in written communication skills required in technical fields.
8 efficient multi-document summary generation using neural network (INFOGAIN PUBLICATION)
This paper proposes a multi-document summarization system that uses bisect k-means clustering, an optimal merge function, and a neural network. The system first preprocesses input documents through stemming and removing stop words. It then applies bisect k-means clustering to group similar sentences. The clusters are merged using an optimal merge function to find important keywords. The NEWSUM algorithm is used to generate a primary summary for each keyword. A neural network trained on sentence classifications is then used to classify sentences in the primary summary as positive or negative. Only positively classified sentences are included in the final summary to improve accuracy. The system aims to generate a concise and accurate summary in a short period of time from multiple documents on a given topic.
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural language requirements of the automotive industry. The approach is based on the distributional hypothesis and the special characteristics of domain-specific German compounds. We extract taxonomies by using clustering techniques in combination with general thesauri. Such a taxonomy can be used to support requirements engineering in early stages by providing a common system understanding and an agreed-upon terminology. This work is part of an ontology-driven requirements engineering process, which builds on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common hierarchical clustering techniques.
Semantic Based Document Clustering Using Lexical Chains (IRJET Journal)
The document proposes a semantic-based document clustering approach using lexical chains. It uses WordNet to perform word sense disambiguation on documents to identify core semantic features. Lexical chains of semantically related words are then generated from the documents based on these features. The lexical chains represent the semantic content of documents and are used to cluster the documents. The results show improved clustering performance compared to traditional approaches. The approach aims to address challenges in text clustering like extracting core semantics, assigning meaningful cluster descriptions, and vocabulary diversity.
IRJET-Semantic Based Document Clustering Using Lexical Chains (IRJET Journal)
This document discusses a semantic-based document clustering approach using lexical chains. It proposes using WordNet to perform word sense disambiguation on documents to extract core semantic features represented as lexical chains. Lexical chains identify semantically related words in a text based on relations like synonyms and hypernyms. Documents are then clustered based on the lexical chains extracted. The approach aims to overcome issues in traditional clustering like synonyms and polysemy by incorporating semantic information from WordNet ontology. It is argued that identifying themes based on disambiguated semantic features extracted via lexical chains can improve text clustering performance compared to bag-of-words models. An evaluation of the approach showed better results when using a threshold of 50% for lexical chain selection.
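Lexical chaining of the kind described can be sketched with a toy relation table standing in for WordNet; a real system would query WordNet synonym and hypernym relations and perform word sense disambiguation first.

```python
RELATED = {                       # assumed mini-thesaurus (WordNet stand-in)
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
    "vehicle": {"car", "automobile", "truck"},
    "truck": {"vehicle"},
    "banana": {"fruit"},
    "fruit": {"banana"},
}

def related(a, b):
    """Words are chainable if identical or linked in the relation table."""
    return a == b or b in RELATED.get(a, set()) or a in RELATED.get(b, set())

def build_chains(words):
    """Greedily append each word to the first chain containing a related
    word, or start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

chains = build_chains(["car", "truck", "banana", "automobile", "fruit"])
```

Documents would then be clustered on the strongest chains rather than on raw bag-of-words vectors.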
Similar to The Methods To Support Ranking In Cross-Language Web Search.doc
the keywords of the corpus categorizes to determine the tested document's best category.
A Comparison Of The Rule And Case-Based Reasoning Approaches For The Automati...Darian Pruitt
This dissertation explores using rule-based reasoning (RBR) and case-based reasoning (CBR) for automating help desk operations at the tier-two level. The author builds RBR and CBR prototypes to solve a set of PC hardware and software problems. Trained help desk operators use the systems and provide feedback. The author finds that the CBR system provides more flexible and precise solutions, is easier to maintain, and enables problems to be solved more quickly, thereby reducing costs. The study supports using CBR over RBR for tier-two help desk automation.
This course teaches technical writing skills needed for a career in technology. It covers writing processes, styles, and formats for various technical documents. Key topics include organizing information, writing for different audiences, using graphics and visual elements, writing instructions, descriptions, reports, and other common technical document types. Students learn grammar and mechanics rules specific to technical writing. The goal is for students to gain proficiency in written communication skills required in technical fields.
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
This paper proposes a multi-document summarization system that uses bisect k-means clustering, an optimal merge function, and a neural network. The system first preprocesses input documents through stemming and removing stop words. It then applies bisect k-means clustering to group similar sentences. The clusters are merged using an optimal merge function to find important keywords. The NEWSUM algorithm is used to generate a primary summary for each keyword. A neural network trained on sentence classifications is then used to classify sentences in the primary summary as positive or negative. Only positively classified sentences are included in the final summary to improve accuracy. The system aims to generate a concise and accurate summary in a short period of time from multiple documents on a given topic.
Taxonomy extraction from automotive natural language requirements using unsup...ijnlc
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreedupon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
Semantic Based Document Clustering Using Lexical ChainsIRJET Journal
The document proposes a semantic-based document clustering approach using lexical chains. It uses WordNet to perform word sense disambiguation on documents to identify core semantic features. Lexical chains of semantically related words are then generated from the documents based on these features. The lexical chains represent the semantic content of documents and are used to cluster the documents. The results show improved clustering performance compared to traditional approaches. The approach aims to address challenges in text clustering like extracting core semantics, assigning meaningful cluster descriptions, and vocabulary diversity.
IRJET-Semantic Based Document Clustering Using Lexical ChainsIRJET Journal
This document discusses a semantic-based document clustering approach using lexical chains. It proposes using WordNet to perform word sense disambiguation on documents to extract core semantic features represented as lexical chains. Lexical chains identify semantically related words in a text based on relations like synonyms and hypernyms. Documents are then clustered based on the lexical chains extracted. The approach aims to overcome issues in traditional clustering like synonyms and polysemy by incorporating semantic information from WordNet ontology. It is argued that identifying themes based on disambiguated semantic features extracted via lexical chains can improve text clustering performance compared to bag-of-words models. An evaluation of the approach showed better results when using a threshold of 50% for lexical chain selection.
The Methods To Support Ranking In Cross-Language Web Search.doc
MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF DANANG
Lam Tung Giang
THE METHODS TO SUPPORT RANKING
IN CROSS-LANGUAGE WEB SEARCH
Specialty: Computer Science
Code: 62480101
DOCTORAL THESIS SUMMARY
Danang - 2017
The dissertation was completed at Danang University of Technology, University of Danang.
Supervisors:
- Associate Professor Vo Trung Hung, PhD.
- Associate Professor Huynh Cong Phap, PhD.
Opponent 1: Professor Hoang Van Kiem, PhD.
Opponent 2: Associate Professor Le Manh Thanh, PhD.
Opponent 3: Associate Professor Phan Huy Khanh, PhD.
The dissertation was defended before the Board at the University of Danang at 14:00 on May 26th, 2017.
PREFACE
Cross-language web search is the task of taking an information request given by the user in one language (the source language) and creating a list of relevant web documents in another language (the target language). Ranking in web search concerns creating the result of a search query in the form of a list of documents, ordered in descending order of relevance to the information need given by the user, and assuring that the "best" results appear early in the result list displayed to the user.
There are two central problems in cross-language web search. The first is translation, which represents queries and the documents to be searched in the same environment for matching. The second is ranking, which calculates the relevance level between documents and queries.
In cross-language information retrieval (CLIR), the dictionary-based approach is popular because of its simplicity and the ready availability of machine-readable dictionaries. The main limits of this approach are ambiguity and dictionary coverage. There is a need to develop techniques that improve query translation.
Web search differs from the traditional information retrieval long applied in library systems. An HTML document contains many different parts: title, summary or description, and content; each part can affect the search differently.
On the basis of this literature and practical review, the topic "The methods to support ranking in cross-language web search" is selected as the research content of this doctoral thesis, with the aim of creating a cross-language web search model and proposing technical solutions applied in the model to improve the effectiveness of search result ranking.
1. The Goals, Objectives and Scope of the Research
The goal of this thesis is to propose two groups of technical methods applied in cross-language web search. The first group is related to translation and consists of pre-translation query processing, disambiguation, and post-translation query processing methods. The second group includes the cross-language proximity models and a learning-to-rank model based on Genetic Programming. The main measure of effectiveness used in the thesis is the MAP (Mean Average Precision) score.
2. Thesis structure
In addition to the introduction, conclusion and future work sections, the thesis contains the following chapters:
Chapter 1: Overview and research proposal.
Chapter 2: Automatic translation in cross-language
information retrieval.
Chapter 3: Techniques to support query translation.
Chapter 4: Re-ranking.
Chapter 5: The Vietnamese-English web search system.
3. Thesis Contribution
- The proposal of new methods for disambiguation in the
translation module;
- The proposal of pre-translation query processing method;
- The proposal of query refining methods in the target
language;
- The proposal of cross-language proximity models;
- The proposal of a learning-to-rank model based on Genetic
Programming;
- The design of a Vietnamese-English web search system.
CHAPTER 1: OVERVIEW AND RESEARCH PROPOSAL
1.1. Information Retrieval
1.1.1. Introduction
1.1.2. Formal definition
1.1.3. Processing schema
There are two phases in an information retrieval system:
Phase 1: Data collection, processing, indexing and storage.
Phase 2: Querying.
1.1.4. Traditional IR models
Traditional IR models include Boolean model, Vector Space
model and Probabilistic model.
1.1.5. Models based on in-document term relation
The Latent Semantic Indexing (LSI) model and the
proximity models are based on the relation between terms in
documents.
1.2. Evaluation in Information Retrieval
1.3. Cross-language Information Retrieval
1.3.1. Introduction
Cross-language information retrieval concerns the case in which the language of the documents being searched differs from that of the query.
1.3.2. Approaches
The main two approaches in CLIR are query translation and
document translation.
1.4. Re-ranking techniques
1.5. Ranking in web-search
1.6. The limitation and research proposals
1.6.1. Limitation
The main limitations consist of the translation quality and
the lack of using special structure of HTML documents in ranking.
1.6.2. Research proposals
The research missions include the proposal of two groups of techniques: (1) translation techniques to create an environment in which query and document representations are comparable for matching; (2) techniques for improving the ranking quality of the search result list. As a result, the author proposes a ranking model for cross-language web search.
The following techniques are selected as research topics:
- Automatic translation techniques;
- The techniques supporting query translation, including pre-translation query processing in the source language and post-translation query optimizing in the target language;
- Learning to Rank methods;
- Building a cross-language web search system.
1.7. Summary
The research missions include the proposal of two groups of techniques: (1) translation techniques to create an environment in which query and document representations are comparable for matching; (2) techniques for improving the ranking quality of the search result list.
CHAPTER 2: AUTOMATIC TRANSLATION IN CROSS-LANGUAGE INFORMATION RETRIEVAL
2.1. Automatic translation approaches
2.2. Disambiguation in dictionary-based approach
The three main problems that can cause low performance for a dictionary-based CLIR system are dictionary coverage, query segmentation, and the selection of the correct translation for each query term. The third problem is a hot research topic and is known as disambiguation.
2.3. Dictionary-based disambiguation models
2.3.1. Variations of Mutual Information
2.3.1.1. Based on co-occurrence statistics of pairs of words
The common formula for calculating the Mutual Information value, describing the relation of a pair of words, has the following form:

MI(x, y) = log( p(x, y) / (p(x) × p(y)) )    (2.1)
where p(x, y) is the probability of the event that the two words x and y co-occur in the same sentence within a distance of no more than 5 words; p(x) and p(y) are the probabilities of the two words x and y appearing in the document collection.
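As an illustration only (the thesis gives no implementation), formula (2.1) can be sketched as below; estimating the probabilities by per-sentence relative frequencies is an assumption of this sketch:

```python
import math
from collections import Counter
from itertools import combinations

def mi_scores(sentences, window=5):
    # Estimate MI(x, y) per formula (2.1): log(p(x, y) / (p(x) * p(y))),
    # where p(x, y) counts sentences in which x and y co-occur within
    # `window` words, and p(x), p(y) count sentences containing each word.
    word_n, pair_n, total = Counter(), Counter(), 0
    for words in sentences:                      # each sentence: list of tokens
        total += 1
        for w in set(words):
            word_n[w] += 1
        seen = set()
        for i, j in combinations(range(len(words)), 2):
            if j - i <= window and words[i] != words[j]:
                seen.add(frozenset((words[i], words[j])))
        for pair in seen:
            pair_n[pair] += 1
    def mi(x, y):
        pxy = pair_n[frozenset((x, y))] / total
        px, py = word_n[x] / total, word_n[y] / total
        return math.log(pxy / (px * py)) if pxy else float("-inf")
    return mi
```

Words that never co-occur get a score of negative infinity, so any observed pair ranks above them.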
2.3.1.2. Based on search engines
With the two words x and y, the strings x, y and 'x AND y' are used as queries and are sent to the search engine. The values n(x), n(y), n(x, y) are the numbers of documents containing x, y, and x and y together. Then,

MI(x, y) = n(x, y) / (n(x) × n(y))    (2.2)
2.3.2. Algorithms for selecting best translations
The algorithms in this part are executed with the Vietnamese keywords v1, …, vn and lists of translation candidates L1, …, Ln (each list Li = (ei^1, …, ei^ki) contains the translation candidates of vi).
2.3.2.1. Algorithm using cohesion score
2.3.2.2. Algorithm SMI
Each candidate translation qtrans of the given query is represented in the form qtrans = (e1, ..., en), where ei is selected from the list Li. The function SMI (Summary Mutual Information) is defined as follows:

SMI(qtrans) = Σ_{ei, ej ∈ qtrans} MI(ei, ej)    (2.3)

The candidate translation with the highest value of the SMI function is selected as the English translation for a given Vietnamese query qv.
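The selection by formula (2.3) amounts to a brute-force search over all combinations of candidates; a minimal sketch, assuming an `mi` scoring function is supplied (e.g. built from formula (2.1) or (2.2)):

```python
from itertools import combinations, product

def smi(candidate, mi):
    # Formula (2.3): sum of MI over all pairs of terms in one candidate translation.
    return sum(mi(ei, ej) for ei, ej in combinations(candidate, 2))

def best_translation(candidate_lists, mi):
    # Enumerate every combination (e1, ..., en), taking one term from each
    # list Li, and keep the combination with the highest SMI value.
    return max(product(*candidate_lists), key=lambda cand: smi(cand, mi))
```

The enumeration is exponential in the number of keywords, which is acceptable for the short queries typical of web search.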
2.3.2.3. The algorithm SQ (Select translations sequentially)
At first, a list of pairs of translations (ti^k, ti+1^j) from all adjacent columns (i, i+1) is created. The two columns i0 and i0+1 containing the pair with the highest value of the MI function are selected to create the GoodColumns set. The best translation from the columns neighbouring the first two columns is then selected based on a cohesion score function as follows:

Cohesion(t) = Σ_{t' ∈ GoodColumns} MI(t, t')    (2.4)
The column containing the best translation is added to the GoodColumns set. The process continues until all columns are examined. Next, the translation candidates in each column are re-sorted. Finally, each Vietnamese keyword is associated with a list of translations, ordered in descending order of cohesion score.
2.3.3. Building query translation
2.3.3.1. Combining the two ways
The query has the following form:

qE = (e1^1 e1^2 … e1^k1) … (en^1 en^2 … en^kn)    (2.5)
2.3.3.2. Assigning weights based on the result of the disambiguation process

Given ei^1, ei^2, … being the translation options of keyword vi in the list Li with weights wi^1, wi^2, … respectively, the query in the target language has the following form:

qE = ((w1^1 e1^1) (w1^2 e1^2) … (w1^k1 e1^k1)) … ((wn^1 en^1) (wn^2 en^2) …)    (2.6)
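As a sketch of how a weighted structured query like (2.6) might be serialized for a search engine, the Lucene-style `term^boost` and OR/AND syntax below is an assumption, not taken from the thesis:

```python
def build_weighted_query(candidate_lists):
    # candidate_lists: for each source keyword vi, a list of
    # (translation, weight) pairs produced by the disambiguation step.
    groups = []
    for candidates in candidate_lists:
        # One OR-group per keyword, each translation boosted by its weight.
        groups.append("(" + " OR ".join(f"{term}^{w:g}" for term, w in candidates) + ")")
    return " AND ".join(groups)
```

Grouping the alternatives per keyword preserves the structure of (2.6): each group must match, but any weighted translation inside a group may satisfy it.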
2.4. Experiment with the formula SMI
Table 2.1: Experiment results

  Configuration        P@1    P@5    P@10   MAP    Comparison
1 nMI                  0.497  0.482  0.429  0.436  74.79%
2 SMI                  0.511  0.488  0.447  0.446  76.50%
3 Google Translate     0.489  0.535  0.505  0.499  85.59%
4 Manual translation   0.605  0.605  0.563  0.583  100%
2.5. Experiment with the algorithms creating structured
query
2.5.1. Experimental environment
2.5.2. Experimental configurations
2.5.3. Experiment results
Table 2.2: Comparison of P@k and MAP

  Configuration    P@1    P@5    P@10   MAP    Comp.
1 top_one_ch       0.64   0.48   0.444  0.275  71.24%
2 top_one_sq       0.52   0.472  0.46   0.291  75.39%
3 top_three_ch     0.68   0.528  0.524  0.316  81.87%
4 top_three_sq     0.64   0.552  0.532  0.323  84.55%
5 top_three_all    0.76   0.576  0.54   0.364  94.30%
6 Google           0.64   0.568  0.536  0.349  90.41%
7 Baseline         0.76   0.648  0.696  0.386  100%
2.6. Summary
Chapter 2 presents the author's research on automatic translation techniques used in CLIR. The contributions include two methods for dictionary-based query translation. The first method defines a Summary Mutual Information function for selecting the best translation for each keyword contained in the original query. The second method is based on an algorithm that selects the best translations sequentially.

The formula SMI gives a better result than the Greedy algorithm; however, it does not outperform the Google Translate tool. The algorithm of selecting the best translations sequentially does outperform the Google Translate tool. The condition for applying this algorithm is that the search engine should support structured queries.
CHAPTER 3: TECHNIQUES TO SUPPORT QUERY
TRANSLATION
3.1. Query segmentation
3.1.1. Using the tool vnTagger
3.1.2. The algorithm WLQS
The algorithm WLQS (Word-Length-based Query Segmentation), proposed by the author, splits the query into separate keywords based on the keyword lengths. The idea behind this algorithm is the author's hypothesis: if a compound word exists in the dictionary and contains other words inside, the translation of the compound word tends to be better than the translations of the words inside it.
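The thesis gives no pseudo-code for WLQS; the following is a minimal greedy longest-match sketch consistent with the hypothesis above, in which the maximum compound length of 4 words is an assumption:

```python
def wlqs_segment(query_words, dictionary, max_len=4):
    # Greedy longest-match segmentation: at each position prefer the longest
    # compound present in the dictionary, since by hypothesis a longer
    # dictionary entry translates better than the words it contains.
    keywords, i = [], 0
    while i < len(query_words):
        for n in range(min(max_len, len(query_words) - i), 0, -1):
            compound = " ".join(query_words[i:i + n])
            if n == 1 or compound in dictionary:
                keywords.append(compound)
                i += n
                break
    return keywords
```

Single words always fall through as a segment of length one, so the algorithm never gets stuck on out-of-dictionary terms.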
3.1.3. The combination of WLQS and vnTagger
This section introduces the combination of the WLQS algorithm and the vnTagger tool, consisting of 5 steps: looking words up in dictionaries, assigning labels, removing words fully contained inside another word, removing overlapped words, and adding the remaining tagged words.
3.2. Improving the query in the target language
3.2.1. Pseudo relevance feedback in CLIR
In CLIR, PRF can be applied at different stages: before and/or after the translation process, with the aim of improving query performance.
3.2.2. Refining the structured query in the target language
With the documents returned by the first query, the query term weights are re-calculated to build a new query of the form:

qE' = ((w1'^1 e1^1) (w1'^2 e1^2) … (w1'^k1 e1^k1)) … ((wn'^1 en^1) (wn'^2 en^2) …)
To expand the query, there are 4 different formulas to calculate new weights for the keywords in the query, where D denotes the set of documents returned by the first query and tf(tj, d) the frequency of tj in document d.

The formula FW1:

FW1(tj) = β × Σ_{d ∈ D} tf(tj, d) / |d|    (3.1)

The formula FW2 combines the local tf-idf weight and the idf weight of the keywords:

FW2(tj) = β × Σ_{d ∈ D} (tf(tj, d) / |d|) × log((N + 1) / (Nt + 1))    (3.2)
Here, N is the number of documents in the document collection, Nt is the number of documents containing the term t, and α is a tunable parameter.
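The feedback formulas FW1 and FW2 can be sketched as follows, assuming the reconstructed reading above (length-normalized term frequency summed over the feedback documents, multiplied by a smoothed idf factor for FW2); the parameter name alpha and the document representation as token lists are assumptions.

```python
import math

# Hedged sketch of the feedback-weight formulas FW1 and FW2.

def fw1(term, feedback_docs, alpha=1.0):
    """FW1: alpha * sum over feedback docs of tf(term, d) / |d|."""
    return alpha * sum(d.count(term) / len(d) for d in feedback_docs)

def fw2(term, feedback_docs, n_docs, n_docs_with_term, alpha=1.0):
    """FW2: FW1 multiplied by the smoothed idf factor log((N+1)/(Nt+1))."""
    idf = math.log((n_docs + 1) / (n_docs_with_term + 1))
    return fw1(term, feedback_docs, alpha) * idf

docs = [["cross", "language", "search"], ["language", "model"]]
print(round(fw1("language", docs), 4))  # 1/3 + 1/2 = 0.8333
```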
With a term tj and a keyword qk, mi(tj, qk) is the number of times the two words co-occur within a distance of no more than 3 words. The formula FW3:

w(tj) = α × Σ_{qk∈q} mi(tj, qk)   (3.3)
The formula FW4:

w(tj) = α × Σ_{qk∈q} mi(tj, qk) × log((N + 1) / (Nt + 1))   (3.4)
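The co-occurrence count mi(tj, qk) used by FW3 and FW4 can be sketched directly from its definition; the representation of documents as token lists is an assumption.

```python
# Sketch of the co-occurrence count mi(tj, qk): the number of times the two
# words appear within 3 words of each other across the feedback documents.

def mi_count(term, keyword, docs, max_dist=3):
    count = 0
    for doc in docs:
        t_pos = [i for i, w in enumerate(doc) if w == term]
        k_pos = [i for i, w in enumerate(doc) if w == keyword]
        count += sum(1 for i in t_pos for j in k_pos if abs(i - j) <= max_dist)
    return count

def fw3(term, query_keywords, docs, alpha=1.0):
    """FW3: alpha * sum over query keywords of mi(term, qk)."""
    return alpha * sum(mi_count(term, qk, docs) for qk in query_keywords)

doc = ["web", "search", "ranking", "for", "cross", "language", "search"]
print(mi_count("search", "ranking", [doc]))
# 1  ("search" at positions 1 and 6; only position 1 is within 3 of "ranking")
```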
By adding the p terms with the highest weights, the final query has the following form:

qf = q' ∪ (e1 we1, e2 we2, ..., ep wep)
   = ((t11 w11 OR t12 w12 OR ...) AND ... AND (tn1 wn1 OR tn2 wn2 OR ...)) ∪ (e1 we1, e2 we2, ..., ep wep)   (3.5)

where ti1, ti2, ... are the translation options of vi in the list Li with the weights wi1, wi2, ... respectively; e1, e2, ..., ep are the expansion terms added to the query with the corresponding weights we1, we2, ..., wep.
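A possible construction of such a final structured query is sketched below: the translation options of each keyword are OR-ed with their weights, the groups are AND-ed, and expansion terms are appended. The Solr-style boost syntax (term^weight) is an assumption about the target query language; all names are illustrative.

```python
# Illustrative construction of the final structured query with expansion terms.

def build_structured_query(translation_lists, expansion_terms):
    """translation_lists: per source keyword, a list of (translation, weight);
    expansion_terms: list of (term, weight) added by pseudo-relevance feedback."""
    groups = ["(" + " OR ".join(f"{t}^{w}" for t, w in opts) + ")"
              for opts in translation_lists]
    expansion = [f"{t}^{w}" for t, w in expansion_terms]
    return " AND ".join(groups) + (" " + " ".join(expansion) if expansion else "")

q = build_structured_query(
    [[("computer", 0.9), ("machine", 0.4)], [("science", 1.0)]],
    [("informatics", 0.3)],
)
print(q)
# (computer^0.9 OR machine^0.4) AND (science^1.0) informatics^0.3
```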
3.3. Experiments
The experimental results show that the combination of the query keyword re-weighting algorithm and query expansion helps to improve the precision and recall of the system.
3.4. Summary
The author's contributions presented in chapter 3 include: the query segmentation algorithm, executed in the pre-translation step by combining the WLQS algorithm and the vnTagger tool, and the techniques for re-calculating query term weights and expanding the query in the target language.
CHAPTER 4: RE-RANKING
4.1. Genetic Programming for Learning to Rank
4.1.1. GP-based L2R model
The author uses the OHSUMED dataset to evaluate the learning-to-rank method based on genetic programming. Each chromosome in GP is a ranking function f(q, d) measuring the relevance of a document d to the query q. Four options to create ranking functions are presented:
Option 1: Linear combination of all 45 attributes (the method TF-AF):

f = a1 × f1 + a2 × f2 + ⋯ + a45 × f45   (4.1)

Option 2: Linear combination of a random number of attributes (the method TF-RF):

f = a1 × f1 + a2 × f2 + ⋯ + ak × fk   (4.2)

Option 3: Apply a defined function to each attribute, limited to the functions x, 1/x, sin(x), log(x), and 1/(1 + e^x):

f = a1 × h1(f1) + a2 × h2(f2) + ⋯ + a45 × h45(f45)   (4.3)

Option 4: Create the function TF-GF with a tree structure, keeping all non-linear variants.

In the above formulas, the ai are parameters, the fi are document attributes, and the hi are functions. The fitness function used is the MAP score.
4.1.2. Experiment results
4.1.3. Evaluation
The results show that the methods TF-AF and TF-RF give good results. Their MAP, NDCG@k and P@k values outperform those of the Regression, RankSVM and RankBoost methods, and are comparable, with some advantages, to ListNet and FRank. The method TF-GF gives unsatisfactory results. These results show that using linear functions in the proposed learning-to-rank method can improve system performance.
4.2. The proposed proximity models
The author proposes proximity models, applied in CLIR.
4.2.1. The CL-Büttcher model
              CL-Büttcher   CL-Rasolofo   CL-HighDensity
Join-all      1.71%         7.12%         6.55%
Flat          3.44%         18.32%        14.12%
4.3. Learning to Rank web pages
4.3.1. L2R models
Two learning-to-rank models based on genetic programming are proposed to "learn" a ranking function in the form of a linear combination of basic ranking functions. The first model uses training data containing scores assigned to elements of HTML documents by basic ranking functions, together with labels indicating whether a document is relevant to a query. The second model uses only the scores assigned to elements of HTML documents; it compares the ranking positions given by a candidate ranking function with the ones given by the basic ranking functions.
4.3.2. Chromosome
With a set of basic ranking functions F0, F1, …, Fn, each chromosome has the form of a linear function combining the basic ranking functions:

f(d) = Σ_{i=0}^{n} ai × Fi(d)   (4.7)

where the ai are real numbers and d is the document being scored. The aim of the learning process is to select a function f giving the best ranking performance.
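A chromosome in this form can be sketched as a vector of real coefficients combining basic ranking functions, as in equation (4.7). The basic functions below (BM25 score, title match, proximity) are placeholders, not the thesis's actual feature set.

```python
# Sketch of a chromosome: f(d) = sum_i a_i * F_i(d) over basic ranking functions.

def make_chromosome(coeffs, basic_functions):
    def f(doc):
        return sum(a * F(doc) for a, F in zip(coeffs, basic_functions))
    return f

# Toy basic ranking functions over a document represented as a dict of scores:
F = [lambda d: d["bm25"], lambda d: d["title_match"], lambda d: d["proximity"]]
f = make_chromosome([0.6, 0.3, 0.1], F)
print(round(f({"bm25": 2.0, "title_match": 1.0, "proximity": 4.0}), 2))
# 1.9  (0.6*2.0 + 0.3*1.0 + 0.1*4.0)
```

The genetic search then explores the coefficient vectors, scoring each candidate f with the fitness function below.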
4.3.2.1. Fitness function
The fitness function indicates how good each chromosome is. The fitness function in the proposed supervised L2R model is the MAP score.
Algorithm 4.1: Goodness calculation (supervised)
Input: the candidate function f, a set of queries Q
Output: the goodness level of the function f
begin
    n = 0; sap = 0;
    for each query q in Q do
        n += 1;
        calculate the score of each document assigned by f;
        ap = average precision of the function f;
        sap += ap;
    map = sap / n;
    return map
end
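Algorithm 4.1 amounts to computing the mean average precision of the candidate function over the query set. A minimal sketch, assuming binary relevance labels and a scoring function over documents:

```python
# Sketch of the supervised fitness (Algorithm 4.1): MAP of a candidate ranker.

def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 relevance labels in ranked order."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_fitness(f, queries):
    """queries: list of (docs, labels) pairs; f scores a single document."""
    aps = []
    for docs, labels in queries:
        order = sorted(range(len(docs)), key=lambda i: f(docs[i]), reverse=True)
        aps.append(average_precision([labels[i] for i in order]))
    return sum(aps) / len(aps)

print(round(average_precision([1, 0, 1, 0]), 4))  # (1/1 + 2/3) / 2 = 0.8333
```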
In the unsupervised L2R model, r(i, d, q) is the rank of document d in the search result list for query q using the ranking function Fi; rf(d, q) is the rank of document d in the search result list for query q using the ranking function f. The algorithm is as follows:
Algorithm 4.2: Goodness calculation (unsupervised)
Input: the candidate function f, a set of queries Q
Output: the goodness level of the function f
begin
    s_fit = 0;
    for each query q in Q do
        calculate the score of each document given by f;
        D = the set of top 200 documents;
        k = 0;
        for each document d in D do
            k += 1; d_fit = 0;
            for i = 0 to n do
                d_fit += distance(i, k, q);
            s_fit += d_fit;
    return s_fit
end
The experiments are conducted with three variants of the function distance(i, k, q):
Table 4.3: Variants of the function distance
Variant distance(i,k,q)
1 abs(r(i,d,q)-rf(d,q))
2 abs(r(i,d,q)-rf(d,q))/log(k+1)
3 (r(i,d,q)-rf(d,q))/ k
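The three distance variants compare the rank r(i, d, q) given by the basic function Fi with the rank rf(d, q) given by the candidate f. A minimal sketch, where the ranks and the position k are passed in directly:

```python
import math

# Sketch of the three distance variants from Table 4.3; r_i and r_f are the
# ranks of the document under F_i and under the candidate f, k its position
# in the candidate ranking.

def distance(variant, r_i, r_f, k):
    if variant == 1:
        return abs(r_i - r_f)
    if variant == 2:
        return abs(r_i - r_f) / math.log(k + 1)
    if variant == 3:
        return (r_i - r_f) / k
    raise ValueError(variant)

print(distance(1, 5, 2, 3))            # 3
print(round(distance(2, 5, 2, 3), 4))  # 3 / log(4) = 2.164
```

Variant 2 discounts rank disagreements lower in the list; variant 3 keeps the sign of the disagreement.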
4.3.2.2. The learning process
4.3.3. Experimental environment
4.3.4. Experimental configurations
The learning process is examined with the following
configurations:
Configuration SQ: using the structured query.
Configuration SC: using the result of the supervised L2R algorithm.
Configurations UC1, UC2, UC3: using the result of the unsupervised L2R algorithm with the different variants of the fitness function defined in Table 4.3.
4.3.5. Experiment results
Table 4.4: Experiment results
Configuration MAP
Baseline 0.3742
Google 0.3548
SQ 0.4307
SC 0.4640
UC1 0.4284
UC2 0.4394
UC3 0.4585
4.4. Summary
Given a query in the source language, applying the techniques presented in chapters 2 and 3 allows us to create and optimize a structured query in the target language as the query translation. Chapter 4 inherits these results and presents several techniques, proposed by the author, for re-ranking the list of search results. The author's contributions in this chapter include:
- Extracting and indexing elements of web pages in the search engine to define basic ranking functions;
- Defining the cross-language proximity functions CL-Büttcher, CL-Rasolofo and CL-HighDensity as new basic ranking functions;
- Proposing learning-to-rank models for a cross-language web search system, where the ranking function is "learnt" in the form of a linear combination of basic ranking functions.
The experimental results show that the proposed L2R models help to improve system performance (measured by the MAP score).
CHAPTER 5: VIETNAMESE-ENGLISH CROSS-LANGUAGE
WEB SEARCH SYSTEM
Chapter 5 presents the design of a Vietnamese-English cross-language web search system and the experimental results used to evaluate the effects of the proposed techniques and to compare their performance with other solutions.
5.1. System design
5.1.1. System components
The main components of the system include pre-translation query processing, query translation, query optimization, English search, and re-ranking. These techniques were presented in chapters 2, 3 and 4.
5.1.2. Dictionary data
5.1.3. Data to be indexed
5.2. Evaluation methods
5.3. Experiments with query translation methods
5.3.1. Experimental configurations
Table 5.1: Experimental configurations
Configuration Description
Baseline The query is translated manually
Google The query is translated by the Google
Translate tool
nMI Using the greedy disambiguation algorithm
SMI Using the disambiguation algorithm SMI
Top_one_all Using the algorithm to select translations
sequentially, extract one best translation for
each keyword
Top_three_all Using the algorithm to select translations
sequentially, extract three best translations
for each keyword to create a structured
query translation
Top_three_weight Using the algorithm to select translations
sequentially, extract three best translations
for each keyword to create a structured
query translation with the weight calculated
from the disambiguation process
Top-Three_flat Using the algorithm to select translations
sequentially; create a structured query by
grouping translation options for each
keyword with the operator OR, then
grouping created groups by operator AND
Join-All Create a structured query by grouping all
translation options for each keyword with
the operator OR, then grouping created
groups by operator AND
5.3.3. Experiment results
Table 5.2: Comparison of P@k and MAP
Configuration      P@5    P@10   P@20   MAP     Comp.
Baseline           0.636  0.562  0.514  0.3838  100%
Google             0.616  0.540  0.507  0.3743  97.52%
nMI                0.500  0.464  0.418  0.2690  70.09%
SMI                0.496  0.478  0.427  0.2862  74.57%
Top_one_all        0.560  0.526  0.451  0.3245  84.55%
Top_three_all      0.640  0.582  0.520  0.3924  102.24%
Top_three_weight   0.640  0.592  0.520  0.3988  103.91%
Top-Three_flat     0.592  0.556  0.499  0.3737  97.37%
Join-All           0.612  0.574  0.509  0.3865  100.70%
5.3.4. Evaluation
Among the methods using only the single best translation for each keyword, the configuration SMI gives a better result than the configuration nMI, and the configuration Top_one_all, which renders the translation as a structured query, gives the best result. The solution of creating a structured query gives good results. Among the configurations using the three best translations for each keyword, the configuration Top_three_weight gives the best result.
5.4. Experiments with query optimization
5.4.1. Experimental configurations
Table 5.3: Configurations to evaluate query optimization
Configurations Description
Baseline Manual translation
FW2_Top_three_all Using algorithm Top_three_all for
query translation. Apply the
formula FW2 to modify the query.
FW2_Top_three_weight_A Using algorithm Top_three_weight
for query translation. Apply the
formula FW2 to modify the query.
FW2_Top_three_weight_B Using algorithm Top_three_weight
for query translation. Apply the
formula FW2 to modify the query
without recalculation of keyword
weights.
FW2_Top-Three_flat Using algorithm Top-Three_flat for
query translation. Apply the
formula FW2 to modify the query.
5.4.2. Experiment results
Table 5.4: Comparison
Configuration P@5 P@10 P@20 MAP
Baseline 0.636 0.562 0.514 0.3838
FW2_Top_three_all 0.640 0.586 0.522 0.4261
FW2_Top_three_weight_A 0.644 0.586 0.522 0.4192
FW2_Top_three_weight_B 0.660 0.594 0.535 0.4312
FW2_Top-Three_flat 0.652 0.586 0.520 0.4220
5.4.3. Evaluation
The table of experiment results shows that the application of
query refining have a positive effect on system performance with the
best result obtained by the configuration FW2_Top_three_weight_B,
following by the configuration FW2_Top_three_all.
5.5. Experiments with re-ranking
The proposed re-ranking methods based on genetic programming are evaluated and compared with several other learning-to-rank methods deployed with the RankLib tool.
5.5.1. Experimental Configurations
Table 5.5: Experimental Configurations
Configuration Description
SC-1 Supervised L2R; using query translation
method FW2_Top_three_all
UC3-1 Un-supervised L2R; using query
translation method FW2_Top_three_all
SC-2 Supervised L2R; using query translation
method FW2_Top_three_weight_B
UC3-2 Un-supervised L2R; using query
translation method
FW2_Top_three_weight_B
MART Using RankLib with MART method
Coordinate Ascent Using RankLib with Coordinate Ascent
method
Random Forests Using RankLib with Random Forests
method
5.5.2. Experiment results
The highest average MAP scores are obtained with the two configurations SC-1 and SC-2. These scores are higher than the average MAP scores of the configurations MART, Coordinate Ascent and Random Forests deployed with the RankLib tool. The configurations UC3-1 and UC3-2 give average MAP scores of 0.456 and 0.464, which are good results for unsupervised L2R methods.
5.5.3. Evaluation
The experiments show the effects of the re-ranking methods based on genetic programming. The configurations SC-1 and SC-2 obtain MAP scores of 0.476 and 0.484, which are 123.96% and 126.09% respectively of the manual translation baseline.
The average MAP scores of the configurations UC3-1 and UC3-2, which do not use training data, are 0.456 and 0.464. These scores are 7% and 7.7% higher than those of the configurations FW2_Top_three_all and FW2_Top_three_weight_B respectively.
5.6. Evaluation of proposed techniques
Table 5.6: Evaluation of proposed techniques

Configuration              Description                                        MAP
Baseline                   Using manual translation                           0.384
Google                     Using the Google Translate tool                    0.374
Query translation methods
SMI                        Using the disambiguation method SMI                0.286
Top_one_all                Extract one best translation for each keyword      0.325
Top_three_all              Select translations sequentially, extract the      0.392
                           3 best translations for each keyword and
                           create a structured query
Top_three_weight           Select translations sequentially, extract the      0.399
                           3 best translations for each keyword and
                           create a structured query with weights
                           calculated from the disambiguation process
Query optimization
FW2_Top_three_all          Use the method Top_three_all and apply the         0.427
                           formula FW2 for query expansion
FW2_Top_three_weight_B     Use the method Top_three_weight. Apply the         0.431
                           formula FW2. Do not re-calculate keyword
                           weights
Learning to Rank
UC3 FW2_Top_three_all      Using the ranking function learnt by the           0.456
                           unsupervised method UC3 on the query created
                           by the configuration FW2_Top_three_all
SC FW2_Top_three_all       Using the ranking function learnt by the           0.476
                           supervised method SC on the query created by
                           the configuration FW2_Top_three_all
UC3 FW2_Top_three_weight   Using the ranking function learnt by the           0.464
                           unsupervised method UC3 on the query created
                           by the configuration FW2_Top_three_weight
SC FW2_Top_three_weight    Using the ranking function learnt by the           0.484
                           supervised method SC on the query created by
                           the configuration FW2_Top_three_weight
The methods using only the single best translation for each keyword, SMI and Top_one_all, obtain MAP scores of 0.286 and 0.325, which are 74.48% and 84.64% of the baseline configuration.
With the methods using the 3 best translation options for each keyword to create the structured query, then optimizing the query in the target language and applying learning-to-rank methods, the MAP scores are improved.
5.7. Summary
In chapter 5, an evaluation environment is created to verify the effectiveness of the proposed techniques and to compare the results with other techniques. The summary of results shows that the combined application of the proposed techniques helps to improve system performance (measured by the MAP score).
CONCLUSION AND FUTURE WORKS
1. Conclusion
1.1. Thesis content summary
The thesis presents the author's research results on methods for ranking search results in cross-language web search. The author reviews theories and research in the fields of Information Retrieval, Cross-Language Information Retrieval and re-ranking. On the basis of the processing schema of an IR system, the author introduces a model for ranking web pages in cross-language web search and defines the research contents.
In chapters 2, 3 and 4, the author studies the problems of query processing, automatic translation and re-ranking, and proposes techniques to solve these problems with the aim of improving the ranking performance of cross-language web search systems.
In chapter 5, an evaluation environment is created to verify the effectiveness of the proposed techniques and to compare the results with other techniques. The summary of results shows that the combined application of the proposed techniques helps to improve system performance (measured by the MAP score).
1.2. Results
1.2.1. Theory
The theoretical results proposed by the author include two groups of techniques, which can be applied in the components of a cross-language web search system.
The first group consists of techniques for translation: pre-translation query processing, automatic query translation and query modification in the target language:
- The algorithm WLQS is used in combination with the open-source tool vnTagger and serves as a tool for query segmentation. This algorithm gives a result in the form of groups of keywords and candidate translations.
- The disambiguation algorithms are based on the definition of Mutual Information. The Summary Mutual Information algorithm is defined to select the single best translation for each keyword. The second algorithm selects translations sequentially and creates lists of ordered best translations for each keyword.
- Several methods of building the query translation in the target language. First, the author proposes to create a structured query based on the lists of translation candidates for each keyword. Next, the author uses Pseudo-Relevance Feedback to recalculate keyword weights and expand the query, improving the translated query in the target language.
The second group consists of techniques for re-ranking search result lists in cross-language information retrieval:
- Cross-language proximity models, including two models based on the ideas of the monolingual proximity models of Büttcher and Rasolofo. In the third model, the score is calculated based on the BM25 scoring scheme and limited to sentences containing many query keywords. The cross-language proximity models can be used as basic ranking functions.
- The method for re-ranking web search results. Within the search engine Solr, the documents are analyzed into fields and the contents of each field are indexed. A set of ranking functions is defined as basic ranking functions. The learning-to-rank technique is applied to create a final ranking function to re-rank the search results.
All proposed techniques are integrated as components in a cross-language web search model.
1.2.2. The experiment results
The experiments are conducted to verify the proposed techniques. The results are presented in several scientific articles:
- The experiment applying the algorithm WLQS and the Summary Mutual Information function as a disambiguation solution shows that this technique is better than the popular formula nMI in selecting the best translation for the keywords in a query.
- The experiment with the model in which queries are segmented by the combination of the algorithm WLQS and the tool vnTagger, the disambiguation is then executed by the algorithm selecting translations sequentially, and the final structured query is created in the target language, outperforms the Google Translate tool.
- The experiment applying Pseudo-Relevance Feedback to modify and expand queries in the target language shows that this technique improves system performance in both precision and recall.
- An experiment is conducted to learn a ranking function. From the experiments with the LETOR dataset and with the defined cross-language proximity functions, the author conducts the learning-to-rank experiment for a cross-language web search system to learn ranking functions. The experimental results show that system performance is improved (in the MAP score).
All in all, the techniques proposed by the author help to improve the ranking performance of a cross-language web search system. An important result of the thesis is that combining these techniques as components of a cross-language web search system improves the ranking performance and, in the conducted experiments, even surpasses the method using manual translation.
2. Future works
In addition to the results obtained in the thesis, some issues may be studied further in the future:
- The query processing algorithms presented in the thesis can be sensitive to the language, content and size of queries. Within the limited scope of the thesis, the author has focused on queries in Vietnamese and searched documents written in English. The queries being studied mostly have an average size (5-10 words). Other types of queries should be studied in the future.
- Optimization of query processing. The running time of the pre-translation query processing and disambiguation algorithms should be further optimized in both data organization and algorithms.
- Research on other machine learning algorithms for learning other types of ranking function combinations. A disadvantage of genetic programming is its time cost. Besides, the learning process in the thesis is executed with a limited number of ranking functions. Future work can be done with more ranking functions.
THE PUBLISHED ARTICLES
[1] Giang L.T., Hùng V.T., "Re-ranking methods in merging search results" (in Vietnamese). Tạp chí Khoa học và Công nghệ các trường Đại học Kỹ thuật, vol. 91, pp. 59–64, 2012.
[2] Giang L.T., Hùng V.T., "Applying genetic programming in learning to rank" (in Vietnamese). Tạp chí Khoa học và Công nghệ các trường Đại học Kỹ thuật, vol. 92, pp. 58–63, 2013.
[3] Giang L.T., Hùng V.T., "Experimental evaluation of a multilingual information retrieval model" (in Vietnamese). In: The 6th National Conference on Fundamental and Applied Information Technology Research, pp. 103–107, 2013.
[4] Giang L.T., Hung V.T., Phap H.C., "Building Evaluation Dataset in Vietnamese Information Retrieval". Journal of Science and Technology, Danang University, vol. 12, no. 1, pp. 37–41, 2013.
[5] Giang L.T., Hung V.T., Phap H.C., "Experiments with query translation and re-ranking methods in Vietnamese-English bilingual information retrieval". In: Proceedings of the Fourth Symposium on Information and Communication Technology, pp. 118–122, 2013.
[6] Giang L.T., Hung V.T., Phap H.C., "Building Structured Query in Target Language for Vietnamese-English Cross Language Information Retrieval Systems". International Journal of Engineering Research & Technology (IJERT), vol. 4, no. 04, pp. 146–151, 2015.
[7] Giang L.T., Hung V.T., Phap H.C., "Improve Cross Language Information Retrieval with Pseudo-Relevance Feedback". In: FAIR 2015, pp. 315–320, 2015.
[8] Giang L.T., Hung V.T., Phap H.C., "Building proximity models for Cross Language Information Retrieval". Issue on Information and Communication Technology, University of Danang, vol. 1, no. 1, pp. 8–12, 2015.
[9] Giang L.T., Hùng V.T., Pháp H.C., "Applying machine learning based on genetic programming in cross-language web search" (in Vietnamese). Tạp chí Khoa học và Công nghệ, Đại học Đà Nẵng, vol. 1, no. 98, pp. 93–97, 2016.