TF-IDF is one of the most popular term-weighting schemes, and is applied by search engines, recommender systems, and user modeling engines. With regard to user modeling and recommender systems, we see two shortcomings of TF-IDF. First, calculating IDF requires access to the document corpus from which recommendations are made. Such access is not always available in a user-modeling or recommender system. Second, TF-IDF ignores information from a user’s personal document collection, which could – so we hypothesize – enhance the user modeling process. In this paper, we introduce TF-IDuF as a term-weighting scheme that does not require access to the general document corpus and that considers information from the users’ personal document collections. We evaluated the effectiveness of TF-IDuF compared to TF-IDF and TF-Only and found that TF-IDF and TF-IDuF perform similarly (click-through rates (CTR) of 5.09% vs. 5.14%), and both are around 25% more effective than TF-Only (CTR of 4.06%) for recommending research papers. Consequently, we conclude that TF-IDuF could be a promising term-weighting scheme, especially when access to the document corpus for recommendations is not possible, and thus classic IDF cannot be computed. It is also notable that TF-IDuF and TF-IDF are not mutually exclusive, so both metrics could be combined into a more effective term-weighting scheme.
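To make the difference concrete, here is a minimal sketch (assuming a naive whitespace tokenizer and small illustrative document lists, not the authors' implementation) of how the two weights could be computed: TF-IDF takes its IDF component from the recommendation corpus, whereas TF-IDuF takes it from the user's personal document collection.

import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def idf(term, documents):
    # Inverse document frequency over an arbitrary document collection.
    n_containing = sum(1 for doc in documents if term in tokenize(doc))
    return math.log(len(documents) / (1 + n_containing))

def term_weights(user_doc, corpus_docs, personal_docs):
    tf = Counter(tokenize(user_doc))
    weights = {}
    for term, freq in tf.items():
        weights[term] = {
            "tf_only": freq,
            "tf_idf":  freq * idf(term, corpus_docs),    # needs the recommendation corpus
            "tf_iduf": freq * idf(term, personal_docs),  # needs only the user's own documents
        }
    return weights

# Hypothetical example data
corpus_docs   = ["neural recommender systems", "term weighting for search engines"]
personal_docs = ["my notes on recommender systems", "draft paper on user modeling"]
print(term_weights("user modeling for recommender systems", corpus_docs, personal_docs))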
Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse... – Joeran Beel
In the domain of academic search engines and research-paper recommender systems, CC-IDF is a common citation-weighting scheme that is used to calculate semantic relatedness between documents. CC-IDF adopts the principles of the popular term-weighting scheme TF-IDF and assumes that if a rare academic citation is shared by two documents, then this occurrence should receive a higher weight than if the citation is shared among a large number of documents. Although CC-IDF is in common use, we found no empirical evaluation and comparison of CC-IDF with plain citation weight (CC-Only). Therefore, we conducted such an evaluation and present the results in this paper. The evaluation was conducted with real users of the recommender system Docear. The effectiveness of CC-IDF and CC-Only was measured using click-through rate (CTR). For 238,681 delivered recommendations, CC-IDF had about the same effectiveness as CC-Only (CTR of 6.15% vs. 6.23%). In other words, CC-IDF was not more effective than CC-Only, which is a surprising result. We provide a number of potential reasons and suggest conducting further research to understand the principles of CC-IDF in more detail.
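As a rough illustration of the idea only (not the exact formula used by Docear), a shared citation can be weighted by the inverse of how many documents in the collection cite it, in analogy to IDF over terms; a hedged sketch:

import math

def cc_idf_weight(citation, citing_sets):
    # citing_sets: list of sets, each set holding the citations of one document.
    n_docs = len(citing_sets)
    n_citing = sum(1 for refs in citing_sets if citation in refs)
    # Rare citations get a high weight, ubiquitous ones a low weight (IDF-style).
    return math.log(n_docs / (1 + n_citing))

def cc_idf_similarity(refs_a, refs_b, citing_sets):
    # Relatedness of two documents: sum of weights of their shared citations.
    # CC-Only would simply be len(refs_a & refs_b).
    return sum(cc_idf_weight(c, citing_sets) for c in refs_a & refs_b)

# Hypothetical citation data
docs = [{"paper1", "paper2"}, {"paper2", "paper3"}, {"paper1", "paper4"}]
print(cc_idf_similarity(docs[0], docs[2], docs))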
The influence of information security on... – IJCNCJournal
This document summarizes a research study on how information security influences the adoption of cloud computing. The study utilized surveys of IT managers and directors to examine how their perceptions of security, cost-effectiveness, and compliance impact decisions to adopt cloud computing. The results of the multiple linear regression analysis showed that management's perception of cost-effectiveness more significantly correlates to their decision to adopt cloud computing than does their perception of security. The document provides background on cloud computing models and adoption theories to help explain the context and methodology of the research study.
A systems engineering methodology for wide area network selection – Alexander Decker
This document describes a study that applies the Analytic Hierarchy Process (AHP) to help a company select the best wide area network (WAN) solution based on their requirements. The document provides background on AHP and reviews related literature on using multi-criteria decision making for selection problems. It then outlines the steps of AHP, including constructing a hierarchy, performing pairwise comparisons, and calculating weights and consistency. Finally, it describes how AHP could be applied to help the hypothetical company evaluate WAN alternatives and select the optimal solution.
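For illustration, AHP priority weights are commonly derived from the principal eigenvector of the pairwise-comparison matrix, with consistency checked against Saaty's random index; a minimal sketch using an assumed 3x3 comparison of hypothetical WAN criteria (cost, reliability, bandwidth), not the study's actual data:

import numpy as np

# Hypothetical pairwise comparisons on Saaty's 1-9 scale: cost vs reliability vs bandwidth
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)                 # principal eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                    # normalized priority weights

n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)        # consistency index
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]         # Saaty's random index for n criteria
print("weights:", weights, "consistency ratio:", ci / ri)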
Clinical Decision Support Systems (CDSS) were explicitly introduced in the 1990s with the aim of providing knowledge to clinicians in order to influence their decisions and, therefore, improve patients’ health care. There are different architectural approaches for implementing CDSS. Some of these approaches are based on cloud computing, which provides on-demand computing resources over the internet. The goal of this paper is to determine and discuss key issues and approaches involving architectural designs in implementing a CDSS using cloud computing. To this end, we performed a standard Systematic Literature Review (SLR) of primary studies showing the intervention of cloud computing in CDSS implementations. Twenty-one primary studies were reviewed. We found that CDSS architectural components are similar in most of the studies. Cloud-based CDSS are most used in Home Healthcare and Emergency Medical Systems. Alerts/Reminders and Knowledge Service are the most common implementations. Major challenges concern security, performance, and compatibility. We conclude that implementing a cloud-based CDSS is beneficial since it allows cost-efficient, ubiquitous, and elastic computing resources. We highlight that some studies show weaknesses regarding the conceptualization of a cloud-based computing approach and lack a formal methodology in the architectural design process.
Data reduction techniques to analyze NSL-KDD dataset – IAEME Publication
The document discusses applying data reduction techniques to the NSL-KDD dataset to analyze network intrusion detection data. It describes how data reduction can minimize data size without losing important information. The document applies several data reduction algorithms to the NSL-KDD dataset and uses the output to train and test two classification algorithms, J48 and Naive Bayes. The results are compared based on accuracy, specificity, and sensitivity to determine which data reduction technique improves classification performance the most. The goal is to find an effective and efficient way to analyze large network intrusion detection datasets using data reduction and machine learning.
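A hedged sketch of this kind of pipeline, using scikit-learn stand-ins (a decision tree in place of Weka's J48, which implements C4.5, and PCA as one possible reduction step) and random placeholder data instead of the actual NSL-KDD features:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for the numeric NSL-KDD feature matrix and attack labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 41)), rng.integers(0, 2, size=1000)

# Data reduction: project the 41 features onto a smaller number of components
X_reduced = PCA(n_components=10).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_reduced, y, test_size=0.3, random_state=0)

for name, clf in [("decision tree (J48-like)", DecisionTreeClassifier()),
                  ("naive Bayes", GaussianNB())]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", accuracy_score(y_te, clf.predict(X_te)))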
Applying Soft Computing Techniques in Information Retrieval – IJAEMSJORNAL
There is a plethora of information available on the internet on a daily basis, and retrieving meaningful, effective information using conventional IR methods is becoming a cumbersome task. Hence, this paper summarizes the different soft computing techniques that can be applied to information retrieval systems to improve their efficiency in acquiring knowledge related to a user’s query.
Virtual Machine Allocation Policy in Cloud Computing Environment using CloudSim – IJECEIAES
This document discusses virtual machine allocation policies in cloud computing environments using the CloudSim simulation tool. It begins with an introduction to cloud computing and discusses challenges related to resource management and energy consumption. It then reviews previous research on modeling approaches, energy optimization techniques, and network topologies. A UML class model is presented for analyzing energy consumption when accessing cloud servers arranged in a step network topology. The methodology section outlines how energy consumption by system components like processors, RAM, hard disks, and motherboards will be calculated. Simulation results will depict response times and cost details for different data center configurations and allocation policies.
Biometric retrieval is a challenging task as the size of biometric databases has increased considerably. In this work, a novel optimized kd-tree algorithm is implemented to enhance the efficiency of indexing and retrieval for a multibiometric database comprising iris and fingerprint data. To improve retrieval performance, the fingerprint image is represented by minutiae features and the iris image by texture features, and the features are fused together by feature-level fusion. Dimension reduction of the feature vector is carried out using Principal Component Analysis to reduce the storage space and increase the retrieval rate. The proposed optimized kd-tree indexing technique with dimension reduction aims to overcome the limitations of the existing nearest kd-tree. From the experimental results, it is concluded that the proposed optimized kd-tree indexing algorithm with dimension reduction reduces the False Acceptance Rate and False Rejection Rate and improves the hit rate to 95% at a 60% penetration rate compared to the existing nearest kd-tree technique for a multibiometric database.
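A minimal sketch of the reduce-then-index idea under stated assumptions (fused fingerprint and iris feature vectors already available as rows of a matrix), using PCA for dimension reduction and SciPy's cKDTree for indexing rather than the authors' optimized kd-tree:

import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

# Hypothetical fused (minutiae + iris texture) feature vectors, one row per enrolled subject
rng = np.random.default_rng(1)
gallery = rng.normal(size=(500, 128))

pca = PCA(n_components=32).fit(gallery)   # dimension reduction cuts storage and search time
tree = cKDTree(pca.transform(gallery))    # kd-tree index over the reduced vectors

probe = rng.normal(size=(1, 128))         # query sample
dist, idx = tree.query(pca.transform(probe), k=5)
print("closest gallery identities:", idx[0], "distances:", dist[0])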
Text pre-processing of multilingual for sentiment analysis based on social ne... – IJECEIAES
Sentiment analysis (SA) is an enduring area of research, especially in the field of text analysis. Text pre-processing is an important aspect of performing SA accurately. This paper presents a text-processing model for SA, using natural language processing techniques for Twitter data. The basic phases for machine learning are text collection, text cleaning, pre-processing, feature extraction, and then categorizing the data according to the SA techniques. Keeping the focus on Twitter data, the data is extracted in a domain-specific manner. In the data-cleaning phase, noisy data, missing data, punctuation, tags, and emoticons are considered. For pre-processing, tokenization is performed, followed by stop word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset. The accuracy of classification techniques improves after applying text pre-processing, and dimensionality is reduced. The proposed corpus can be utilized in the areas of market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as the baseline for applying predictive analysis, machine learning, and deep learning algorithms, which can be extended according to the problem definition.
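A hedged sketch of the cleaning and pre-processing phases described above (not the authors' exact pipeline), applied to a hypothetical tweet; the stop-word list here is a tiny illustrative subset:

import re

STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in", "on", "my", "for"}

def preprocess_tweet(text):
    text = re.sub(r"http\S+|@\w+|#", "", text)        # strip URLs, mentions, hashtag signs
    text = re.sub(r"[^\w\s]", " ", text.lower())      # drop punctuation and emoticon characters
    tokens = text.split()                             # tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stop word removal (SWR)

print(preprocess_tweet("Loving the new phone!!! https://example.com @brand #happy"))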
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo... – IRJET Journal
This document proposes a methodology to automatically assign topics to unlabeled datasets using topic modeling techniques. It applies latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) with term frequency-inverse document frequency (TF-IDF) weighting to product reviews to generate topics. Word similarities are used to cluster words for each topic. Sentiment analysis and word clouds are also used to gain insights. The methodology successfully converts unlabeled to labeled data and provides automatic topic labeling to facilitate further research and opportunity discovery.
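A minimal sketch of the topic-generation step under stated assumptions (a tiny list of hypothetical product reviews): NMF is fitted on TF-IDF features and LDA on raw counts, which is the usual pairing in scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

reviews = ["battery life is great but the screen scratches easily",
           "fast delivery, packaging was damaged though",
           "screen quality is excellent and battery lasts long",
           "customer service ignored my damaged packaging complaint"]

tfidf = TfidfVectorizer(stop_words="english")
counts = CountVectorizer(stop_words="english")

nmf = NMF(n_components=2, random_state=0).fit(tfidf.fit_transform(reviews))
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts.fit_transform(reviews))

def top_words(model, feature_names, n=4):
    # Highest-weighted words per topic, a simple stand-in for automatic topic labeling.
    return [[feature_names[i] for i in topic.argsort()[-n:]] for topic in model.components_]

print("NMF topics:", top_words(nmf, tfidf.get_feature_names_out()))
print("LDA topics:", top_words(lda, counts.get_feature_names_out()))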
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ... – IJECEIAES
Internet pricing is frequently a major problem in optimization. In this study, the internet pricing scheme focuses on optimizing bandwidth consumption. The research utilizes a modification of the cloud model to find an optimal solution in the network. Cloud computing is a computational model in which resources such as networks, servers, storage, and services are delivered over an internet connection. Internet service providers (ISPs) require appropriate pricing schemes in order to maximize revenue and provide quality of service (QoS) that satisfies internet users. The model is solved with the help of the LINGO software package to obtain an optimal solution and accurate results. Based on the optimal solution obtained from the modified cloud model, ISPs can maximize revenue and provide services in accordance with needs and requests.
IRJET- Automated Document Summarization and Classification using Deep Lear... – IRJET Journal
The document proposes a system that uses deep learning methods for automated document summarization and classification. It uses a recurrent convolutional neural network (RCNN) which combines a convolutional neural network and recurrent neural network to build a robust classifier model. For summarization, it employs a graph-based method inspired by PageRank to extract the top 20% of sentences from a document based on word intersections. The RCNN model achieved over 97% accuracy on classifying documents from various domains using their summaries. The system aims to speed up classification and make it more intuitive using automated summarization techniques with deep learning.
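A hedged sketch of the graph-based summarization step (not the paper's exact implementation): sentences are nodes, edge weights come from word intersections, and a PageRank-style score selects the top fraction of sentences.

import numpy as np

def summarize(sentences, ratio=0.2, damping=0.85, iters=50):
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weight: word intersection, normalized by sentence lengths
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = len(words[i] & words[j]) / (len(words[i]) + len(words[j]))
    W = W / np.maximum(W.sum(axis=0, keepdims=True), 1e-9)   # column-normalize
    score = np.ones(n) / n
    for _ in range(iters):                                    # power iteration (PageRank-style)
        score = (1 - damping) / n + damping * W @ score
    k = max(1, int(round(ratio * n)))
    keep = sorted(np.argsort(score)[-k:])                     # keep top sentences in original order
    return [sentences[i] for i in keep]

doc = ["Deep learning models need large datasets.",
       "Summarization selects the most informative sentences.",
       "Sentences sharing many words reinforce each other.",
       "The classifier is trained on the generated summaries.",
       "Large datasets also help the classifier generalize."]
print(summarize(doc, ratio=0.4))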
An Iterative Model as a Tool in Optimal Allocation of Resources in University... – Dr. Amarjeet Singh
In this paper, a study was carried out to aid in adequate allocation of resources in the College of Natural Sciences, TYZ University (not the real name, for ethical reasons). Questionnaires were administered to the high-ranking officials of one of the Colleges, the College of Pure and Applied Sciences, to examine how resources were allocated for three consecutive sessions (2009/2010, 2010/2011, and 2011/2012). The data gathered were then analysed to generate contributory inputs for the three basic outputs (variables) formed for the purpose of the study. These variables are: x1, which represents the quality of graduates produced; x2, which stands for research papers, seminars, journal articles, etc. published by faculties; and x3, which denotes service delivery within the three sessions under study. The Simplex Method of Linear Programming was used to solve the model formulated.
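As an illustration only (the coefficients below are hypothetical, not the study's data), a resource-allocation model of this shape can be solved with SciPy's linear-programming routine, whose HiGHS backend includes a simplex-family solver:

from scipy.optimize import linprog

# Maximize z = 5*x1 + 4*x2 + 3*x3  (quality of graduates, publications, service delivery)
# subject to hypothetical resource constraints; linprog minimizes, so negate the objective.
c = [-5, -4, -3]
A_ub = [[2, 3, 1],   # staff hours
        [4, 1, 2],   # budget units
        [3, 4, 2]]   # facility capacity
b_ub = [5, 11, 8]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
print("optimal allocation x1, x2, x3:", res.x, "objective value:", -res.fun)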
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug... – IJECEIAES
Broadcasting is a well-known operation used to support different computing protocols in cloud computing. Attaining energy efficiency is one of the prominent challenges in the cloud scheduling process, as there are fixed limits that the system has to meet. In this research paper, we focus in particular on the cloud server maintenance and scheduling process, and to do so we use an interactive broadcasting, energy-efficient computing technique along with the cloud computing server. Additionally, the remote host machines used for cloud services dissipate more power and thereby consume more and more energy. Power consumption is one of the main factors determining the cost of computing resources. The idea is to use avoidance technology to assign data center resources dynamically, depending on application demands, thereby supporting cloud computing through optimization of the servers in use.
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net... – IJECEIAES
In a Wireless Sensor Network (WSN), data from different sensor nodes is collected at an aggregator node, typically via simple procedures such as averaging, owing to limited computational power and energy resources. However, such aggregation is known to be highly susceptible to node-compromising attacks, and these approaches are especially prone to attack because WSNs usually lack tamper-resistant hardware. Thus, assessing the veracity of data and the reputation of sensor nodes is critical for wireless sensor networks. As the performance of low-power processors increases dramatically, future aggregator nodes will be capable of running more sophisticated data aggregation algorithms, making WSNs less vulnerable. Iterative filtering algorithms hold great promise for such a purpose: by assigning a matching weight to the information delivered by each source, such algorithms simultaneously aggregate data from several sources and provide a trust assessment of those sources. Although significantly more robust against collusion attacks than simple averaging techniques, they are still vulnerable to a newly introduced, sophisticated attack. The existing literature is surveyed in this paper to study iterative filtering techniques, and a detailed comparison is provided. At the end of the paper, a new improved iterative filtering technique is proposed, based on the literature survey and the drawbacks found in the literature.
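To illustrate the core idea of iterative filtering in this setting (a simplified sketch, not the improved algorithm the paper proposes): each sensor's readings are weighted by a trust score, the aggregate is re-estimated, and trust is recomputed from each sensor's distance to that aggregate until the values stabilize.

import numpy as np

def iterative_filtering(readings, iters=20, eps=1e-9):
    # readings: matrix of shape (n_sensors, n_samples) collected at the aggregator node
    n_sensors = readings.shape[0]
    trust = np.ones(n_sensors) / n_sensors
    for _ in range(iters):
        estimate = trust @ readings                   # trust-weighted aggregate per sample
        errors = ((readings - estimate) ** 2).mean(axis=1)
        trust = 1.0 / (errors + eps)                  # low error -> high trust
        trust /= trust.sum()
    return estimate, trust

# Hypothetical data: 5 honest sensors plus one compromised sensor reporting skewed values
rng = np.random.default_rng(2)
honest = 25 + rng.normal(0, 0.5, size=(5, 100))
compromised = 40 + rng.normal(0, 0.5, size=(1, 100))
estimate, trust = iterative_filtering(np.vstack([honest, compromised]))
print("mean estimate:", estimate.mean(), "trust per sensor:", np.round(trust, 3))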
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ... – Israel Edem
This presentation discusses fog computing and big data. It introduces the 5 V's of big data (volume, velocity, variety, veracity, value) and outlines a framework for managing big data that includes data preprocessing, clustering, feature extraction, classification, data mining, and visualization. It contrasts datasets, which are fixed, with data streams, which have continuous high velocity. Bio-inspired algorithms are presented as a way to process big data. Fog/edge computing is discussed as a solution to issues with processing big data solely in the cloud. A key challenge of fog computing is ensuring data quality given the 5V's, and a proposed solution is a quality-of-use framework that considers speed, size, and type of
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE – ijesajournal
Automation is a powerful concept that is present everywhere; without automation, applications will not get developed. In the semiconductor industry, artificial intelligence has played a vital role in implementing chip-based design through automation. The main advantage of applying machine learning and deep learning techniques is to improve the implementation rate. The main objective of the proposed system is to apply deep learning using a data-driven approach for controlling the system. This leads to improvements in design, delay, speed of operation, and cost. Through this system, the huge volume of data generated by the system is also kept under control.
Presentation on the work we've done within BeSTGRID as it relates to bioinformatics in NZ, for the 2010 Bioinformatics Symposium https://www.bestgrid.org/NZ-Bioinformatics-Symposium-2010
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I... – ijaia
In the era of the fourth industrial revolution, measuring and ensuring the reliability, efficiency, and safety of industrial systems and components is one of the uppermost key concerns. In addition, predicting the performance degradation or remaining useful life (RUL) of equipment over time based on its historical sensor data enables companies to greatly reduce their maintenance costs. In this way, companies can prevent costly unexpected breakdowns and become more profitable and competitive in the marketplace. This paper introduces a deep learning-based method combining CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) neural networks to predict RUL for industrial equipment. The proposed method does not depend on any degradation-trend assumptions, and it can learn complex, temporally representative, and distinguishing patterns in the sensor data. In order to evaluate the efficiency and effectiveness of the proposed method, we evaluated it in two different experiments: RUL estimation and predicting the status of IoT devices over a 2-week period. Experiments are conducted on the publicly available NASA turbofan-engine dataset. Based on the experimental results, the deep learning-based approach achieved high prediction accuracy. Moreover, the results show that the method outperforms standard, well-accepted machine learning algorithms and accomplishes competitive performance when compared to state-of-the-art methods.
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm – IRJET Journal
This document proposes a peer-to-peer data sharing and deduplication system using genetic algorithms. The system would allow organizations in a corporate network to share data by registering with a P2P service provider and launching peer instances. It addresses challenges of scalability, performance, and security for inter-organizational data sharing. The system integrates cloud computing, databases, and P2P technologies. It uses genetic algorithms for deduplication to reduce redundant data storage. The system is intended to provide flexible, scalable, and cost-effective data sharing services for corporate networks based on a pay-as-you-go model.
Green computing on Consumer's buying behavior – Shibly Ahamed
Green computing, also called green technology, is the environmentally responsible use of computers and related resources. Such practices include the implementation of energy-efficient central processing units (CPUs), servers and peripherals as well as reduced resource consumption and proper disposal of electronic waste (e-waste).
Data mining refers to extracting hidden predictive information from huge data sets. Recently, a number of private institutions have come into existence, and they put considerable effort into securing fruitful admissions. In this paper, data mining techniques are used to analyze the mindset of students after matriculation. One of the best data mining tools, WEKA (Waikato Environment for Knowledge Analysis), is used to carry out the analysis.
IRJET- Diverse Approaches for Document Clustering in Product Development Anal... – IRJET Journal
This document discusses several approaches for clustering textual documents, including:
1. TF-IDF, word embedding, and K-means clustering are proposed to automatically classify and organize documents.
2. Previous work on document clustering is reviewed, including partition-based techniques like K-means and K-medoids, hierarchical clustering, and approaches using semantic features, PSO optimization, and multi-view clustering.
3. Challenges of clustering large document collections at scale are discussed, along with potential solutions using frameworks like Hadoop.
Australia's Environmental Predictive Capability – TERN Australia
Federating world-leading research, data and technical capabilities to create Australia’s National Environmental Prediction System (NEPS).
Community consultation presentation.
3-12 February 2020
Dr Michelle Barker (Facilitator)
(Presentation v5)
This document presents a proposed approach for detecting suggestions in user reviews using a gated recurrent unit (GRU) network and a convolutional neural network (CNN). The methodology uses a CNN to generate text embeddings, followed by GRU and CNN layers to classify reviews as containing a suggestion or not. The approach is evaluated on a benchmark dataset, achieving an F1-score of 0.5806 and outperforming other methods. Future work could leverage external knowledge and handle ambiguous or short reviews.
IRJET - Mobile Chatbot for Information Search – IRJET Journal
This document summarizes a research paper on developing a mobile chatbot using IBM Watson services to allow students to search for their exam scores. The chatbot uses Watson Assistant for natural language processing, a SQL database as a knowledge base to store score information, and text-to-speech and speech-to-text for input and output. It was built with Android Studio and Java to provide an intuitive mobile interface for users to interact with the chatbot.
IRJET- Methodologies used on News Articles: A Survey – IRJET Journal
This document provides a survey of various methodologies used for analyzing news articles, including classification, clustering, sentiment analysis, data visualization, and text summarization. It discusses how researchers have applied techniques like machine learning algorithms, natural language processing, and deep learning to perform tasks on news data. Classification involves categorizing articles by domain or region using methods such as naive Bayes, support vector machines, and neural networks. Clustering groups similar articles together to reduce intra-cluster distance. Sentiment analysis determines opinion or credibility of articles using techniques including neural networks and naive Bayes. Data visualization represents news data graphically to help predict relationships or trends. Text summarization reduces article length through techniques like word stemming. The survey concludes by discussing the scope for
Intelligent Document Processing in Healthcare. Choosing the Right Solutions. – Provectus
Healthcare organizations generate piles of documents and forms in different formats, making it difficult to achieve operational excellence and streamline business processes. Manual entry and OCR are no longer viable, and healthcare entities are looking for new solutions to handle documents.
In this presentation you can learn about:
- Healthcare document types and use cases
- IDP framework: building blocks for document processing solutions
- The document processing market landscape
- Methodology for solution evaluation: comparing apples to apples
Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
GenerativeAI and Automation - IEEE ACSOS 2023.pptx – Allen Chan
Generative AI has been rapidly evolving, enabling different and more sophisticated interactions with Large Language Models (LLMs) like those available in IBM watsonx.ai or Meta Llama2. In this session, we will take a use case based approach to look at how we can leverage LLMs together with existing automation technologies like Workflow, Content Management, and Decisions to enable new solutions.
IRJET- PDF Extraction using Data Mining Techniques – IRJET Journal
This document discusses techniques for extracting information from PDF documents using data mining. It presents a proposed system that would allow users to upload a PDF file and receive a summarized output of the most important information from the file. The system is intended to reduce the time needed to understand large documents by automatically identifying and presenting the key points. The conclusion states that the proposed web application would implement text summarization using clustering and diversity-based methods to generate a summary preserving the overall meaning while removing redundancy.
This document provides an overview of information retrieval systems. It defines key concepts such as data, information, and knowledge. It describes the components of an information retrieval system including the system, users, and documents. It discusses different models for information retrieval including vector space and probabilistic models. It also covers techniques for improving retrieval effectiveness such as relevance feedback and using term frequency-inverse document frequency to assign weights. The document outlines two main approaches to information retrieval - indexing and retrieval.
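A minimal sketch of the vector space model mentioned above: documents and the query are represented as TF-IDF vectors, and relevance is ranked by cosine similarity (the corpus here is an illustrative toy example).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval systems rank documents",
        "relevance feedback improves retrieval effectiveness",
        "probabilistic models estimate document relevance"]
query = ["how does relevance feedback improve retrieval"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)      # index the collection
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = scores.argsort()[::-1]                  # best match first
print([(docs[i], round(scores[i], 3)) for i in ranking])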
IRJET- Determining Document Relevance using Keyword Extraction – IRJET Journal
This document describes a system that aims to search for and retrieve relevant documents from a large collection based on a user's query. It does this through three main components: keyword extraction, document searching, and a question answering bot. Keyword extraction is done using the TF-IDF algorithm to identify important words in documents. These keywords are stored in a database along with their TF-IDF weights. When a user submits a query, the system searches for documents containing keywords from the query and returns relevant results. It also includes a feedback mechanism for users to improve search accuracy over time. The goal is to deliver accurate search results quickly from large document collections.
IoT Processing Topologies and Types: Data Format, Importance of Processing in IoT, Processing Topologies, IoT Device Design and Selection Considerations, Processing Offloading.
The document is an introduction to a series on document understanding presented by Mukesh Kala. It discusses what documents are, different types of documents including structured, semi-structured, and unstructured documents. It then covers topics like rule-based and model-based data extraction, optical character recognition, challenges in document understanding, and the document understanding framework which involves taxonomy, digitization, classification, extraction, validation, and training steps.
A service oriented architecture (SOA) organizes software into business services that are network accessible and executable. Key characteristics include quality of service specifications, discoverable services and data catalogs, and use of industry standards. A SOA breaks up monolithic systems into reusable components called services that can be more easily maintained and replaced. Implementing a SOA requires organizing infrastructure, data, security, computing, communication, and application services to maximize reuse across the enterprise.
The document discusses decision trees and the ID3 algorithm. It provides an overview of data mining techniques, including decision trees. It then describes the ID3 algorithm in detail, including how it uses information gain to build decision trees top-down and recursively to classify data. An example of applying the ID3 algorithm to a sample dataset is also provided to illustrate the step-by-step process.
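To make the key step explicit, here is a small sketch of the information-gain computation at the heart of ID3, on toy play-tennis-style data (the attributes and labels are assumptions for illustration):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    # rows: list of dicts of attribute values; ID3 splits on the attribute with maximal gain.
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Hypothetical training data
rows = [{"outlook": "sunny", "windy": False}, {"outlook": "rain", "windy": True},
        {"outlook": "overcast", "windy": False}, {"outlook": "rain", "windy": False}]
labels = ["no", "no", "yes", "yes"]
print("gain(outlook):", information_gain(rows, labels, "outlook"))
print("gain(windy):", information_gain(rows, labels, "windy"))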
IRJET- Conextualization: Generalization and Empowering Content Domain – IRJET Journal
This document discusses contextualization in artificial intelligence and describes several key concepts:
1. Contextualization is important for AI to understand complex user decisions and preferences based on natural language. Various algorithms will be developed to create and manage contexts using graph data structures and decision theory.
2. Natural language processing techniques like natural language understanding and natural language generation are discussed which allow AI systems to understand and generate human languages.
3. Personalization is described where systems learn individual user traits and preferences from their requests to provide customized responses and recommendations.
4. The paper concludes that a contextualization-based system will be developed using various graph algorithms to create a generalized system for decision making.
The document discusses Quiterian, a data mining and predictive analysis platform that helps companies get more value from data sooner, anticipate the future to react earlier, and empower users while reducing IT costs. It provides fast data loading and exploration without limits, dynamic analysis and predictive modeling techniques, and easy report publishing and distribution. A typical implementation takes less than a month and requires minimal IT resources. Quiterian has been used by leading organizations in various industries.
This document provides an overview of topics related to data and analytics for IoT. It discusses structured vs unstructured data, data in motion vs data at rest, and different types of data analysis including descriptive, diagnostic, predictive, and prescriptive. It also covers machine learning techniques including supervised learning methods like regression and classification, as well as unsupervised learning methods like clustering and association. Popular algorithms for each are listed. Challenges of analyzing IoT data like scaling issues and data volatility are also addressed.
An Analysis on Query Optimization in Distributed Database – Editor IJMTER
The query optimizer is a significant element in today's relational database management systems. This element is responsible for translating a user-submitted query, commonly written in a non-procedural language, into an efficient query evaluation program that can be executed against the database. This research paper describes the architecture and steps of query processing, along with optimization time and memory usage. The key goal of this paper is to understand the basic query optimization process and its architecture.
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl... – Dr. Haxel Consult
Customers interested in Language Analytics solutions typically approach us with a broad range of business cases and specific business needs. Especially when it comes to the data available for their case and for any AI aspects involved, the variation in data types, data quality and data quantity is, by our experience, quite vast and at the same time so critical for a project's success, that we often start our requirements analysis right there: at the data. At Karakun, our Language Analytics team addresses this in an increasingly flexible way: We select from a set of Language Analytics tools and related services (e.g. data cleansing and data procurement) to meet the business needs at hand with the data available or at least in reach – at reasonable costs.
The methodology stack ranges from heuristic logic through statistical solutions to neural networks. At the same time, we aim at reducing the amount of data needed for such training, e.g. by integrating state-of-the-art neural technologies into our platform. That way, SMEs and their specific business cases can also benefit from the full range of Language Analytics options.
To illustrate our approach, we will present an e-Safe solution which allows for semantic document tagging and search in highly secured virtual safes. In addition, our solution provides text-based triggers for complex workflows depending on the safe’s content.
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...inside-BigData.com
In this Deck from the 2018 Swiss HPC Conference, Dave Turek from IBM presents: The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big Data.
"There is a shift underway where HPC is beginning to be addressed with novel techniques and technologies including cognitive and analytic approaches to HPC problems and the arrival of the first quantum systems. This talk will showcase how IBM is merging cognitive, analytics, and quantum with classic simulation and modeling to create a new path for computational science."
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013SALCTG
An overview of Research Data Management: the research process from developing ideas to preservation of data; funder perspectives, the impact on the wider service, Data Asset Frameworks, preservation and access, and cost implications.
A comparative study of secure search protocols in pay as-you-go cloudseSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Steps to consider when moving from paper to digital in any business. Solutions presented have been developed by TC Inc. and/or its networking team. Steps provided should work in just about any environment and allow for expansion while minimizing growing pains.
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
This document presents a methodology for a robust keyword-based document retrieval system utilizing advanced encryption. Key aspects of the methodology include:
1) Performing stop word removal and term frequency analysis to create feature vectors for documents.
2) Assigning unique numbers to terms to create a dictionary and document codes for identification and comparison.
3) Using advanced encryption techniques like substitution and mixing on the document codes.
4) Comparing the codes of user queries to document codes to find and rank the most relevant documents.
The methodology is tested on real and artificial datasets, showing improved accuracy, precision, and recall over previous methods according to experimental results.
Similar to TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections (20)
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches for business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
06-20-2024-AI Camp Meetup-Unstructured Data and Vector DatabasesTimothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases; we will see how they differ from traditional databases, in which cases you need one, and in which you probably don’t. I will also go over Similarity Search, where you get vectors from, and an example of a Vector Database Architecture, wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections
Joeran Beel, Stefan Langer, Bela Gipp
iConference 2017 -- 2017/03/24, presented by Maria Gäde
Outline
1. Term-Weighting Schemes
2. TF-IDuF Introduction
3. Evaluation
1. Term Weighting Schemes
Purpose of Term-Weighting Schemes
• Search Engines
  • Calculate how well a term describes a document’s content
  • Match with the search query
• User-Modeling and Recommender Systems
  • Calculate how well a term describes a user’s information need
  • Find the most relevant documents to satisfy the information need
TF-IDF
• TF-IDF was introduced by Jones (1972).
• Probably the most popular term-weighting scheme for search.
• One of the most popular schemes for user modeling and recommender systems.
• Two components: Term Frequency (TF) and Inverse Document Frequency (IDF).
TF-IDF(t) = tf(t) * log(Nr / nr)

t      term to weight
tf(t)  frequency of t in the documents of cum
cr     a corpus of documents that may be recommended to u
Nr     number of documents in cr
nr     number of documents in cr that contain t
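As a rough illustration of this formula (not the authors' implementation; the log base and the handling of terms that never occur in cr are assumptions), the weight could be computed along these lines:

import math

def tf_idf(term, cum_docs, cr_docs):
    """Sketch of TF-IDF as defined above; documents are lists of tokens."""
    tf = sum(doc.count(term) for doc in cum_docs)      # tf(t) in the user-modeling documents cum
    N_r = len(cr_docs)                                 # number of documents in cr
    n_r = sum(1 for doc in cr_docs if term in doc)     # documents in cr that contain t
    return tf * math.log(N_r / n_r) if n_r else 0.0    # assumption: weight 0 if t never occurs in cr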
TF-IDF Illustration
• User u possesses a document collection cu. This collection might contain, for instance, all documents that the user downloaded, bought, or read.
• The user-modeling engine identifies those documents from cu that are relevant for modeling the user’s information need. Relevant documents could be, for instance, documents that the user downloaded or bought in the past x days. The engine selects these documents as a temporary document collection cum to be used for user modeling.
• The user-modeling engine weights each term that occurs in cum with TF-IDF.
• The user-modeling engine stores the z highest-weighted terms as user model um. These terms are meant to represent the user’s information need.
• The recommender system matches um with the documents in cr and recommends the most relevant recommendation candidates to u.
[Figure: document collection cu of user u → temporary document collection cum for user modeling → terms weighted (TF, IDF) to create user model um → um matched against the corpus of recommendation candidates cr.]
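The pipeline above can be sketched roughly as follows; the tuple representation of documents, the selection by document id, and the naive matching score are illustrative assumptions rather than details from the paper.

def build_user_model(cu, selected_ids, z, weight):
    """Sketch of the user-modeling steps above.

    cu: the user's collection as (doc_id, tokens) pairs
    selected_ids: ids of the documents chosen as temporary collection cum
    weight: any term-weighting function, e.g. one of the sketches in this deck
    """
    cum = [tokens for doc_id, tokens in cu if doc_id in selected_ids]
    vocab = {t for tokens in cum for t in tokens}
    weights = {t: weight(t, cum) for t in vocab}
    # keep the z highest-weighted terms as the user model um
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:z])

def score_candidate(um, candidate_tokens):
    # naive matching: sum the user-model weights of terms occurring in the candidate
    return sum(w for t, w in um.items() if t in candidate_tokens)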
Problems of TF-IDF (for User Modelling)
1. To calculate IDF, access to the recommendation corpus is needed, which is not always available.
2. Documents in a user’s document collection that are not selected for the user-modeling process are ignored in the weighting. We assume that these remaining documents contain valuable information.
2. TF-IDuF Introduction
TF-IDuF
• The term frequency (TF) component in TF-IDuF is the same as in TF-IDF: terms are weighted higher the more often they occur in the documents selected for building the user model.
• The user-focused inverse document frequency (IDuF) differs from traditional IDF. While classic IDF is calculated using the document frequencies in the recommendation corpus, IDuF is calculated using the document frequencies in a user’s personal document collection cu, where terms are weighted more strongly the fewer documents in the user’s collection contain them.
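A minimal sketch of the change relative to the TF-IDF sketch above; only the collection used for the inverse document frequency differs (the user's own collection cu instead of cr), and the log form is again an assumption:

import math

def tf_iduf(term, cum_docs, cu_docs):
    """TF-IDuF sketch; cu_docs is the user's entire personal collection cu."""
    tf = sum(doc.count(term) for doc in cum_docs)      # same TF component as in TF-IDF
    N_u = len(cu_docs)                                 # documents in cu
    n_u = sum(1 for doc in cu_docs if term in doc)     # documents in cu containing t
    return tf * math.log(N_u / n_u) if n_u else 0.0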
Rationale A (1)
• The user-modeling engine selects a user’s two most recently downloaded documents d1 and d2.
• The frequency of t1 in d1 equals the frequency of t2 in d2.
• The user’s document collection contains additional documents with t2, but these documents were not selected.
[Figure: the user’s collection cu contains the selected documents d1 (containing t1) and d2 (containing t2) in cum, plus further documents containing t2 that were not selected for user modeling.]
Rationale A (2)
• We assume:
  • t1 describes a new topic that the author was previously not interested in. Hence, t1 should be weighted more strongly than t2.
  • It is easier to generate good recommendations for t1 than for t2 because there are potentially more documents on t1 that the user does not yet know about compared to documents on t2.
  • Users have probably received recommendations for t2 in the past.
Rationale B (1)
• The user-modeling engine selects d1, d2, … dn.
• d1 contains term t1, and d2…n contain term t2.
• The overall term frequency for t1 and t2 in cum is the same.
--> The density of t1 in d1 must be higher than the density of t2 in each of the documents d2…n. In other words, t1 occurs very frequently in d1, while t2 occurs only a few times in each of the documents d2…n.
[Figure: within the selected collection cum, d1 contains t1 while each of the documents d2…n contains t2.]
Rationale B (2)
• We assume:
  • d1 covers t1 in depth,
  • d2…n cover the topic t2 only to some extent.
• t1 is more suitable for describing the user’s information need. Hence, t1 should be weighted more strongly than t2.
3. Evaluation
Methodology
• A/B test in Docear’s research-paper recommender system.
• Docear is a reference manager that allows users to manage references and PDF files, similar to Mendeley and Zotero.
• One key difference is that Docear’s users manage their data in mind-maps. Users’ mind-maps contain links to PDFs, as well as the user’s annotations made within those PDFs.
• To calculate TF-IDuF, each mind-map of a user was considered as one document.
A/B Test Design
• Random selection of:
  • TF-IDuF
  • TF-IDF
  • TF-Only
• Evaluation with click-through rates (CTR).
• 228,762 recommendations to 3,483 users
• January – September 2014
• All results are statistically significant (p < 0.05), if not stated otherwise.
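For context, CTR is the number of clicked recommendations divided by the number of delivered recommendations. A minimal sketch of such a comparison with invented per-arm counts (the paper does not state which significance test was used; a chi-squared test is one plausible choice):

from scipy.stats import chi2_contingency

def ctr(clicks, shown):
    return clicks / shown

# invented counts for illustration only (not the study's raw numbers)
shown_a, clicks_a = 76_000, 3_868    # e.g. a TF-IDF arm
shown_b, clicks_b = 76_000, 3_086    # e.g. a TF-Only arm
table = [[clicks_a, shown_a - clicks_a],
         [clicks_b, shown_b - clicks_b]]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"CTR A = {ctr(clicks_a, shown_a):.2%}, CTR B = {ctr(clicks_b, shown_b):.2%}, p = {p_value:.4g}")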
Results
Weighting scheme   TF-Only   TF-IDF   TF-IDuF
CTR                4.06%     5.09%    5.14%
• TF-IDF outperforms TF-Only by 25% (CTR 5.09% vs. 4.06%).
• The result is not surprising, but we are the first to empirically confirm it for research-paper recommender systems.
• TF-IDuF performed equally well as TF-IDF (5.14% vs. 5.09%).
Conclusion
• TF-IDuF is equally effective as TF-IDF.
• TF-IDuF is faster to calculate than TF-IDF and can be calculated locally, without access to the global recommendation corpus.
• TF-IDuF and TF-IDF are not exclusive and could be used in a complementary manner; that is, a term could be weighted based on all three factors TF, IDF, and IDuF (see the sketch after this list).
• Further research is necessary to confirm the promising performance and to find out if TF-IDuF performs equally well on other types of personal document corpora, such as users’ collections of research papers, websites, or news articles.
--> TF-IDuF is a promising weighting scheme.
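A minimal sketch of such a combined weight, assuming a simple product of the three factors (the paper does not prescribe how TF, IDF, and IDuF would be combined):

import math

def tf_idf_iduf(term, cum_docs, cr_docs, cu_docs):
    """Illustrative combination only: TF times corpus-based IDF times user-based IDuF."""
    tf = sum(doc.count(term) for doc in cum_docs)
    # max(1, ...) avoids division by zero for unseen terms; this handling is an assumption
    idf = math.log(len(cr_docs) / max(1, sum(1 for d in cr_docs if term in d)))
    iduf = math.log(len(cu_docs) / max(1, sum(1 for d in cu_docs if term in d)))
    return tf * idf * iduf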
Term-weighting schemes are used by search engines and by user-modeling and recommender systems. Search engines use term-weighting schemes to calculate how well a term describes a document’s content, while user-modeling and recommender systems use them to calculate how well a term describes a user’s information need. One popular term-weighting scheme is TF-IDF.
TF is the frequency with which a term occurs in a document or user model. The rationale is that the more frequently a term occurs, the more likely this term describes a document’s content or user’s information need.
IDF reflects the importance of the term by computing the inverse frequency of documents containing the term within the entire corpus of documents to be searched or recommended. The basic assumption is that a term should be given a higher weight if few other documents also contain that term, because rare terms will likely be more representative of a document’s content or user’s interests.
For instance, Nascimento, Laender, Silva, & Gonçalves (2011) create user models locally in their literature recommender system and then send the user model as search query to the ACM Digital Library (the search results are presented as recommendations). In such a scenario, IDF cannot be calculated by the recommender system.
Traditional TF-IDF calculates term weights based on TF in the documents selected for the user-modeling process and IDF based on the number of documents containing the terms in the recommendation corpus.
The user-modeling engine selects a user’s two most recently downloaded documents d1 and d2. d1 contains t1 in the same frequency as d2 contains t2. Based on term frequency alone, both terms would be considered equally suitable for describing the user’s information need. However, the user’s document collection contains a number of additional documents that contain t2, but these documents were not selected for the user modeling process, e.g. because they were downloaded many months ago. There are no further documents that contain t1 in the user’s document collection. In this scenario, we may assume that t1 describes a new topic that the author was previously not interested in. We hypothesize that in such a scenario, t1 should be weighted more strongly than t2 because:
Users are likely to favor recommendations for the newer topic t1 rather than for the older topic t2.
It is easier to generate good recommendations for t1 than for t2 because there are potentially more documents on t1 that the user does not yet know about compared to documents on t2.
Users have probably received recommendations for t2 in the past, but they have likely not yet received many recommendations for t1. Hence, for t2, the most relevant documents probably have already been recommended to the user.
The user modeling engine selects d1, d2, … dn for the user modeling process. d1 contains term t1, and d2…n contain term t2. The overall term frequency for t1 and t2 in cum is the same. Consequently, the density of t1 in d1 must be higher than the density of t2 in each of the documents d2…n. In other words, t1 occurs very frequently in d1, while t2 occurs only a few times in each of the documents d2…n. We would therefore assume that d1 covers t1 in depth, while d2…n cover the topic t2 only to some extent. We hypothesize that in this scenario, t1 is more suitable for describing the user’s information need. Hence, t1 should be weighted more strongly than t2, which is the case when using TF-IDuF, since only one document in cu contains t1, while many documents contain t2.
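A small numeric illustration of this rationale with invented counts: both terms have the same overall frequency in cum, but t1 occurs in only one document of the user’s collection cu, so IDuF boosts it.

import math

# invented numbers for illustration
N_u = 10                     # documents in the user's collection cu
n_u_t1, n_u_t2 = 1, 8        # documents in cu containing t1 / t2
tf_t1 = tf_t2 = 12           # equal overall term frequency in cum

w_t1 = tf_t1 * math.log(N_u / n_u_t1)   # ~27.6
w_t2 = tf_t2 * math.log(N_u / n_u_t2)   # ~2.7
# With TF-IDuF, t1 receives a much higher weight than t2, matching the argument above.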
Whenever Docear wanted to display recommendations, it randomly selected one of the three weighting schemes. We measured how often users clicked on the recommendations.
Click-through rate for TF-IDF was significantly higher than for TF-Only (5.09% vs. 4.06%), i.e. TF-IDF was approximately 25% more effective than TF-Only (Figure 3).
This result confirms previous findings that TF-IDF is more effective than term frequency alone. Although the result is not surprising, we are, to the best of our knowledge, the first to empirically confirm it for research-paper recommender systems.
TF-IDuF achieved a CTR of 5.14%, meaning it performed equally well as TF-IDF, with its average CTR of 5.09% (the difference is statistically not significant).
We performed the first evaluation of TF-IDuF using the mind maps of Docear’s users as personal document corpora.
We were positively surprised by the results.