TF-IDuF: A Novel Term-Weighting Scheme for User
Modeling based on Users’ Personal Document
Collections
Joeran Beel, Stefan Langer, Bela Gipp
iConference 2017 -- 2017/03/24, presented by Maria Gäde
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 2
Outline
1. Term-Weighting Schemes
2. TF-IDuF Introduction
3. Evaluation
1. Term-Weighting Schemes
Purpose of Term-Weighting Schemes
• Search Engines
• Calculate how well a term describes
a document’s content
• Match with search query
• User-Modelling and Recommender
Systems
• Calculate how well a term describes
a user’s information need.
• Find most relevant documents to
satisfy the information need
TF-IDF
• TF-IDF was introduced by Jones
(1972)
• Probably the most popular
term-weighting scheme for
search
• One of the most popular
schemes for user modeling and
recommender systems.
• Two components
• Term Frequency (TF)
• Inverse document frequency
(IDF).
$$\text{TF-IDF}(t) = tf(t) \cdot \log\frac{N_r}{n_r}$$
t: Term to weight
tf(t): Frequency of t in the documents of cum
cr: A corpus of documents that may be recommended to u
Nr: Number of documents in cr
nr: Number of documents in cr that contain t
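The TF-IDF formula on this slide, tf(t) · log(Nr / nr), can be sketched in a few lines of Python. This is only an illustration: the function name `tf_idf`, the toy documents, the natural logarithm, and the choice to give weight 0 to terms that never occur in cr are all assumptions, not taken from the slides.

```python
import math
from collections import Counter

def tf_idf(term, modeling_docs, candidate_docs):
    """Weight a term as tf(t) * log(N_r / n_r); documents are token lists."""
    # tf(t): how often the term occurs in the documents of c_um
    tf = sum(Counter(doc)[term] for doc in modeling_docs)
    # N_r: number of documents in the recommendation corpus c_r
    N_r = len(candidate_docs)
    # n_r: number of documents in c_r that contain the term
    n_r = sum(1 for doc in candidate_docs if term in doc)
    if tf == 0 or n_r == 0:
        return 0.0  # assumption: unseen terms get weight 0
    return tf * math.log(N_r / n_r)

# Hypothetical toy data
c_um = [["recommender", "system", "recommender"], ["system", "search"]]
c_r = [["recommender", "paper"], ["system"],
       ["system", "search"], ["system", "index"]]

print(tf_idf("recommender", c_um, c_r))  # 2 * log(4 / 1)
print(tf_idf("system", c_um, c_r))       # 2 * log(4 / 3)
```

Both terms occur twice in cum, but "recommender" is rarer in cr and therefore receives the higher weight.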
TF-IDF Illustration
• User u possesses a document collection cu. This collection might contain, for instance,
all documents that the user downloaded, bought, or read.
• The user-modeling engine identifies those documents from cu that are relevant for
modeling the user’s information need. Relevant documents could be, for instance,
documents that the user downloaded or bought in the past x days. The engine selects
these documents as a temporary document collection cum to be used for user
modeling.
• The user-modeling engine weights each term that occurs in cum with TF-IDF.
• The user-modeling engine stores the z highest weighted terms as user model um.
These terms are meant to represent the user’s information need.
• The recommender system matches um with the documents in cr and recommends the
most relevant recommendation candidates to u.
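The five steps above can be sketched roughly as follows. The sketch makes several simplifying assumptions that are not in the slides: cu is a list of token lists ordered from oldest to newest, "the past x days" is approximated by "the x most recent documents", and candidates are scored by the summed weight of the user-model terms they contain. The names `build_user_model` and `recommend` are hypothetical.

```python
import math
from collections import Counter

def build_user_model(c_u, c_r, x_latest, z):
    # Select the temporary collection c_um
    # (assumption: the x most recent documents of c_u)
    c_um = c_u[-x_latest:]
    # Weight every term in c_um with TF-IDF, using c_r for the IDF part
    tf = Counter(term for doc in c_um for term in doc)
    N_r = len(c_r)
    weights = {}
    for term, freq in tf.items():
        n_r = sum(1 for doc in c_r if term in doc)
        weights[term] = freq * math.log(N_r / n_r) if n_r else 0.0
    # Keep the z highest weighted terms as user model u_m
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:z]
    return dict(top)

def recommend(u_m, c_r, k):
    # Assumed matching: sum the weights of user-model terms in each candidate
    scored = [(sum(w for t, w in u_m.items() if t in doc), i)
              for i, doc in enumerate(c_r)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0]

# Hypothetical toy data
c_u = [["old", "topic"], ["citation", "analysis"], ["citation", "recommender"]]
c_r = [["citation", "index"], ["recommender", "evaluation"],
       ["search", "engine"], ["old", "news"]]

u_m = build_user_model(c_u, c_r, x_latest=2, z=2)
print(u_m)                     # weights for "citation" and "recommender"
print(recommend(u_m, c_r, 2))  # indices of the best-matching candidates
```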
[Figure: TF-IDF user-modeling pipeline. Document collection cu of user u → identify relevant documents → temporary document collection for user modeling cum → weight terms t1...n and create user model um of user u (TF from cum, IDF from the corpus of recommendation candidates cr) → match user model and recommendation candidates.]
Problems of TF-IDF (for User Modelling)
1. To calculate IDF, access to the recommendation corpus is needed,
which is not always available.
2. Documents in a user’s document collection that are not selected
for the user modelling process are ignored in the weighting. We
assume that these remaining documents contain valuable
information.
2. TF-IDuF Introduction
TF-IDuF
• The term frequency (TF) component in TF-IDuF is the same as in TF-IDF:
terms are weighted higher, the more often they occur in the documents
selected for building the user model.
• The user-focused inverse document frequency (IDuF) differs from
traditional IDF. While the classic IDF is calculated using the document
frequencies in the recommendation corpus, IDuF is calculated using the
document frequencies in a user’s personal document collection cu, where
terms are weighted more strongly, the fewer documents in a user’s
collection contain these terms.
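A minimal sketch of this idea, assuming that IDuF mirrors the IDF formula with the user's own collection cu in place of the recommendation corpus, i.e. tf(t) · log(Nu / nu). The slides describe IDuF only in words, so this exact form, the symbols Nu and nu, the function name `tf_iduf`, and the toy data are assumptions.

```python
import math

def tf_iduf(term, c_um, c_u):
    """Assumed form: tf(t) * log(N_u / n_u), computed from the user's data only."""
    # tf(t): frequency of the term in the documents selected for user modeling
    tf = sum(doc.count(term) for doc in c_um)
    # N_u: number of documents in the user's personal collection c_u
    N_u = len(c_u)
    # n_u: number of documents in c_u that contain the term
    n_u = sum(1 for doc in c_u if term in doc)
    if tf == 0 or n_u == 0:
        return 0.0
    return tf * math.log(N_u / n_u)

# Hypothetical toy collection; the most recent document is used for modeling
c_u = [["mind", "map", "pdf"],
       ["pdf", "annotation"],
       ["pdf", "recommender", "recommender"]]
c_um = c_u[-1:]

print(tf_iduf("recommender", c_um, c_u))  # 2 * log(3 / 1)
print(tf_iduf("pdf", c_um, c_u))          # 1 * log(3 / 3) = 0.0
```

Note that no recommendation corpus appears anywhere in the computation: a term that occurs in every document of the user's collection receives weight 0.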
Rationale A (1)
• The user-modeling engine
selects a user’s two most
recently downloaded
documents d1 and d2.
• Frequency of t1 in d1 equals
frequency of t2 in d2.
• User’s document collection
contains additional documents
with t2, but these documents
were not selected.
[Figure, Option 1: documents d1 (contains t1) and d2 (contains t2) are selected from document collection cu of user u into the document collection for user modeling cum; further documents in cu contain t2 but are not relevant for user modeling.]
Rationale A (2)
• We assume
• t1 describes a new topic that the
user was previously not
interested in. Hence, t1 should be
weighted more strongly than t2.
• It is easier to generate good
recommendations for t1 than for
t2 because there are potentially
more documents on t1 that the
user does not yet know about
compared to documents on t2.
• Users have probably received
recommendations for t2 in the
past
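A small worked example may make Rationale A concrete. It assumes that IDuF mirrors the IDF formula with the user's collection cu in place of cr, i.e. TF-IDuF(t) = tf(t) · log(Nu / nu); the slides describe this only in words, so the formula, the symbols Nu and nu, the natural logarithm, and all numbers below are hypothetical.

```latex
% Assumed: N_u = 10 documents in c_u; tf(t_1) = tf(t_2) = 5 in c_{um};
% t_1 occurs only in d_1 (n_u = 1); t_2 occurs in d_2 and in three
% non-selected documents (n_u = 4).
\begin{align*}
\text{TF-IDuF}(t_1) &= 5 \cdot \ln\frac{10}{1} \approx 11.51 \\
\text{TF-IDuF}(t_2) &= 5 \cdot \ln\frac{10}{4} \approx 4.58
\end{align*}
```

With equal term frequencies, t1 ends up with the higher weight because fewer documents in the user's collection contain it.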
Rationale B (1)
• The user modeling engine
selects d1, d2, … dn
• d1 contains term t1, and d2…n
contain term t2.
• The overall term frequency for
t1 and t2 in cum is the same.
--> The density of t1 in d1 must be
higher than the density of t2 in
each of the documents d2…n. In
other words, t1 occurs very
frequently in d1, while t2 occurs
only a few times in each of the
documents d2…n.
[Figure, Example 1: documents d1 (contains t1) and d2...n are selected from document collection cu of user u into the document collection for user modeling cum.]
Rationale B (2)
• We assume
• d1 covers t1 in depth,
• d2…n cover the topic t2 only
to some extent.
• t1 is more suitable for
describing the user’s
information need. Hence, t1
should be weighted more
strongly than t2.
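Rationale B can be illustrated with a small sketch. It assumes the form tf(t) · log(Nu / nu) for TF-IDuF (not spelled out in the slides), uses hypothetical toy documents, and treats the selected documents as the user's whole collection for simplicity.

```python
import math

# d1 covers t1 densely; d2..d4 each mention t2 only twice (toy data)
d1 = ["t1"] * 6 + ["filler"] * 4
d_rest = [["t2", "t2"] + ["filler"] * 8 for _ in range(3)]
c_um = [d1] + d_rest
c_u = c_um  # assumption: the whole collection is selected

def term_frequency(term, docs):
    return sum(doc.count(term) for doc in docs)

def tf_iduf(term, c_um, c_u):
    # Assumed form: tf(t) * log(N_u / n_u)
    tf = term_frequency(term, c_um)
    n_u = sum(1 for doc in c_u if term in doc)
    return tf * math.log(len(c_u) / n_u) if tf and n_u else 0.0

# Same overall term frequency in c_um ...
print(term_frequency("t1", c_um), term_frequency("t2", c_um))  # 6 6
# ... but different density per document
print(d1.count("t1") / len(d1), d_rest[0].count("t2") / len(d_rest[0]))  # 0.6 0.2
# TF-IDuF favours the term concentrated in a single document
print(tf_iduf("t1", c_um, c_u) > tf_iduf("t2", c_um, c_u))  # True
```

Plain TF would weight both terms equally here; the user-focused document frequency is what separates them.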
3. Evaluation
Methodology
• A/B Test in Docear’s research-paper
recommender system.
• Docear is a reference manager that
allows users to manage references and
PDF files, similar to Mendeley and
Zotero.
• One key difference is that Docear’s
users manage their data in mind-maps.
Users’ mind-maps contain links to
PDFs, as well as the user’s annotations
made within those PDFs.
• To calculate TF-IDuF, each mind map of
a user was considered as one
document.
A/B Test Design
• Random Selection of
• TF-IDuF
• TF-IDF
• TF-only
• Evaluation with click-through rates (CTR).
• 228,762 recommendations to 3,483 users
• January – September 2014.
• All results are statistically significant (p<0.05), if not stated
otherwise.
Results
Weighting scheme   TF-Only   TF-IDF   TF-IDuF
CTR                4.06%     5.09%    5.14%
• TF-IDF outperforms TF-Only by 25% (CTR 5.09% vs. 4.06%)
• This result is not surprising, but we are the first to empirically confirm
it for research-paper recommender systems.
• TF-IDuF performed as well as TF-IDF (5.14% vs. 5.09%).
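The relative differences quoted above follow directly from the reported click-through rates. The short check below uses only the three CTR values from the results; the helper name `relative_gain` is hypothetical.

```python
# Reported click-through rates in percent
ctr = {"TF-Only": 4.06, "TF-IDF": 5.09, "TF-IDuF": 5.14}

def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100

print(round(relative_gain(ctr["TF-IDF"], ctr["TF-Only"]), 1))   # 25.4
print(round(relative_gain(ctr["TF-IDuF"], ctr["TF-IDF"]), 1))   # 1.0
```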
Conclusion
• TF-IDuF is as effective as TF-IDF.
• TF-IDuF is faster to calculate than TF-IDF and can be calculated
locally, without access to the global recommendation corpus.
• TF-IDuF and TF-IDF are not mutually exclusive and could be used in a
complementary manner. That is, a term could be weighted
based on all three factors: TF, IDF, and IDuF.
• Further research is necessary to confirm the promising performance
and to find out whether TF-IDuF performs equally well on other types of
personal document corpora, such as users’ collections of research
papers, websites, or news articles.
--> TF-IDuF is a promising weighting scheme.
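The complementary use of TF, IDF, and IDuF mentioned in the conclusion could look roughly like the sketch below. The slides do not say how the three factors would be combined; multiplying them is purely an assumption for illustration, as are the function names, the assumed log(Nu / nu) form of IDuF, and the toy data.

```python
import math

def doc_freq(term, docs):
    # Number of documents that contain the term
    return sum(1 for doc in docs if term in doc)

def combined_weight(term, c_um, c_u, c_r):
    """Hypothetical combination: tf * idf * iduf."""
    tf = sum(doc.count(term) for doc in c_um)
    n_r, n_u = doc_freq(term, c_r), doc_freq(term, c_u)
    if tf == 0 or n_r == 0 or n_u == 0:
        return 0.0
    idf = math.log(len(c_r) / n_r)    # from the recommendation corpus
    iduf = math.log(len(c_u) / n_u)   # from the user's own collection
    return tf * idf * iduf

# Hypothetical toy data
c_u = [["a", "b"], ["a"], ["b", "c"]]
c_um = c_u[-1:]
c_r = [["b"], ["c"], ["c", "d"], ["e"]]

print(combined_weight("b", c_um, c_u, c_r))  # log(4) * log(3 / 2)
print(combined_weight("c", c_um, c_u, c_r))  # log(2) * log(3)
```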
Thank You
Questions: Joeran Beel, beel@tcd.ie

一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
 

TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections

  • 1. TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections Joeran Beel, Stefan Langer, Bela Gipp iConference 2017 -- 2017/03/24, presented by Maria Gäde
  • 2. Outline: 1. Term-Weighting Schemes 2. TF-IDuF Introduction 3. Evaluation (slide footer: Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie)
  • 3. 1. Term-Weighting Schemes
  • 4. Purpose of Term-Weighting Schemes
      • Search engines: calculate how well a term describes a document’s content, then match it with the search query.
      • User-modeling and recommender systems: calculate how well a term describes a user’s information need, then find the most relevant documents to satisfy that need.
  • 5. TF-IDF
      • TF-IDF was introduced by Jones (1972).
      • Probably the most popular term-weighting scheme for search.
      • One of the most popular schemes for user modeling and recommender systems.
      • Two components: term frequency (TF) and inverse document frequency (IDF).
      • Formula: $\text{TF-IDF}(t) = tf(t) \cdot \log \frac{N_r}{n_r}$
      • t: term to weight
      • tf(t): frequency of t in the documents of c_um
      • c_r: a corpus of documents that may be recommended to u
      • N_r: number of documents in c_r
      • n_r: number of documents in c_r that contain t
  • 6. TF-IDF Illustration
      • User u possesses a document collection c_u. This collection might contain, for instance, all documents that the user downloaded, bought, or read.
      • The user-modeling engine identifies those documents from c_u that are relevant for modeling the user’s information need, for instance the documents that the user downloaded or bought in the past x days. The engine selects these documents as a temporary document collection c_um to be used for user modeling.
      • The user-modeling engine weights each term that occurs in c_um with TF-IDF.
      • The user-modeling engine stores the z highest-weighted terms as user model u_m. These terms are meant to represent the user’s information need.
      • The recommender system matches u_m with the documents in c_r and recommends the most relevant recommendation candidates to u.
      • [Figure: identify relevant documents in the document collection c_u of user u to form the temporary collection c_um; weight terms t_i...n (TF from c_um, IDF from the corpus of recommendation candidates c_r) and create the user model u_m; match the user model with the recommendation candidates.]
  • 7. Problems of TF-IDF (for User Modeling)
      • 1. To calculate IDF, access to the recommendation corpus is needed, and this access is not always available.
      • 2. Documents in a user’s document collection that are not selected for the user-modeling process are ignored in the weighting. We assume that these remaining documents contain valuable information.
  • 8. 2. TF-IDuF Introduction
  • 9. TF-IDuF
      • The term frequency (TF) component in TF-IDuF is the same as in TF-IDF: the more often a term occurs in the documents selected for building the user model, the higher its weight.
      • The user-focused inverse document frequency (IDuF) differs from traditional IDF. Classic IDF is calculated using the document frequencies in the recommendation corpus, whereas IDuF is calculated using the document frequencies in the user’s personal document collection c_u: the fewer documents in c_u contain a term, the more strongly that term is weighted.
  • 10. Rationale A (1)
      • The user-modeling engine selects a user’s two most recently downloaded documents d_1 and d_2.
      • The frequency of t_1 in d_1 equals the frequency of t_2 in d_2.
      • The user’s document collection contains additional documents with t_2, but these documents were not selected.
      • [Figure, Option 1: identify relevant documents in the document collection c_u of user u to form the document collection for user modeling c_um, which holds d_1 and d_2. Legend: document contains t_1 and is relevant for user modeling; document contains t_2 and is relevant for user modeling; document contains t_2 but is not relevant for user modeling.]
  • 11. Rationale A (2)
      • We assume:
      • t_1 describes a new topic that the user was previously not interested in. Hence, t_1 should be weighted more strongly than t_2.
      • It is easier to generate good recommendations for t_1 than for t_2 because there are potentially more documents on t_1 that the user does not yet know about compared to documents on t_2.
      • Users have probably received recommendations for t_2 in the past.
      • [Figure, Option 1: document collection c_u of user u and document collection for user modeling c_um with d_1 and d_2. Legend: document contains t_1 and is relevant for user modeling; document contains t_2 and is relevant for user modeling; document contains t_2 but is not relevant for user modeling.]
  • 12. Rationale B (1)
      • The user-modeling engine selects d_1, d_2, … d_n.
      • d_1 contains term t_1, and d_2…n contain term t_2.
      • The overall term frequency for t_1 and t_2 in c_um is the same.
      • --> The density of t_1 in d_1 must be higher than the density of t_2 in each of the documents d_2…n. In other words, t_1 occurs very frequently in d_1, while t_2 occurs only a few times in each of the documents d_2…n.
      • [Figure, Example 1: identify relevant documents in the document collection c_u of user u to form the document collection for user modeling c_um, which holds d_1 and d_2...n. Legend: document contains t_1 and is relevant for user modeling; document cont[...] relevant for use[...].]
  • 13. Rationale B (2)
      • We assume:
      • d_1 covers t_1 in depth.
      • d_2…n cover the topic t_2 only to some extent.
      • t_1 is more suitable for describing the user’s information need. Hence, t_1 should be weighted more strongly than t_2.
      • [Figure, Example 1: document collection c_u of user u and document collection for user modeling c_um with d_1 and d_2...n. Legend: document contains t_1 and is relevant for user modeling; document cont[...] relevant for use[...].]
  • 14. 3. Evaluation
  • 15. Methodology
      • A/B test in Docear’s research-paper recommender system.
      • Docear is a reference manager that allows users to manage references and PDF files, similar to Mendeley and Zotero.
      • One key difference is that Docear’s users manage their data in mind maps. Users’ mind maps contain links to PDFs, as well as the user’s annotations made within those PDFs.
      • To calculate TF-IDuF, each mind map of a user was considered as one document.
  • 16. A/B Test Design
      • Random selection of one of three weighting schemes: TF-IDuF, TF-IDF, or TF-Only.
      • Evaluation with click-through rates (CTR).
      • 228,762 recommendations to 3,483 users.
      • January – September 2014.
      • All results are statistically significant (p<0.05), if not stated otherwise.
  • 17. Results
      • [Chart: CTR by weighting scheme.] TF-Only 4.06%, TF-IDF 5.09%, TF-IDuF 5.14%.
      • TF-IDF outperforms TF-Only by 25% (CTR 5.09% vs. 4.06%).
      • The result is not surprising, but we are the first to empirically confirm it for research-paper recommender systems.
      • TF-IDuF performed as well as TF-IDF (5.14% vs. 5.09%).
  • 18. Conclusion
      • TF-IDuF is as effective as TF-IDF.
      • TF-IDuF is faster to calculate than TF-IDF and can be calculated locally, without access to the global recommendation corpus.
      • TF-IDuF and TF-IDF are not mutually exclusive and could be used in a complementary manner: a term could be weighted based on all three factors TF, IDF, and IDuF.
      • Further research is necessary to confirm the promising performance and to find out whether TF-IDuF performs equally well on other types of personal document corpora, such as users’ collections of research papers, websites, or news articles.
      • --> TF-IDuF is a promising weighting scheme.
  • 19. Thank You Questions: Joeran Beel, beel@tcd.ie
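
The weighting described on the TF-IDF and TF-IDuF slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The slides describe IDuF only in words, so the sketch assumes it reuses the IDF formula with the user's collection c_u in place of the recommendation corpus c_r, i.e. log(N_u / n_u). The function names, the zero weight for terms absent from the corpus, and the toy documents (modeled loosely on Rationale A) are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Count how often each term occurs across a list of tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def inverse_document_frequency(term, corpus):
    """Return log(N / n): N documents in corpus, n of them containing term."""
    n = sum(1 for doc in corpus if term in doc)
    if n == 0:
        return 0.0  # assumption: a term absent from the corpus gets zero weight
    return math.log(len(corpus) / n)


def weight_terms(c_um, frequency_corpus):
    """Weight every term in c_um by tf(t) * log(N / n) over frequency_corpus.

    Passing c_r gives TF-IDF; passing c_u gives the assumed form of TF-IDuF.
    """
    tf = term_frequencies(c_um)
    return {t: tf[t] * inverse_document_frequency(t, frequency_corpus) for t in tf}


def user_model(weights, z):
    """Keep the z highest-weighted terms as the user model."""
    return sorted(weights, key=weights.get, reverse=True)[:z]


# Toy data in the spirit of Rationale A (all values are made up).
d1 = ["t1", "t1"]          # recent document on a new topic
d2 = ["t2", "t2"]          # recent document on an older topic
older = [["t2"], ["t2"]]   # older documents, not selected for user modeling
c_u = [d1, d2] + older     # the user's whole collection
c_um = [d1, d2]            # documents selected for user modeling

tf_only = term_frequencies(c_um)
tf_iduf = weight_terms(c_um, c_u)

print(dict(tf_only))           # both terms have the same raw frequency
print(tf_iduf)                 # t1 outweighs t2 once c_u is taken into account
print(user_model(tf_iduf, 1))
```

With term frequency alone the two terms tie. Under the assumed IDuF, t1 is weighted higher because only one document in c_u contains it, which matches the behavior the rationale slides argue for.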

Editor's Notes

  1. Term-weighting schemes are used by search engines and by user-modeling and recommender systems. Search engines use term-weighting schemes to calculate how well a term describes a document’s content, while user-modeling and recommender systems use them to calculate how well a term describes a user’s information need. One popular term-weighting scheme is TF-IDF.
  2. TF is the frequency with which a term occurs in a document or user model. The rationale is that the more frequently a term occurs, the more likely this term describes a document’s content or user’s information need. IDF reflects the importance of the term by computing the inverse frequency of documents containing the term within the entire corpus of documents to be searched or recommended. The basic assumption is that a term should be given a higher weight if few other documents also contain that term, because rare terms will likely be more representative of a document’s content or user’s interests.
  3. Traditional TF-IDF calculates term weights based on TF in the documents selected for the user-modeling process and IDF based on the number of documents containing the terms in the recommendation corpus. That corpus is not always accessible. For instance, Nascimento, Laender, Silva, & Gonçalves (2011) create user models locally in their literature recommender system and then send the user model as a search query to the ACM Digital Library (the search results are presented as recommendations). In such a scenario, IDF cannot be calculated by the recommender system.
  4. The user-modeling engine selects a user’s two most recently downloaded documents d_1 and d_2. d_1 contains t_1 with the same frequency as d_2 contains t_2. Based on term frequency alone, both terms would be considered equally suitable for describing the user’s information need. However, the user’s document collection contains a number of additional documents that contain t_2, but these documents were not selected for the user-modeling process, e.g. because they were downloaded many months ago. There are no further documents that contain t_1 in the user’s document collection. In this scenario, we may assume that t_1 describes a new topic that the user was previously not interested in. We hypothesize that in such a scenario, t_1 should be weighted more strongly than t_2 because: (1) users are likely to favor recommendations for the newer topic t_1 rather than for the older topic t_2; (2) it is easier to generate good recommendations for t_1 than for t_2 because there are potentially more documents on t_1 that the user does not yet know about compared to documents on t_2; (3) users have probably received recommendations for t_2 in the past, but they have likely not yet received many recommendations for t_1. Hence, for t_2, the most relevant documents have probably already been recommended to the user.
  5. The user-modeling engine selects d_1, d_2, … d_n for the user-modeling process. d_1 contains term t_1, and d_2…n contain term t_2. The overall term frequency for t_1 and t_2 in c_um is the same. Consequently, the density of t_1 in d_1 must be higher than the density of t_2 in each of the documents d_2…n. In other words, t_1 occurs very frequently in d_1, while t_2 occurs only a few times in each of the documents d_2…n. We would therefore assume that d_1 covers t_1 in depth, while d_2…n cover the topic t_2 only to some extent. We hypothesize that in this scenario, t_1 is more suitable for describing the user’s information need. Hence, t_1 should be weighted more strongly than t_2, which is the case when using TF-IDuF, since only one document in c_u contains t_1, while many documents contain t_2.
  6. Whenever Docear wanted to display recommendations, Docear randomly selected one of the three weighting schemes. We measured how often users clicked on the recommendations.
  7. The click-through rate for TF-IDF was significantly higher than for TF-Only (5.09% vs. 4.06%), i.e. TF-IDF was approximately 25% more effective than TF-Only (Figure 3). This result confirms the previous findings of TF-IDF being more effective than term frequency alone. Although this result is not surprising, we are, to the best of our knowledge, the first to empirically confirm it for research-paper recommender systems. TF-IDuF achieved a CTR of 5.14%, meaning it performed as well as TF-IDF, with its average CTR of 5.09% (the difference is not statistically significant).
  8. We performed the first evaluation of TF-IDuF using the mind maps of Docear’s users as personal document corpora. We were positively surprised by the results.
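
The arithmetic behind the reported results can be checked with a short Python sketch. The CTR values and the total of 228,762 recommendations come from the slides; everything else is an assumption. In particular, the slides do not report how many recommendations each weighting scheme delivered or which significance test was used, so the equal three-way split and the two-proportion z-test below are purely illustrative choices.

```python
import math

# Click-through rates reported on the results slide.
ctr = {"TF-Only": 0.0406, "TF-IDF": 0.0509, "TF-IDuF": 0.0514}


def relative_lift(new, base):
    """Relative improvement of `new` over `base`, e.g. 0.25 for +25%."""
    return (new - base) / base


def two_proportion_z(p1, n1, p2, n2):
    """z statistic of a pooled two-proportion z-test (illustrative choice of test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Hypothetical: assume the 228,762 recommendations were split evenly
# across the three schemes. The real per-scheme counts are not reported.
n_per_arm = 228_762 // 3

lift_idf_over_tf = relative_lift(ctr["TF-IDF"], ctr["TF-Only"])
lift_iduf_over_idf = relative_lift(ctr["TF-IDuF"], ctr["TF-IDF"])
z_idf_vs_tf = two_proportion_z(ctr["TF-IDF"], n_per_arm, ctr["TF-Only"], n_per_arm)
z_iduf_vs_idf = two_proportion_z(ctr["TF-IDuF"], n_per_arm, ctr["TF-IDF"], n_per_arm)

print(f"TF-IDF over TF-Only: {lift_idf_over_tf:.1%}")
print(f"TF-IDuF over TF-IDF: {lift_iduf_over_idf:.1%}")
print(f"z (TF-IDF vs. TF-Only): {z_idf_vs_tf:.2f}")
print(f"z (TF-IDuF vs. TF-IDF): {z_iduf_vs_idf:.2f}")
```

Under these assumptions, the first z statistic lies well above the usual 1.96 threshold for p<0.05 and the second lies well below it, which is in line with the significance statements in the notes. The exact values depend on the assumed split and should not be read as the authors' figures.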