SlideShare a Scribd company logo
1 of 19
TF-IDuF: A Novel Term-Weighting Scheme for User
Modeling based on Users’ Personal Document
Collections
Joeran Beel, Stefan Langer, Bela Gipp
iConference 2017 -- 2017/03/24, presented by Maria Gäde
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 2
Outline
1. Term-Weighting Schemes
2. TF-IDuF Introduction
3. Evaluation
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 3
1. Term Weighting Schemes
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 4
Purpose of Term-Weighting Schemes
• Search Engines
• Calculate how well a term describes
a document’s content
• Match with search query
• User-Modelling and Recommender
Systems
• calculate how well a term describes
a user’s information need.
• Find most relevant documents to
satisfy the information need
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 5
TF-IDF
• TF-IDF was introduced by Jones
(1972)
• Probably the most popular
term-weighting scheme for
search
• One of the most popular
schemes for user modeling and
recommender systems.
• Two components
• Term Frequency (TF)
• Inverse document frequency
(IDF).
𝑇𝐹 − 𝐼𝐷𝐹 = 𝑡𝑓 𝑡 ∗ log
𝑁𝑟
𝑛 𝑟
t Term to weight
tf(t) Frequency of tin the documents of cum
cr A corpus of documents that may be
recommended to u
Nr Number of documents in cr
nr Number of documents in cr that contain t
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 6
TF-IDF Illustration
• User u possesses a document collection cu. This collection might contain, for instance,
all documents that the user downloaded, bought, or read.
• The user-modeling engine identifies those documents from cu that are relevant for
modeling the user’s information need. Relevant documents could be, for instance,
documents that the user downloaded or bought in the past x days. The engine selects
these documents as a temporary document collection cum to be used for user
modeling.
• The user-modeling engine weights each term that occurs in cum with TF-IDF
• The user-modeling engines stores the z highest weighted terms as user model um.
These terms are meant to represent the user’s information need.
• The recommender system matches um with the documents in cr and recommends the
most relevant recommendation candidates to u.
Identify relevant
documents
Weight terms ti...n
and create um
Match user
model and
rec. candidates
User model um of
user u
Temporary
document collection
for user modeling cum
Document collection cu
of user u
Corpus of recommendation
candidates cr
IDFTF
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 7
Problems of TF-IDF (for User Modelling)
1. To calculate IDF, access to the recommendation corpus is needed,
which is not always available.
2. Documents in a user’s document collection that are not selected
for the user modelling process are ignored in the weighting. We
assume that these remaining documents contain valuable
information.
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 8
2. TF-IDuF Introduction
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 9
TF-IDuF
• The term frequency (TF) component in TF-IDuF is the same as in TF-IDF:
terms are weighted higher, the more often they occur in the documents
selected for building the user model.
• The user-focused inverse document frequency (IDuF) differs from
traditional IDF. While the classic IDF is calculated using the document
frequencies in the recommendation corpus, IDuF is calculated using the
document frequencies in a user’s personal document collection cu, where
terms are weighted more strongly, the fewer documents in a user’s
collection contain these terms.
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 10
Rationale A (1)
• The user-modeling engine
selects a user’s two most
recently downloaded
documents d1 and d2.
• Frequency of t1 in d1 equals
frequency of t2 in d2 .
• User’s document collection
contains additional with t2,
but these documents were
not selected
Identify relevant
documents
Document contains t2 and is
relevant for user modeling
Document contains t1 and is
relevant for user modeling
Document collection cu
of user u
Document collection
for user modeling cum
Document contains t2 but is not
relevant for user modeling
Legend
d1
d2
Option 1
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 11
Rationale A (2)
• We assume
• t1 describes a new topic that the
author was previously not
interested in. Hence, t1 should be
weighted stronger than t2
• It is easier to generate good
recommendations for t1 than for
t2 because there are potentially
more documents on t1 that the
user does not yet know about
compared to documents on t2.
• Users have probably received
recommendations for t2 in the
past
Identify relevant
documents
Document contains t2 and is
relevant for user modeling
Document contains t1 and is
relevant for user modeling
Document collection cu
of user u
Document collection
for user modeling cum
Document contains t2 but is not
relevant for user modeling
Legend
d1
d2
Option 1
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 12
Rationale B (1)
• The user modeling engine
selects d1, d2, … dn
• d1 contains term t1, and d2…n
contain term t2.
• The overall term frequency for
t1 and t2 in cum is the same.
--> The density of t1 in d1 must be
higher than the density of t2 in
each of the documents d2…n. In
other words, t1 occurs very
frequently in d1, while t2 occurs
only a few times in each of the
documents d2…n.
Document cont
relevant for use
Document contains t1 and is
relevant for user modeling
Legend
Identify relevant
documents
Document collection cu
of user u
Document collection
for user modeling cum
d1
d2...n
Example 1
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 13
Rationale B (2)
• We assume
• d1 covers t1 in depth,
• d2…n cover the topic t2 only
to some extent.
• t1 is more suitable for
describing the user’s
information need. Hence, t1
should be weighted stronger
than t2
Document cont
relevant for use
Document contains t1 and is
relevant for user modeling
Legend
Identify relevant
documents
Document collection cu
of user u
Document collection
for user modeling cum
d1
d2...n
Example 1
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 14
3. Evaluation
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 15
Methodology
• A/B Test in Docear’s research-paper
recommender system.
• Docear is a reference manager that
allows users to manage references and
PDF files, similar to Mendeley and
Zotero.
• One key difference is that Docear’s
users manage their data in mind-maps.
Users’ mind-maps contain links to
PDFs, as well as the user’s annotations
made within those PDFs.
• To calculate TF-IDuF, each mind map of
a user was considered as one
document.
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 16
A/B Test Design
• Random Selection of
• TF-IDuF
• TF-IDF
• TF-only
• Evaluation with click-through rates (CTR).
• 228,762 recommendations to 3,483 users
• January – September 2014.
• All results are statistically significant (p<0.05), if not stated
otherwise.
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 17
Results
TF-Only TF-IDF TF-IDuF
CTR 4.06% 5.09% 5.14%
0%
1%
2%
3%
4%
5%
6%
CTR
WeightingScheme
• TF-IDF outperforms TF-Only by 25% (CTR 5.09% vs. 4.06%)
• Result is not surprising but we are the first to empirically confirm
this result for research-paper recommender systems.
• TF-IDuF performed equally well as TF-IDF (5.14% vs. 5.09%)
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 18
Conclusion
• TF-IDuF is equally effective as TF-IDF
• TF-IDuF is faster to calculate than TF-IDF and can be calculated
locally, without access to the global recommendation corpus,
• TF-IDuF and TF-IDF are not exclusive and could be used in a
complementary manner. This means, a term could be weighted
based on all three factors TF, IDF, and IDuF.
• Further research is necessary to confirm the promising performance
and to find out if TF-IDuF performs equally well on other types of
personal document corpora, such as users’ collections of research-
papers, websites or news articles.
--> TF-IDuF is a promising weighting scheme.
Thank You
Questions: Joeran Beel, beel@tcd.ie

More Related Content

What's hot

Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...IJECEIAES
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
 
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...IJECEIAES
 
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET- 	  Automated Document Summarization and Classification using Deep Lear...IRJET- 	  Automated Document Summarization and Classification using Deep Lear...
IRJET- Automated Document Summarization and Classification using Deep Lear...IRJET Journal
 
An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...Dr. Amarjeet Singh
 
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...IJECEIAES
 
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...IJECEIAES
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...israel edem
 
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCEANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCEijesajournal
 
100503 bioinfo instsymp
100503 bioinfo instsymp100503 bioinfo instsymp
100503 bioinfo instsympNick Jones
 
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...ijaia
 
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic AlgorithmPeer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic AlgorithmIRJET Journal
 
Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior Shibly Ahamed
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemIJSRD
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
 
Australia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityAustralia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityTERN Australia
 
Suggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CUSuggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CUAhsan Ullah
 
IRJET - Mobile Chatbot for Information Search
 IRJET - Mobile Chatbot for Information Search IRJET - Mobile Chatbot for Information Search
IRJET - Mobile Chatbot for Information SearchIRJET Journal
 
IRJET- Methodologies used on News Articles :A Survey
IRJET- Methodologies used on News Articles :A SurveyIRJET- Methodologies used on News Articles :A Survey
IRJET- Methodologies used on News Articles :A SurveyIRJET Journal
 

What's hot (19)

Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
 
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET- 	  Automated Document Summarization and Classification using Deep Lear...IRJET- 	  Automated Document Summarization and Classification using Deep Lear...
IRJET- Automated Document Summarization and Classification using Deep Lear...
 
An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...
 
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
 
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
 
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCEANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
 
100503 bioinfo instsymp
100503 bioinfo instsymp100503 bioinfo instsymp
100503 bioinfo instsymp
 
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
A NOVEL SCHEME FOR ACCURATE REMAINING USEFUL LIFE PREDICTION FOR INDUSTRIAL I...
 
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic AlgorithmPeer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
 
Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry System
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
Australia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityAustralia's Environmental Predictive Capability
Australia's Environmental Predictive Capability
 
Suggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CUSuggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CU
 
IRJET - Mobile Chatbot for Information Search
 IRJET - Mobile Chatbot for Information Search IRJET - Mobile Chatbot for Information Search
IRJET - Mobile Chatbot for Information Search
 
IRJET- Methodologies used on News Articles :A Survey
IRJET- Methodologies used on News Articles :A SurveyIRJET- Methodologies used on News Articles :A Survey
IRJET- Methodologies used on News Articles :A Survey
 

Similar to TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections

Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Provectus
 
GenerativeAI and Automation - IEEE ACSOS 2023.pptx
GenerativeAI and Automation - IEEE ACSOS 2023.pptxGenerativeAI and Automation - IEEE ACSOS 2023.pptx
GenerativeAI and Automation - IEEE ACSOS 2023.pptxAllen Chan
 
IRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining TechniquesIRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining TechniquesIRJET Journal
 
Information retrieval systems irt ppt do
Information retrieval systems irt ppt doInformation retrieval systems irt ppt do
Information retrieval systems irt ppt doPonnuthuraiSelvaraj1
 
IRJET- Determining Document Relevance using Keyword Extraction
IRJET-  	  Determining Document Relevance using Keyword ExtractionIRJET-  	  Determining Document Relevance using Keyword Extraction
IRJET- Determining Document Relevance using Keyword ExtractionIRJET Journal
 
IoT Processing Topologies.pptx
IoT Processing Topologies.pptxIoT Processing Topologies.pptx
IoT Processing Topologies.pptxtaruian
 
IRJET- Conextualization: Generalization and Empowering Content Domain
IRJET- Conextualization: Generalization and Empowering Content DomainIRJET- Conextualization: Generalization and Empowering Content Domain
IRJET- Conextualization: Generalization and Empowering Content DomainIRJET Journal
 
Brochure quiterian DDWeb
Brochure quiterian DDWebBrochure quiterian DDWeb
Brochure quiterian DDWebJosep Arroyo
 
18CS81 IOT MODULE 4 PPT.pdf
18CS81 IOT MODULE 4 PPT.pdf18CS81 IOT MODULE 4 PPT.pdf
18CS81 IOT MODULE 4 PPT.pdfFURYGaming22
 
An Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed DatabaseAn Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed DatabaseEditor IJMTER
 
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...Dr. Haxel Consult
 
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...inside-BigData.com
 
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013Birgit Plietzsch “RDM within research computing support” SALCTG June 2013
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013SALCTG
 
Clinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to DateClinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to DateHealth Informatics New Zealand
 
A comparative study of secure search protocols in pay as-you-go clouds
A comparative study of secure search protocols in pay as-you-go cloudsA comparative study of secure search protocols in pay as-you-go clouds
A comparative study of secure search protocols in pay as-you-go cloudseSAT Publishing House
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
 

Similar to TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections (20)

Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
 
GenerativeAI and Automation - IEEE ACSOS 2023.pptx
GenerativeAI and Automation - IEEE ACSOS 2023.pptxGenerativeAI and Automation - IEEE ACSOS 2023.pptx
GenerativeAI and Automation - IEEE ACSOS 2023.pptx
 
IRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining TechniquesIRJET- PDF Extraction using Data Mining Techniques
IRJET- PDF Extraction using Data Mining Techniques
 
Information retrieval systems irt ppt do
Information retrieval systems irt ppt doInformation retrieval systems irt ppt do
Information retrieval systems irt ppt do
 
IRJET- Determining Document Relevance using Keyword Extraction
IRJET-  	  Determining Document Relevance using Keyword ExtractionIRJET-  	  Determining Document Relevance using Keyword Extraction
IRJET- Determining Document Relevance using Keyword Extraction
 
IoT Processing Topologies.pptx
IoT Processing Topologies.pptxIoT Processing Topologies.pptx
IoT Processing Topologies.pptx
 
DU_SERIES_Session1.pdf
DU_SERIES_Session1.pdfDU_SERIES_Session1.pdf
DU_SERIES_Session1.pdf
 
Soa
SoaSoa
Soa
 
Introduction
IntroductionIntroduction
Introduction
 
IRJET- Conextualization: Generalization and Empowering Content Domain
IRJET- Conextualization: Generalization and Empowering Content DomainIRJET- Conextualization: Generalization and Empowering Content Domain
IRJET- Conextualization: Generalization and Empowering Content Domain
 
Brochure quiterian DDWeb
Brochure quiterian DDWebBrochure quiterian DDWeb
Brochure quiterian DDWeb
 
18CS81 IOT MODULE 4 PPT.pdf
18CS81 IOT MODULE 4 PPT.pdf18CS81 IOT MODULE 4 PPT.pdf
18CS81 IOT MODULE 4 PPT.pdf
 
An Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed DatabaseAn Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed Database
 
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
 
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...
The Transformation of HPC: Simulation and Cognitive Methods in the Era of Big...
 
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013Birgit Plietzsch “RDM within research computing support” SALCTG June 2013
Birgit Plietzsch “RDM within research computing support” SALCTG June 2013
 
Clinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to DateClinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to Date
 
A comparative study of secure search protocols in pay as-you-go clouds
A comparative study of secure search protocols in pay as-you-go cloudsA comparative study of secure search protocols in pay as-you-go clouds
A comparative study of secure search protocols in pay as-you-go clouds
 
From paper to digital
From paper to digitalFrom paper to digital
From paper to digital
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 

Recently uploaded

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad EscortsCall girls in Ahmedabad High profile
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections

  • 1. TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections Joeran Beel, Stefan Langer, Bela Gipp iConference 2017 -- 2017/03/24, presented by Maria Gäde
  • 2. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 2 Outline 1. Term-Weighting Schemes 2. TF-IDuF Introduction 3. Evaluation
  • 3. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 3 1. Term Weighting Schemes
  • 4. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 4 Purpose of Term-Weighting Schemes • Search Engines • Calculate how well a term describes a document’s content • Match with search query • User-Modelling and Recommender Systems • calculate how well a term describes a user’s information need. • Find most relevant documents to satisfy the information need
  • 5. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 5 TF-IDF • TF-IDF was introduced by Jones (1972) • Probably the most popular term-weighting scheme for search • One of the most popular schemes for user modeling and recommender systems. • Two components • Term Frequency (TF) • Inverse document frequency (IDF). 𝑇𝐹 − 𝐼𝐷𝐹 = 𝑡𝑓 𝑡 ∗ log 𝑁𝑟 𝑛 𝑟 t Term to weight tf(t) Frequency of tin the documents of cum cr A corpus of documents that may be recommended to u Nr Number of documents in cr nr Number of documents in cr that contain t
  • 6. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 6 TF-IDF Illustration • User u possesses a document collection cu. This collection might contain, for instance, all documents that the user downloaded, bought, or read. • The user-modeling engine identifies those documents from cu that are relevant for modeling the user’s information need. Relevant documents could be, for instance, documents that the user downloaded or bought in the past x days. The engine selects these documents as a temporary document collection cum to be used for user modeling. • The user-modeling engine weights each term that occurs in cum with TF-IDF • The user-modeling engines stores the z highest weighted terms as user model um. These terms are meant to represent the user’s information need. • The recommender system matches um with the documents in cr and recommends the most relevant recommendation candidates to u. Identify relevant documents Weight terms ti...n and create um Match user model and rec. candidates User model um of user u Temporary document collection for user modeling cum Document collection cu of user u Corpus of recommendation candidates cr IDFTF
  • 7. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 7 Problems of TF-IDF (for User Modelling) 1. To calculate IDF, access to the recommendation corpus is needed, which is not always available. 2. Documents in a user’s document collection that are not selected for the user modelling process are ignored in the weighting. We assume that these remaining documents contain valuable information.
  • 8. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 8 2. TF-IDuF Introduction
  • 9. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 9 TF-IDuF • The term frequency (TF) component in TF-IDuF is the same as in TF-IDF: terms are weighted higher, the more often they occur in the documents selected for building the user model. • The user-focused inverse document frequency (IDuF) differs from traditional IDF. While the classic IDF is calculated using the document frequencies in the recommendation corpus, IDuF is calculated using the document frequencies in a user’s personal document collection cu, where terms are weighted more strongly, the fewer documents in a user’s collection contain these terms.
  • 10. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 10 Rationale A (1) • The user-modeling engine selects a user’s two most recently downloaded documents d1 and d2. • Frequency of t1 in d1 equals frequency of t2 in d2 . • User’s document collection contains additional with t2, but these documents were not selected Identify relevant documents Document contains t2 and is relevant for user modeling Document contains t1 and is relevant for user modeling Document collection cu of user u Document collection for user modeling cum Document contains t2 but is not relevant for user modeling Legend d1 d2 Option 1
  • 11. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 11 Rationale A (2) • We assume • t1 describes a new topic that the author was previously not interested in. Hence, t1 should be weighted stronger than t2 • It is easier to generate good recommendations for t1 than for t2 because there are potentially more documents on t1 that the user does not yet know about compared to documents on t2. • Users have probably received recommendations for t2 in the past Identify relevant documents Document contains t2 and is relevant for user modeling Document contains t1 and is relevant for user modeling Document collection cu of user u Document collection for user modeling cum Document contains t2 but is not relevant for user modeling Legend d1 d2 Option 1
  • 12. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 12 Rationale B (1) • The user modeling engine selects d1, d2, … dn • d1 contains term t1, and d2…n contain term t2. • The overall term frequency for t1 and t2 in cum is the same. --> The density of t1 in d1 must be higher than the density of t2 in each of the documents d2…n. In other words, t1 occurs very frequently in d1, while t2 occurs only a few times in each of the documents d2…n. Document cont relevant for use Document contains t1 and is relevant for user modeling Legend Identify relevant documents Document collection cu of user u Document collection for user modeling cum d1 d2...n Example 1
  • 13. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 13 Rationale B (2) • We assume • d1 covers t1 in depth, • d2…n cover the topic t2 only to some extent. • t1 is more suitable for describing the user’s information need. Hence, t1 should be weighted stronger than t2 Document cont relevant for use Document contains t1 and is relevant for user modeling Legend Identify relevant documents Document collection cu of user u Document collection for user modeling cum d1 d2...n Example 1
  • 14. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 14 3. Evaluation
  • 15. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 15 Methodology • A/B Test in Docear’s research-paper recommender system. • Docear is a reference manager that allows users to manage references and PDF files, similar to Mendeley and Zotero. • One key difference is that Docear’s users manage their data in mind-maps. Users’ mind-maps contain links to PDFs, as well as the user’s annotations made within those PDFs. • To calculate TF-IDuF, each mind map of a user was considered as one document.
  • 16. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 16 A/B Test Design • Random Selection of • TF-IDuF • TF-IDF • TF-only • Evaluation with click-through rates (CTR). • 228,762 recommendations to 3,483 users • January – September 2014. • All results are statistically significant (p<0.05), if not stated otherwise.
  • 17. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 17 Results TF-Only TF-IDF TF-IDuF CTR 4.06% 5.09% 5.14% 0% 1% 2% 3% 4% 5% 6% CTR WeightingScheme • TF-IDF outperforms TF-Only by 25% (CTR 5.09% vs. 4.06%) • Result is not surprising but we are the first to empirically confirm this result for research-paper recommender systems. • TF-IDuF performed equally well as TF-IDF (5.14% vs. 5.09%)
  • 18. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 18 Conclusion • TF-IDuF is equally effective as TF-IDF • TF-IDuF is faster to calculate than TF-IDF and can be calculated locally, without access to the global recommendation corpus, • TF-IDuF and TF-IDF are not exclusive and could be used in a complementary manner. This means, a term could be weighted based on all three factors TF, IDF, and IDuF. • Further research is necessary to confirm the promising performance and to find out if TF-IDuF performs equally well on other types of personal document corpora, such as users’ collections of research- papers, websites or news articles. --> TF-IDuF is a promising weighting scheme.
  • 19. Thank You Questions: Joeran Beel, beel@tcd.ie

Editor's Notes

  1. Term-weighting schemes are used by search engines and by user-modeling and recommender systems. Search engines use term-weighting schemes to calculate how well a term describes a document’s content, while user-modeling and recommender systems use term-weighting schemes to calculate how well a term describes a user’s information need. One popular term-weighting schemes is TF-IDF
  2. TF is the frequency with which a term occurs in a document or user model. The rationale is that the more frequently a term occurs, the more likely this term describes a document’s content or user’s information need. IDF reflects the importance of the term by computing the inverse frequency of documents containing the term within the entire corpus of documents to be searched or recommended. The basic assumption is that a term should be given a higher weight if few other documents also contain that term, because rare terms will likely be more representative of a document’s content or user’s interests.
  3. For instance, Nascimento, Laender, Silva, & Gonçalves (2011) create user models locally in their literature recommender system and then send the user model as search query to the ACM Digital Library (the search results are presented as recommendations). In such a scenario, IDF cannot be calculated by the recommender system. Traditional TF-IDF calculates term weights based on TF in the documents selected for the user-modeling process and IDF based on the number of documents containing the terms in the recommendation corpus.
  4. The user-modeling engine selects a user’s two most recently downloaded documents d1 and d2. d1 contains t1 in the same frequency as d2 contains t2. Based on term frequency alone, both terms would be considered equally suitable for describing the user’s information need. However, the user’s document collection contains a number of additional documents that contain t2, but these documents were not selected for the user modeling process, e.g. because they were downloaded many months ago. There are no further documents that contain t1 in the user’s document collection. In this scenario, we may assume that t1 describes a new topic that the author was previously not interested in. We hypothesize that in such a scenario, t1 should be weighted more strongly than t2 because: Users are likely to favor recommendations for the newer topic t1 rather than for the older topic t2. It is easier to generate good recommendations for t1 than for t2 because there are potentially more documents on t1 that the user does not yet know about compared to documents on t2. Users have probably received recommendations for t2 in the past, but they have likely not yet received many recommendations for t1. Hence, for t2, the most relevant documents probably have already been recommended to the user.
  5. The user-modeling engine selects a user’s two most recently downloaded documents d1 and d2. d1 contains t1 in the same frequency as d2 contains t2. Based on term frequency alone, both terms would be considered equally suitable for describing the user’s information need. However, the user’s document collection contains a number of additional documents that contain t2, but these documents were not selected for the user modeling process, e.g. because they were downloaded many months ago. There are no further documents that contain t1 in the user’s document collection. In this scenario, we may assume that t1 describes a new topic that the author was previously not interested in. We hypothesize that in such a scenario, t1 should be weighted more strongly than t2 because: Users are likely to favor recommendations for the newer topic t1 rather than for the older topic t2. It is easier to generate good recommendations for t1 than for t2 because there are potentially more documents on t1 that the user does not yet know about compared to documents on t2. Users have probably received recommendations for t2 in the past, but they have likely not yet received many recommendations for t1. Hence, for t2, the most relevant documents probably have already been recommended to the user.
  6. The user modeling engine selects d1, d2, … dn for the user modeling process. d1 contains term t1, and d2…n contain term t2. The overall term frequency for t1 and t2 in cum is the same. Consequently, the density of t1 in d1 must be higher than the density of t2 in each of the documents d2…n. In other words, t1 occurs very frequently in d1, while t2 occurs only a few times in each of the documents d2…n. We would therefore assume that d1 covers t1 in depth, while d2…n cover the topic t2 only to some extent. We hypothesize that in this scenario, t1 is more suitable for describing the user’s information need. Hence, t1 should be weighted more strongly than t2, which is the case when using TF-IDuF, since only one document in cu contains t1, while many documents contain t2.
  7. The user modeling engine selects d1, d2, … dn for the user modeling process. d1 contains term t1, and d2…n contain term t2. The overall term frequency for t1 and t2 in cum is the same. Consequently, the density of t1 in d1 must be higher than the density of t2 in each of the documents d2…n. In other words, t1 occurs very frequently in d1, while t2 occurs only a few times in each of the documents d2…n. We would therefore assume that d1 covers t1 in depth, while d2…n cover the topic t2 only to some extent. We hypothesize that in this scenario, t1 is more suitable for describing the user’s information need. Hence, t1 should be weighted more strongly than t2, which is the case when using TF-IDuF, since only one document in cu contains t1, while many documents contain t2.
  8. Whenever Docear wanted to diaplay recommendations, Docear randomly selected on of the three weighting schemes. We measured how often users clicked on the recommendations.
  9. Click-through rate for TF-IDF was significantly higher than for TF-Only (5.09% vs. 4.06%), i.e. TF-IDF was approximately 25% more effective than TF-Only (Figure 3). This result confirms the previous findings of TF-IDF being more effective than term frequency alone. Although, this result is not surprising, we are, to the best of our knowledge, the first to empirically confirm this result for research-paper recommender systems. TF-IDuF achieved a CTR of 5.14%, meaning it performed equally well as TF-IDF, with its average CTR of 5.09% (the difference is statistically not significant).
  10. We performed the first evaluation of TF-IDuF using the mind maps of Docear’s users as personal document corpora. We were positively surprised by the results.