SlideShare a Scribd company logo
1 of 15
Evaluating the CC-IDF citation-weighting scheme: How
effectively can ‘Inverse Document Frequency’ (IDF) be
applied to references?
Joeran Beel, Corinna Breitinger, Stefan Langer
iConference 2017 -- 2017/03/24, presented by Maria Gäde
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 2
Outline
1. CC-IDF
2. Evaluation
1. Methodology
2. Results
3. Discussion
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 3
1. CC-IDF: Common Citation – Inverse
Document Frequency
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 4
Introduction
• Introduced in 1998 in the digital library and citation-indexing
system CiteSeer (Bollacker, Lawrence, & Giles, 1998)
• CC-IDF = TF-IDF applied to citations/references
• Assumption: “if a very uncommon citation is shared by two
documents, this should be weighted more highly than a citation
made by a large number of documents” (Giles et al., 1998)
• Similar concept as Bibliographic Coupling (but different weighting
than Coupling Strength)
• In the original CC-IDF, the CC component is binary.
• We focus on IDF
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 5
Illustration
CC-IDF != Bibliographic Coupling Strength
Cited
Document dcited
2
Bibliogr. Coupled
Document dBC
3
Bibliogr. Coupled
Document dBC
4
Which bibliographic-coupled document
is more closely related to di?
Cited
Document dcited
1
Input
Document di
Bibliogr. Coupled
Document dBC
1
Bibliogr. Coupled
Document dBC
2
CC-IDF = 1/2CC-IDF = 1/3 CC-IDF = 1/3 CC-IDF = 1/3 + 1/2 = 5/6
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 6
Use of CC-IDF
• CC-IDF has been used in several recommender systems, and served
as a baseline in many evaluations.
• Ambiguous reports regarding the effectiveness of CC-IDF
• Sometimes, CC-IDF was found to perform better and other times
worse than simple bibliographic coupling and co-citation strength
(Küçüktunç, Saule, Kaya, & Çatalyürek, 2012; Küçüktunç et al.,
2013; Liang et al., 2011; Pan et al., 2015; Zhang et al., 2012)
• Compared to more advanced approaches such as HITS, PaperRank,
and Katz, CC-IDF performs usually poorly (Küçüktunç et al., 2012;
Pan et al., 2015)
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 7
Problem
• CC-IDF has never been compared to CC-Only, i.e. a simple citation
weighting scheme based only on the CC component and ignoring
IDF.
• The basic assumption underlying CC-IDF – namely that “if a very
uncommon citation is shared by two documents, this should be
weighted more highly than a citation made by a large number of
documents” – has never been evaluated for its effectiveness.
--> Goal: Compare CC-IDF vs. CC-Only
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 8
2. Evaluation
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 9
Methodology
• A/B Test in Docear’s research-paper
recommender system.
• Docear is a reference manager that
allows users to manage references and
PDF files, similar to Mendeley and
Zotero.
• One key difference is that Docear’s
users manage their data in mind-maps.
Users’ mind-maps contain links to
PDFs, as well as the user’s annotations
made within those PDFs.
• To calculate CC-IDF, each mind map of a
user was considered as one document.
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 10
CC-IDF in Docear
Recommendation Candidate
Corpus
Contains
Contains
Cites
Cites
Document d2
Document d1
Candidate c1
Cites
Candidate c4
Which candidate is more
related? Candidate c1 or
candidates c2/c3/c4 ?
Candidate c2
Cites
Candidate c3
Document collection of
User u
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 11
A/B Test Design
• Random Selection of
• CC-IDF
• CC-Only
• Evaluation with click-through rates (CTR).
• 238,681 recommendations to 3,561 users
• January – September 2014.
• All results are statistically significant (p<0.05), if not stated
otherwise.
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 12
Results
• CC-Only: 6.23%
• CC-IDF: 6.15%
• Difference is statistically not significant
1 2 3 4 [5-9] [10-14] [15-24] [25-74] [75-149] >=150
CC-Only (Dspld Recs) 180 451 667 713 1,964 896 2,098 7,329 5,848 4,675
CC-IDF (Dspld Recs) 252 496 600 770 1,660 1,075 2,484 10,114 5,555 4,979
CC-Only (CTR) 3.13% 1.93% 4.74% 5.00% 7.21% 8.08% 6.50% 6.00% 6.83% 5.87%
CC-IDF (CTR) 2.58% 3.86% 5.51% 5.62% 6.28% 8.03% 6.35% 6.64% 6.38% 4.92%
-
2,000
4,000
6,000
8,000
10,000
12,000
0%
1%
2%
3%
4%
5%
6%
7%
8%
9%
Numberofdisplayed
recommendations
CTR
Number of utilized references
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 13
Potential Reasons
1. Citations always have some significance
2. Document Frequency (DF) has too little variation based on
citations
3. New research articles have lower Document Frequency
4. No normalization for length of reference list
Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 14
Outlook
• More research with:
• Traditional related-document scenario
• Larger document corpus
• Users with more documents in their collections
• Higher quality of citation data
• Variation of CC component
Thank You
Questions: Joeran Beel, beel@tcd.ie

More Related Content

What's hot

IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
 
Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...IJECEIAES
 
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...IJECEIAES
 
100503 bioinfo instsymp
100503 bioinfo instsymp100503 bioinfo instsymp
100503 bioinfo instsympNick Jones
 
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...IJECEIAES
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...israel edem
 
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET- 	  Automated Document Summarization and Classification using Deep Lear...IRJET- 	  Automated Document Summarization and Classification using Deep Lear...
IRJET- Automated Document Summarization and Classification using Deep Lear...IRJET Journal
 
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...IJECEIAES
 
An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...Dr. Amarjeet Singh
 
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic AlgorithmPeer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic AlgorithmIRJET Journal
 
Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior Shibly Ahamed
 
Australia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityAustralia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityTERN Australia
 
Suggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CUSuggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CUAhsan Ullah
 
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCEANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCEijesajournal
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemIJSRD
 
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics CareersAligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics CareersWouter Deconinck
 

What's hot (16)

IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...Text pre-processing of multilingual for sentiment analysis based on social ne...
Text pre-processing of multilingual for sentiment analysis based on social ne...
 
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
Analysis Model in the Cloud Optimization Consumption in Pricing the Internet ...
 
100503 bioinfo instsymp
100503 bioinfo instsymp100503 bioinfo instsymp
100503 bioinfo instsymp
 
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
Improving IF Algorithm for Data Aggregation Techniques in Wireless Sensor Net...
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
 
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET- 	  Automated Document Summarization and Classification using Deep Lear...IRJET- 	  Automated Document Summarization and Classification using Deep Lear...
IRJET- Automated Document Summarization and Classification using Deep Lear...
 
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
An Efficient Cloud Scheduling Algorithm for the Conservation of Energy throug...
 
An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...An Iterative Model as a Tool in Optimal Allocation of Resources in University...
An Iterative Model as a Tool in Optimal Allocation of Resources in University...
 
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic AlgorithmPeer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
 
Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior Green computing on Consumer's buying behavior
Green computing on Consumer's buying behavior
 
Australia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityAustralia's Environmental Predictive Capability
Australia's Environmental Predictive Capability
 
Suggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CUSuggestion Mining by Ahsan_CSE_CU
Suggestion Mining by Ahsan_CSE_CU
 
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCEANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
ANALYSIS OF SYSTEM ON CHIP DESIGN USING ARTIFICIAL INTELLIGENCE
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry System
 
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics CareersAligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
Aligning Nuclear Physics Computing Techniques with Non-Research Physics Careers
 

Similar to Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?

Towards Generating Policy-compliant Datasets (poster)
Towards GeneratingPolicy-compliant Datasets (poster)Towards GeneratingPolicy-compliant Datasets (poster)
Towards Generating Policy-compliant Datasets (poster)Christophe Debruyne
 
Clinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to DateClinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to DateHealth Informatics New Zealand
 
Data Science BD2K Update for NIH
Data Science BD2K Update for NIH Data Science BD2K Update for NIH
Data Science BD2K Update for NIH Philip Bourne
 
Cyber Resilient Energy Delivery Consortium - Overview
Cyber Resilient Energy Delivery Consortium - OverviewCyber Resilient Energy Delivery Consortium - Overview
Cyber Resilient Energy Delivery Consortium - OverviewCheri Soliday
 
ELIXIR . Technical Coordinator
ELIXIR. Technical CoordinatorELIXIR. Technical Coordinator
ELIXIR . Technical CoordinatorRafael C. Jimenez
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesJason Hattrick-Simpers
 
Datanomics: the value of research data.
 Datanomics: the value of research data.  Datanomics: the value of research data.
Datanomics: the value of research data. Neil Beagrie
 
Seven Degrees Presentation for 2015 ICEAA
Seven Degrees Presentation for 2015 ICEAASeven Degrees Presentation for 2015 ICEAA
Seven Degrees Presentation for 2015 ICEAAJames Lawlor
 
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...Dell World
 
Aitp presentation ed holub - october 23 2010
Aitp presentation   ed holub - october 23 2010Aitp presentation   ed holub - october 23 2010
Aitp presentation ed holub - october 23 2010AITPHouston
 
Commons credits model breakout
Commons credits model breakoutCommons credits model breakout
Commons credits model breakoutGeorge Komatsoulis
 
Williamson (2012 jan 10) IT project complexity, complication, and success - ...
Williamson (2012 jan 10) IT project complexity, complication, and success -  ...Williamson (2012 jan 10) IT project complexity, complication, and success -  ...
Williamson (2012 jan 10) IT project complexity, complication, and success - ...David Williamson
 
Standards for clinical research data - steps to an information model (CRIM).
Standards for clinical research data - steps to an information model (CRIM).Standards for clinical research data - steps to an information model (CRIM).
Standards for clinical research data - steps to an information model (CRIM).Wolfgang Kuchinke
 
Rapidly Enable Tangible Business Value through Data Virtualization
Rapidly Enable Tangible Business Value through Data VirtualizationRapidly Enable Tangible Business Value through Data Virtualization
Rapidly Enable Tangible Business Value through Data VirtualizationDenodo
 
Introduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceIntroduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceFerdin Joe John Joseph PhD
 
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...DATAVERSITY
 
DOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that Matter
DOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that MatterDOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that Matter
DOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that MatterGene Kim
 
Direct Project HIT Standards 10.27
Direct Project HIT Standards 10.27Direct Project HIT Standards 10.27
Direct Project HIT Standards 10.27Brian Ahier
 

Similar to Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references? (20)

Towards Generating Policy-compliant Datasets (poster)
Towards GeneratingPolicy-compliant Datasets (poster)Towards GeneratingPolicy-compliant Datasets (poster)
Towards Generating Policy-compliant Datasets (poster)
 
Clinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to DateClinical Document Architecture Implementations - Lessons Learnt to Date
Clinical Document Architecture Implementations - Lessons Learnt to Date
 
Data Science BD2K Update for NIH
Data Science BD2K Update for NIH Data Science BD2K Update for NIH
Data Science BD2K Update for NIH
 
Cyber Resilient Energy Delivery Consortium - Overview
Cyber Resilient Energy Delivery Consortium - OverviewCyber Resilient Energy Delivery Consortium - Overview
Cyber Resilient Energy Delivery Consortium - Overview
 
ELIXIR . Technical Coordinator
ELIXIR. Technical CoordinatorELIXIR. Technical Coordinator
ELIXIR . Technical Coordinator
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
Ortiz internap
Ortiz internapOrtiz internap
Ortiz internap
 
Datanomics: the value of research data.
 Datanomics: the value of research data.  Datanomics: the value of research data.
Datanomics: the value of research data.
 
BD2K Update
BD2K UpdateBD2K Update
BD2K Update
 
Seven Degrees Presentation for 2015 ICEAA
Seven Degrees Presentation for 2015 ICEAASeven Degrees Presentation for 2015 ICEAA
Seven Degrees Presentation for 2015 ICEAA
 
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
 
Aitp presentation ed holub - october 23 2010
Aitp presentation   ed holub - october 23 2010Aitp presentation   ed holub - october 23 2010
Aitp presentation ed holub - october 23 2010
 
Commons credits model breakout
Commons credits model breakoutCommons credits model breakout
Commons credits model breakout
 
Williamson (2012 jan 10) IT project complexity, complication, and success - ...
Williamson (2012 jan 10) IT project complexity, complication, and success -  ...Williamson (2012 jan 10) IT project complexity, complication, and success -  ...
Williamson (2012 jan 10) IT project complexity, complication, and success - ...
 
Standards for clinical research data - steps to an information model (CRIM).
Standards for clinical research data - steps to an information model (CRIM).Standards for clinical research data - steps to an information model (CRIM).
Standards for clinical research data - steps to an information model (CRIM).
 
Rapidly Enable Tangible Business Value through Data Virtualization
Rapidly Enable Tangible Business Value through Data VirtualizationRapidly Enable Tangible Business Value through Data Virtualization
Rapidly Enable Tangible Business Value through Data Virtualization
 
Introduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data ScienceIntroduction to Data Science - Week 3 - Steps involved in Data Science
Introduction to Data Science - Week 3 - Steps involved in Data Science
 
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
Subscribing to Your Critical Data Supply Chain - Getting Value from True Data...
 
DOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that Matter
DOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that MatterDOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that Matter
DOES14 - Stephen Elliot - IDC - Delivering DevOps Business Metrics that Matter
 
Direct Project HIT Standards 10.27
Direct Project HIT Standards 10.27Direct Project HIT Standards 10.27
Direct Project HIT Standards 10.27
 

Recently uploaded

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 

Recently uploaded (20)

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 

Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?

  • 1. Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references? Joeran Beel, Corinna Breitinger, Stefan Langer iConference 2017 -- 2017/03/24, presented by Maria Gäde
  • 2. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 2 Outline 1. CC-IDF 2. Evaluation 1. Methodology 2. Results 3. Discussion
  • 3. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 3 1. CC-IDF: Common Citation – Inverse Document Frequency
  • 4. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 4 Introduction • Introduced in 1998 in the digital library and citation-indexing system CiteSeer (Bollacker, Lawrence, & Giles, 1998) • CC-IDF = TF-IDF applied to citations/references • Assumption: “if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents” (Giles et al., 1998) • Similar concept as Bibliographic Coupling (but different weighting than Coupling Strength) • In the original CC-IDF, the CC component is binary. • We focus on IDF
  • 5. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 5 Illustration CC-IDF != Bibliographic Coupling Strength Cited Document dcited 2 Bibliogr. Coupled Document dBC 3 Bibliogr. Coupled Document dBC 4 Which bibliographic-coupled document is more closely related to di? Cited Document dcited 1 Input Document di Bibliogr. Coupled Document dBC 1 Bibliogr. Coupled Document dBC 2 CC-IDF = 1/2CC-IDF = 1/3 CC-IDF = 1/3 CC-IDF = 1/3 + 1/2 = 5/6
  • 6. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 6 Use of CC-IDF • CC-IDF has been used in several recommender systems, and served as a baseline in many evaluations. • Ambiguous reports regarding the effectiveness of CC-IDF • Sometimes, CC-IDF was found to perform better and other times worse than simple bibliographic coupling and co-citation strength (Küçüktunç, Saule, Kaya, & Çatalyürek, 2012; Küçüktunç et al., 2013; Liang et al., 2011; Pan et al., 2015; Zhang et al., 2012) • Compared to more advanced approaches such as HITS, PaperRank, and Katz, CC-IDF performs usually poorly (Küçüktunç et al., 2012; Pan et al., 2015)
  • 7. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 7 Problem • CC-IDF has never been compared to CC-Only, i.e. a simple citation weighting scheme based only on the CC component and ignoring IDF. • The basic assumption underlying CC-IDF – namely that “if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents” – has never been evaluated for its effectiveness. --> Goal: Compare CC-IDF vs. CC-Only
  • 8. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 8 2. Evaluation
  • 9. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 9 Methodology • A/B Test in Docear’s research-paper recommender system. • Docear is a reference manager that allows users to manage references and PDF files, similar to Mendeley and Zotero. • One key difference is that Docear’s users manage their data in mind-maps. Users’ mind-maps contain links to PDFs, as well as the user’s annotations made within those PDFs. • To calculate CC-IDF, each mind map of a user was considered as one document.
  • 10. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 10 CC-IDF in Docear Recommendation Candidate Corpus Contains Contains Cites Cites Document d2 Document d1 Candidate c1 Cites Candidate c4 Which candidate is more related? Candidate c1 or candidates c2/c3/c4 ? Candidate c2 Cites Candidate c3 Document collection of User u
  • 11. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 11 A/B Test Design • Random Selection of • CC-IDF • CC-Only • Evaluation with click-through rates (CTR). • 238,681 recommendations to 3,561 users • January – September 2014. • All results are statistically significant (p<0.05), if not stated otherwise.
  • 12. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 12 Results • CC-Only: 6.23% • CC-IDF: 6.15% • Difference is statistically not significant 1 2 3 4 [5-9] [10-14] [15-24] [25-74] [75-149] >=150 CC-Only (Dspld Recs) 180 451 667 713 1,964 896 2,098 7,329 5,848 4,675 CC-IDF (Dspld Recs) 252 496 600 770 1,660 1,075 2,484 10,114 5,555 4,979 CC-Only (CTR) 3.13% 1.93% 4.74% 5.00% 7.21% 8.08% 6.50% 6.00% 6.83% 5.87% CC-IDF (CTR) 2.58% 3.86% 5.51% 5.62% 6.28% 8.03% 6.35% 6.64% 6.38% 4.92% - 2,000 4,000 6,000 8,000 10,000 12,000 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% Numberofdisplayed recommendations CTR Number of utilized references
  • 13. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 13 Potential Reasons 1. Citations always have some significance 2. Document Frequency (DF) has too little variation based on citations 3. New research articles have lower Document Frequency 4. No normalization for length of reference list
  • 14. Asst. Prof. Dr. Joeran Beel | Intelligent Systems | Data & Knowledge Engineering Group | beelj@tcd.ie 14 Outlook • More research with: • Traditional related-document scenario • Larger document corpus • Users with more documents in their collections • Higher quality of citation data • Variation of CC component
  • 15. Thank You Questions: Joeran Beel, beel@tcd.ie

Editor's Notes

  1. The citation-weighting scheme CC-IDF was introduced in 1998 in the digital library and citation-indexing system CiteSeer (Bollacker, Lawrence, & Giles, 1998; Giles, Bollacker, & Lawrence, 1998). CiteSeer offered a link for retrieving a list of “related documents” beside each search result, and the list of related documents was calculated, among others, using CC-IDF. CC-IDF stands for “Common Citation – Inverse Document Frequency” and it consists namely of the common citation frequency (CC) for a citation and the inverse frequency of documents in a corpus containing that citation (IDF). Using IDF to weight citations was a novel concept at that time and was inspired by TF-IDF, one of the most popular text-weighting schemes in information retrieval (Jones, 1972; Salton, Wong, & Yang, 1975). The assumption of IDF when applied to citations is that “if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents” (Giles et al., 1998). However, there is a difference between TF-IDF and the traditional CC-IDF measure. In TF-IDF, the term frequency TF expresses how often a term occurs in a particular document. In contrast, CC is a binary measure, which only specifies if a document contains (1) or does not contain (0) a reference.
  2. For a given input document di, a list of related documents must be identified. All documents that share at least one reference with di are considered potentially related, a concept also known as bibliographic coupling (BC). In the example, the bibliographically coupled documents are dBC1, dBC2, dBC3, and dBC4. According to CC-IDF, dBC1 and dBC2 are the least related documents to di, because they each share only one reference (dcited1) with di and this reference is cited in total by three documents in the corpus (dBC1, dBC2 and dBC3). Hence, for dBC1 and dBC2 CC-IDF calculates as 𝐶𝐶×𝐼𝐷𝐹( 𝑑 𝑖 , 𝑑 𝐵𝐶 1|2 )= 1 3 . In contrast, dBC4 also shares a single reference (dcited2) with di, but this reference is only cited twice in the corpus (namely by dBC3 and dBC4). Hence, 𝐶𝐶×𝐼𝐷𝐹 𝑑 𝑖 , 𝑑 𝐵𝐶 4 = 1 2 and dBC4 is regarded as more closely related to di than dBC1 and dBC2. In Figure 1, for all documents in the collection, document dBC3 is the most closely related to the input document di, because they share the two references dcited1 and dcited2. CC-IDF sums up the individual relatedness values, hence 𝐶𝐶×𝐼𝐷𝐹 𝑑 𝑖 , 𝑑 𝐵𝐶 3 = 1 2 + 1 3 = 5 6 . If the input document was considered to be part of the corpus, the number of documents would be four instead of three. However, for calculating document relatedness using CC-IDF it does not matter if the input document is counted or not. CC-IDF != Bibliographic Coupling Strength (BCC). For instance, BCC for d4 and d1 would be the same (i.e. 1) while CC-IDF is different
  3. We performed the first evaluation of TF-IDuF using the mind maps of Docear’s users as personal document corpora. We were positively surprised by the results.
  4. Documents of User = Mind Maps In the original CC-IDF approach, there is one input document for which a list of related documents is wanted, and related documents are found via bibliographic coupling with CC-IDF weighting. We utilized a user’s collection of mind-maps as input (instead of a single research paper), and we interpreted the link to, or reference of, a paper in a user’s collection as a citation of that paper. In addition, the original CC-IDF approach uses a binary weight for the CC component. We calculated CC as the frequency for how often a reference or link to a paper occurred in a user’s collection. More precisely, our recommender system only utilized a subset of the user’s most recently added documents.
  5. Whenever Docear wanted to diaplay recommendations, Docear randomly selected on of the two weighting schemes. We measured how often users clicked on the recommendations.
  6. there was no statistically significant difference between CC-Only (CTR = 6.23%) and CC-IDF (6.15%) (Table 1). The result remains the same when looking at different numbers of references being utilized (Figure 9). The effectiveness of CC-IDF and CC-Only is about the same. For instance, when a user model contained 15 to 24 references, CTR for CC-Only was 6.50% and for CC-IDF 6.35%. From the observed results, we would conclude that CC-IDF and CC-Only are equally effective, i.e. calculating IDF does not increase effectiveness compared to using CC-Only. Consequently, there would be little reason to use CC-IDF, because it is more complex to calculate than CC-Only.
  7. 1. Research papers usually contain thousands of unique terms. Consequently, it is important to identify the most descriptive terms. In contrast, a research paper usually contains few citations (maybe 5 or 10 for conference papers, or 30 for journal article, although this number can differ widely depending on the discipline). Consequently, the need – and the potential benefit – of identifying the most important citations is lower, because likely almost all references in an article will have some significance. 2. In a large corpus, some terms occur in millions of documents. In contrast, even the world’s most frequently occurring reference occurs only in 305,000 citing documents; and the vast majority of references occurs only in few documents, because typically research papers receive few citations (or none at all). Consequently, IDF values for citations will be within a smaller range than term-based IDF values. Therefore, we would expect IDF when applied to references to be less effective than IDF when applied to terms. 3. Older papers have more time to accumulate citations, while recently published papers typically have few or no citations. CC-IDF does not account for this, which could bias IDF calculations. For instance, consider the previous example of bibliographic coupling and CC-IDF (cf. section 2.3), but this time assume that dcited1 was published in 1982, and dcited2 was published in 2016 (Figure 10). CC-IDF would be 1/3 for dBC1…3 and 1 for dBC4. However, given the publication years, it would be expected that dcited1 has more citations than dcited2, and we intuitively would not believe, for instance, that dBC3 is less related to di than dBC4. We therefore suggest to analyze how CC-IDF performs when normalized by the documents’ publication years. 4. CC-IDF does not normalize for the number of entries in a bibliography and may provide different recommendations than a classic relative bibliographic-coupling strength (see section 2.3). In future research, we suggest comparing CC-IDF with relative bibliographic coupling strength and also to evaluate the effectiveness of a CC-IDF measure that normalizes for the number of entries in a bibliography. http://www.nature.com/news/the-top-100-papers-1.16224 To some extent, the same might be true for terms, but we assume the effect to be much stronger for citations.
  8. From the observed results, we would conclude that CC-IDF and CC-Only are equally effective, i.e. calculating IDF does not increase effectiveness compared to using CC-Only. Consequently, there would be little reason to use CC-IDF, because it is more complex to calculate than CC-Only. However, it is too early to draw such general conclusions from our results for the following reasons