Fletcher Series. 2016 Aug 26;1(1-10)
Abstracts Matter. But...
How much so?
Rascon CA1
1cynthia.alexander@gmail.com, San Francisco CA, 94105, USA.
Abstract
The number of times a scientific paper is cited (its citation count) has emerged as a proxy for a paper’s
success within its field. Here, I aim to address how relevant an abstract is to a scientific publication,
and furthermore which features of such abstracts have the largest impact on a paper’s success (as
estimated by citation count).
The data set comprised all abstracts of scientific papers from 22 top biotech journals published in
the period 1995-2016, a total of 310,175 papers. Journal names and the affiliations of laboratory
heads were not incorporated in this model, which aimed to be based solely on each abstract’s
title and content. Data cleaning and feature engineering, relying largely on NLP metrics (LSA, Tf-idf,
POS tagger), gave good insight into what better predicts citation count across the
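As a rough illustration of the Tf-idf step named above, here is a minimal standard-library sketch. The deck does not say which library, tokenizer, or weighting variant was used, so the smoothed idf formula, the `tokenize` helper, and the toy abstracts are all assumptions made for illustration. LSA would then apply a truncated SVD to the resulting term weights, which is not shown here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real pipeline
    # would likely also stem tokens and drop stop words.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy abstracts, not taken from the data set.
toy_abstracts = [
    "genome sequencing method for human cells",
    "gene expression in mouse cells",
    "sequencing data and genome assembly method",
]
vectors = tfidf(toy_abstracts)
```

Terms that occur in fewer abstracts receive a higher weight, so in the first toy abstract "human" (one document) outweighs "cells" (two documents).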
Biotech papers have a steady
trending curve
Figure 1. Number of citations per paper by year of publication. After cleaning, the corpus
comprises 202,173 abstracts. Each cyan dot represents a single paper
(transparency 0.3).
A journal’s prestige depends
on its impact factor
Figure 2. Journals used for the data set and the number of citations per paper
published between 1995-2010, shown as a violin plot. These differences reflect, to some
extent, each journal’s impact factor (the yearly average number of citations).
Figure 3. Final set of 134,374 papers (1995-2010). The
total number of citations per paper (target, y) was
binned into two classes: under 10, or 10 or more, total citations
since the paper’s publication date (0 or 1, respectively).
(left side: Example of an abstract and its citation count)
Abstracts binned into two classes:
0 for 1-9 (25%), or 1 for 10 or more (75%) total citations
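The binning rule described here can be sketched in a few lines. The function name, the default threshold argument, and the sample counts are hypothetical; only the rule itself (class 1 for 10 or more total citations, class 0 otherwise) comes from the slide.

```python
def citation_class(total_citations, threshold=10):
    # Class 1: 10 or more total citations; class 0: fewer than 10.
    return 1 if total_citations >= threshold else 0


# Hypothetical citation counts, for illustration only.
counts = [3, 9, 10, 42, 1]
labels = [citation_class(c) for c in counts]

# Share of papers in the positive class, useful for checking class balance.
share_positive = sum(labels) / len(labels)
```

With a 25%/75% split as reported on the slide, a classifier that always predicts class 1 would already be right three times out of four, which is worth keeping in mind when reading accuracy-style numbers.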
LSA, Tf-idf, and part-of-speech (POS) tagging
selected as star features, with Random
Forests as the model of choice
Figure 4. ROC and Precision/Recall curves for the top performing models.
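For readers unfamiliar with the two curve types, the sketch below computes the area under the ROC curve and a single precision/recall point using only the standard library. It is not the evaluation code behind the figure: the labels and scores are made up, and the pairwise formulation of AUC is chosen here for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting class 1 for scores >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

auc = roc_auc(y_true, scores)
precision, recall = precision_recall(y_true, scores, threshold=0.5)
```

Sweeping `threshold` over all observed scores and plotting the resulting pairs traces out the precision/recall curve.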
Modeling on the previous 5 years (2005-2009)
to predict the ‘success’ of 2010 papers:
Figure 5. ROC and Precision/Recall curves for the top performing models. This time, the
models were trained on 2005-2009 papers to predict the ‘success’ of 2010 papers.
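The train/test setup described here is a split by publication year rather than a random split. A minimal sketch follows; the record layout (dicts with `year` and `label` keys) and the toy records are assumptions, since the deck does not show how the data were stored.

```python
def temporal_split(papers, train_years, test_year):
    # Train on papers published in train_years; hold out test_year entirely,
    # so no information from the predicted year leaks into training.
    train = [p for p in papers if p["year"] in train_years]
    test = [p for p in papers if p["year"] == test_year]
    return train, test


# Hypothetical records, for illustration only.
papers = [
    {"year": 2004, "label": 1},
    {"year": 2005, "label": 0},
    {"year": 2007, "label": 1},
    {"year": 2009, "label": 1},
    {"year": 2010, "label": 0},
    {"year": 2010, "label": 1},
]

# range(2005, 2010) covers 2005 through 2009 inclusive.
train, test = temporal_split(papers, range(2005, 2010), 2010)
```

Splitting by year mirrors how such a model would be used in practice: fitted on the past, then applied to papers that have not yet accumulated citations.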
Features identified as important by RF for
predicting the success of the coming year’s papers:
Figure 6. Feature importances as ranked by Random Forests, for a model trained on 2005-2009 and
tested on 2010 papers. *Abstract LSA (100 comp.), **Abstract LSA on Tf-idf (100 comp.), ***Title LSA.
Feature labels, in the order listed: C2 (**), C2 (*), C4 (*), C7 (**), C4 (**), POS tag ‘:’, C8 (**), C5 (**), Abstract length, C3 (**), C1 (*), C31 (***), C15 (**), C15 (*), C14 (*), C16 (**), C3 (*), C6 (*), POS tag ‘.’, C29 (**).
1st – Next Generation Sequencing
sequenc: 0.20, method: 0.17, data: 0.16, genom: 0.16, avail: 0.14
2nd – Cellular regulation / gene expression
cell: 0.71, activ: 0.19, induc: 0.08, regul: 0.08, mice: 0.07
3rd – Cellular models (methods)
cell: 0.28, use: 0.23, data: 0.19, method: 0.17, model: 0.16
4th – Applied genomics (mutants)
genom: 0.25, sequenc: 0.25, protein: 0.19, mutant: 0.12, human: 0.11
5th – Basic research (DNA related)
gene: 0.28, dna: 0.27, rna: 0.20, transcript: 0.20, genom: 0.17
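Term lists like these are typically read off an LSA component by sorting its term weights. The sketch below assumes the listed numbers are component loadings ranked by absolute value, which the deck does not state. The first five weights are copied from the "1st" component above; the `cell` and `mice` weights and the function name are hypothetical.

```python
def top_terms(component, vocabulary, k=5):
    # Pair each (stemmed) term with its weight in this component and keep
    # the k terms with the largest absolute weight.
    ranked = sorted(
        zip(vocabulary, component),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return [(term, round(weight, 2)) for term, weight in ranked[:k]]


vocabulary = ["sequenc", "method", "data", "genom", "avail", "cell", "mice"]
# First five weights as listed for the 1st component; last two are hypothetical.
component = [0.20, 0.17, 0.16, 0.16, 0.14, 0.03, 0.01]

terms = top_terms(component, vocabulary)
```

The topic names on the slide ("Next Generation Sequencing", and so on) are human interpretations of such term lists and are not produced by the decomposition itself.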
Abstracts matter about:
81%
Need to consider:
Are better scientists simply better communicators?
Or… are great scientists also really good at
communicating?
I did not incorporate a feature to account for
novelty. (quite the opposite)
It is circular to say that the more papers exist in a field,
the more likely a paper in that field is to be cited in the future.
However, this suggests that trends exist in
academia. *duh*
Abstracts matter about:
81%
Future directions:
Multi-class case
Extend prediction forecast window. 2017??
Examine those abstracts on which the model did poorly.
Flask app to ‘score’ new abstracts.
Time series: model topic trends over time. Is it too
early or too late for a paper to come out?

Paper Abstracts Matter... But How much?


Editor's Notes

  1. The impact factor (IF) of an academic journal is a measure reflecting the yearly average number of citations to recent articles published in that journal.
  2. Took some time to get to this curve, data cleaning