Task 2: And the AAAI Fellowship Goes To
AAAI Fellowship: Try to determine if and when an AI scientist is qualified to be elected to the AAAI Fellowship. Data set:  –...
Task 2.1 – Leave One Scientist Out                            Criterion                        Average Performance        ...
Using a single measure                                                                              Fellows               ...
Task 2.2 – Predicting Next Year Fellows
Task 2.2 – Predicting Coming Fellows
Rules Example• (TC/A = (65.7085-inf)) and (TP/A = (26.084-inf)) and (Ih = (3.565-inf))  and (CpY = (13.191-inf)) => Fellow...
Task 2.3 – Social Network• Based on the idea of Erdos  number• Predict fellowship based  on co-authorship with  other fell...
Task 2.3                   Criterion                        Average Performance   Not Identifying a fellow (False Negative...
Task 2.3• (Count >= 5) and (CpP >= 7) and (TP/A >=  6.883) => Fellow=TRUE (51.0/3.0)• (TP/A >= 22.944) and (Avg <= 3.26666...
Conclusions – Take 2• Bibliometric measures can be used to  predict fellowship• Combining various measures using data  nin...
Very Near Future Work• Adding Google scholar dataset• Examine the contribution of conferences in  predicting the fellowshi...
Why God Never Received           Tenure at Any University1) He had only one major publication.2) It was in Hebrew.3) It ha...
•                                                 References    JOHAN BOLLEN, MARKO A. RODRIGUEZ, HERBERT VAN DE SOMPEL, J...
Publish or Perish:  Towards a Ranking of Scientists using Bibliographic Data Mining
Transcript of "Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data Mining"

  1. Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data Mining
     Lior Rokach, Department of Information Systems Engineering, Ben-Gurion University of the Negev
  2. About Me
     Prof. Lior Rokach
     Department of Information Systems Engineering, Faculty of Engineering Sciences
     Head of the Machine Learning Lab, Ben-Gurion University of the Negev
     Email: liorrk@bgu.ac.il
     http://www.ise.bgu.ac.il/faculty/liorr/
     PhD (2004) from Tel Aviv University
  3. Outline:
     • What is bibliometrics?
     • Short tutorial on bibliometrics measures
     • Our methodology: data mining
     • Task 1: Academic positions
     • Task 2: AAAI Fellowship
     • Results
     • Conclusions
  4. Ranking scientists, WHY?
     • Promotion
     • Tenure
     • Grants
     • Prizes
  5. Bibliometrics
     • "Man is an animal that writes letters"
       – Attributed to Lewis Carroll (Charles Dodgson)
     • A scientist is an animal that writes papers
     • Bibliometrics is the measurement of (scientific) publications
     • The simplest measure: number of publications
       – Disadvantage: counts quantity and disregards quality
  6. Publish or Perish
     "I don't mind your thinking slowly. I mind your publishing faster than you can think."
     (The Nobel laureate physicist Wolfgang Pauli)
  7. Metrics: Do metrics matter?
     • According to Abbott et al. (Nature, 2010):
       – Department heads say "No"
         • "External letters trump everything."
       – But ...
         • They admit that "those 'qualitative' letters of recommendation sometimes bring in quantitative metrics by the back door"
         • Most researchers (70%) believe metrics have an effect
  8. Quick Guide to Bibliometrics Measures
  9. Citation Index
     A citation index is an index of citations between publications, allowing the user to easily establish which later documents cite which earlier documents.
  10. The First Citation Index
     The first citation index is attributed to the Hebrew Talmud, dated to the [..]th century (Weinberg, 1997), while others refer to Shepard's Citations, created in 1873, as the first citation index.
  11. Simple Citations-Based Measures to Evaluate Scientists
     • Total citations (and its square root)
     • Total citations normalized by number of authors
     • Mean number of citations per year
     • Mean number of citations per paper
  12. Why citations are not always an ideal way to evaluate researchers' publications
     • Uncitedness: it is a sobering fact that some 90% of articles that have been published in academic journals are never cited. Even Nobel laureates have a rather large fraction (10% or more) of uncited publications (Egghe et al., 2011).
     • But when authors use the terms "uncited" or "seldom cited," they are usually referring to papers that are uncited or seldom cited in the journals monitored by Thomson Reuters and similar databases, not in all journals, books, and reports.
     • "Uncited" or "seldom cited" is not a synonym for "not used" (MacRoberts and MacRoberts, 2011).
     • Expert judgment is the best, and in the last resort the only, criterion of performance.
  13. A Brief History of Citation Analysis
     • 1955:
       – Eugene Garfield, linguist
       – Develops the impact factor
       – Founder of the Institute for Scientific Information (ISI)
     • 1997 (CiteSeer):
       – Lee Giles, Kurt D. Bollacker, Steve Lawrence
       – Crawls and harvests papers on the web
       – Focuses mainly on CS
     • 2004 (Google Scholar):
       – "Stand on the shoulders of giants"
       – Freely accessible web search engine for scholarly literature
     • 2005:
       – Jorge E. Hirsch, physicist
       – Develops the h-index
     • 2007:
       – Carl Bergstrom, biologist
       – Establishes http://eigenfactor.org/
       – Uses the PageRank algorithm to rank journals
  14. 1. Impact Factor (Garfield, 1955)
     • Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas
       – Garfield, E., Science, 1955, 122, 108-111
     • The impact factor of a journal, as used by Thomson Scientific, is the average number of citations received in a given year by the papers the journal published during the two preceding years.
     "The 2007 impact factor for journal ABC" =
       (number of times articles published in ABC during 2005-2006 were cited in indexed journals during 2007)
       / (number of "citable" articles published by ABC in 2005 and 2006)
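The ratio above can be written as a short Python sketch. The function name and the sample counts are hypothetical, chosen only to illustrate the arithmetic; they are not figures from the slides.

```python
def impact_factor(citations_in_year: int, citable_items: int) -> float:
    """Two-year impact factor for a target year.

    citations_in_year: citations received in the target year by items the
        journal published in the two preceding years.
    citable_items: number of "citable" items the journal published in
        those two preceding years.
    """
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations_in_year / citable_items


# Hypothetical journal: 300 citations in 2007 to its 150 citable
# articles from 2005 and 2006.
print(impact_factor(300, 150))
```

Because the denominator only counts "citable" items, the result depends on how the indexing service classifies a journal's content.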
  15. Criticisms of the Impact Factor
     • Subject variation: citation studies should be normalized to take into account variables such as field and discipline.
     • Long tail: the citation count of an individual paper is largely uncorrelated with the impact factor of the journal in which it was published.
     • Only a limited subset of journals is indexed
     • Biased toward English-language journals
     • Short (two-year) snapshot of a journal
     • Includes self-citations
     • Some journals unfairly promote their own papers
     • Journal inclusion criteria cover more than just quality
  16. Variations of Impact Factor and more:
     • Five-year Impact Factor
     • Cited Half-Life: measures the archival value of a journal. The Cited Half-Life of journal J in year X is the number of years after which 50% of the lifetime citations of J's content published in X have been received.
     • Ranking: journals are often ranked by impact factor within an appropriate Thomson Reuters subject category. A journal can be assigned to several subject categories and will rank differently in each, so a rank should always be read in the context of the subject category used.
     Other Journal Ranking:
     • Eigenfactor: an algorithm similar to Google's PageRank
       – Journals are considered influential if they are cited often by other influential journals
       – Removes self-citations
       – Looks at five years of data
  17. 2. H-Index (Hirsch, 2005; Egghe and Rousseau, 2006)
     • A scientist is said to have Hirsch index h if h of their N papers have at least h citations each, and the other (N - h) papers have at most h citations each.
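A minimal Python sketch of this definition follows. The function name and the sample citation counts are illustrative assumptions, not data from the talk.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical researcher with five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Sorting in decreasing order means the loop can stop at the first paper whose citation count falls below its rank.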
  18. • Using the H-Index for physicists, by Hirsch:
       – 10-12 → tenure decisions
       – 18 → a full professorship
       – 15-20 → a fellowship in the American Physical Society
       – 45 or higher → membership in the United States National Academy of Sciences
     • H-Index in IS (Clarke, 2008)
       – Using Google Scholar
  19. h ~ mn (m = gradient, n = number of years)
     1. m ~ 1, h = 20 after 20 years: "successful scientists"
     2. m ~ 2, h = 40 after 20 years: "outstanding scientists"
     3. m ~ 3, h = 60 (20 years) or h = 90 (30 years): "truly unique individuals"
     Physics Nobel prizes (last 20 years):
     • h (median) = 35
     • 84% had h ≥ 30
     • 49% had m < 1
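Under the relation h ~ mn, the gradient m can be estimated by dividing the h-index by career length in years. The sketch below is only that division; the function name is a hypothetical label.

```python
def m_quotient(h: int, years: float) -> float:
    """Gradient m in h ~ m * n, where n is the number of career years."""
    if years <= 0:
        raise ValueError("years must be positive")
    return h / years


# The slide's second example: h = 40 after 20 years gives m = 2.
print(m_quotient(40, 20))
```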
  20. Modified H-Index Metrics: Scientists with the same H-Index
     • Rational H-Index Distance (Ruane and Tol, 2008)
       First calculate how many new citations are needed to increase the h-index by one point. Let m denote the additional citations needed. The rational index is then hD = h + 1 - m/(2h+1).
     • Rational H-Index X (Ruane and Tol, 2008)
       A researcher has an h-index of h if h is the largest number of papers with at least h citations. However, some researchers may have more than h papers, say n, with at least h citations. Define x = n - h. The rational H-Index then becomes hX = h + x/(s-h), where s is the total number of publications.
     • e-index (Chun-Ting Zhang, 2009)
       The square root of the surplus of citations in the h-set beyond h^2, i.e., beyond the theoretical minimum required to obtain an h-index of h. The aim of the e-index is to differentiate between scientists with similar h-indices but different citation patterns.
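A sketch of two of these measures in Python, following the descriptions above. The function names are illustrative, and one detail is an assumption: when a researcher has fewer than h+1 papers, the missing paper is treated as having zero citations when counting m.

```python
import math


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


def e_index(citations):
    """Square root of the citations in the h-set in excess of h^2."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    return math.sqrt(sum(ranked[:h]) - h * h)


def rational_h_distance(citations):
    """hD = h + 1 - m / (2h + 1), m = citations still needed for h + 1."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    target = h + 1
    top = ranked[:target]
    # Assumption: a paper that does not exist yet counts as 0 citations.
    top += [0] * (target - len(top))
    m = sum(max(0, target - count) for count in top)
    return h + 1 - m / (2 * h + 1)


sample = [10, 8, 5, 4, 3]  # hypothetical citation counts, h = 4
print(e_index(sample), rational_h_distance(sample))
```

For the sample, the h-set holds 27 citations, so the surplus over h^2 = 16 is 11; three more citations are needed to reach h = 5, giving hD = 5 - 3/9.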
  21. Modified H-Index Metrics: To share the fame in a fair way for multi-authored manuscripts
     • Individual h-index (Batista et al., 2006)
       Divides the standard h-index by the average number of authors in the articles that contribute to the h-index, in order to reduce the effects of co-authorship.
     • Norm Individual h-index
       First normalizes the number of citations for each paper by dividing it by the number of authors of that paper, then calculates hI,norm as the h-index of the normalized citation counts. This approach is much more fine-grained than that of Batista et al.: it accounts more accurately for any co-authorship effects that might be present, and it is a better approximation of the per-author impact, which is what the original h-index set out to provide.
     • Schreiber Individual h-index (Schreiber, 2008)
       Schreiber's method uses fractional paper counts (for example, only one third of a paper for three authors) instead of reduced citation counts to account for shared authorship, and then determines the multi-authored hm index from the resulting effective rank of the papers, using undiluted citation counts.
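The first two co-authorship corrections can be sketched in Python as follows. Each paper is assumed to be a (citations, number of authors) pair; that representation, the function names, and the sample data are hypothetical.

```python
def h_index(values):
    """Largest h such that h entries are each at least h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value < rank:
            break
        h = rank
    return h


def individual_h_index(papers):
    """h divided by the mean number of authors of the h-set papers."""
    ranked = sorted(papers, key=lambda paper: paper[0], reverse=True)
    h = h_index([citations for citations, _ in ranked])
    if h == 0:
        return 0.0
    mean_authors = sum(authors for _, authors in ranked[:h]) / h
    return h / mean_authors


def normalized_individual_h_index(papers):
    """h-index of the per-author citation counts (citations / authors)."""
    return h_index([citations / authors for citations, authors in papers])


# Hypothetical papers as (citations, number_of_authors).
papers = [(10, 2), (8, 1), (5, 5), (4, 2), (3, 1)]
print(individual_h_index(papers), normalized_individual_h_index(papers))
```

In this sample the five-author paper drops out of the normalized ranking, which is the finer-grained behaviour the slide describes.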
  22. Modified H-Index Metrics: Age Adjusted
     • Contemporary h-index (Sidiropoulos et al., 2006)
       Adds an age-related weighting to each cited article, giving less weight to older articles. The weighting is parametrized: with gamma = 4 and delta = 1, the citations of an article published during the current year count four times, the citations of an article published 4 years ago count only once, the citations of an article published 6 years ago count 4/6 times, and so on.
     • AR-index (Jin, 2007)
       An age-weighted citation rate, where the number of citations to a given paper is divided by the age of that paper. Jin defines the AR-index as the square root of the sum of all age-weighted citation counts over all papers that contribute to the h-index.
     • AWCR
       Like the AR-index, but summed over all papers. In particular, it allows younger and as yet less-cited papers to contribute to the AWCR even though they may not yet contribute to the h-index.
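A Python sketch of the three age-adjusted measures follows. It assumes the weight gamma * age^(-delta) with an inclusive age (a paper from the current year has age 1), which reproduces the slide's figures of 4, 1, and 4/6 for ages 1, 4, and 6. The function names, the (citations, year) representation, and the sample data are hypothetical, and AWCR is taken as the plain sum without a square root.

```python
import math


def h_index(values):
    """Largest h such that h entries are each at least h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value < rank:
            break
        h = rank
    return h


def inclusive_age(year, now):
    """Assumed convention: current-year papers have age 1."""
    return now - year + 1


def contemporary_h_index(papers, now, gamma=4.0, delta=1.0):
    """h-index of the age-weighted scores gamma * age^(-delta) * citations."""
    scores = [gamma * inclusive_age(year, now) ** (-delta) * citations
              for citations, year in papers]
    return h_index(scores)


def awcr(papers, now):
    """Age-weighted citation rate summed over all papers."""
    return sum(citations / inclusive_age(year, now)
               for citations, year in papers)


def ar_index(papers, now):
    """Square root of the age-weighted citations over the h-set only."""
    ranked = sorted(papers, key=lambda paper: paper[0], reverse=True)
    h = h_index([citations for citations, _ in ranked])
    return math.sqrt(awcr(ranked[:h], now))


# Hypothetical papers as (citations, publication_year), evaluated in 2010.
papers = [(20, 2010), (12, 2007), (6, 2005), (10, 1991)]
print(contemporary_h_index(papers, 2010), awcr(papers, 2010),
      ar_index(papers, 2010))
```

In the sample, the 1991 paper counts toward the plain h-index of 4 but its weighted score is too low to count toward the contemporary h-index.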
  23. Revised H-Index Metrics: Others
     • AWCRpA
       The per-author age-weighted citation rate is similar to the plain AWCR, but is normalized to the number of authors of each paper.
     • g-Index (Leo Egghe, 2006)
       Given a set of articles ranked in decreasing order of the number of citations they received, the g-index is the (unique) largest number such that the top g articles received (together) at least g^2 citations. It aims to improve on the h-index by giving more weight to highly cited articles.
     • Pi-index (Vinkler, 2009)
       The pi-index is equal to one hundredth of the number of citations obtained by the top square root of the total number of journal papers (the "elite set of papers"), ranked by decreasing number of citations.
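The g-index definition can be sketched in Python with a running total of citations. This version assumes g cannot exceed the number of papers, which is one common reading of the definition; the function name and sample data are illustrative.

```python
def g_index(citations):
    """Largest g such that the top g papers have >= g^2 citations in total.

    Assumption: g is capped at the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    g = 0
    running_total = 0
    for rank, count in enumerate(ranked, start=1):
        running_total += count
        if running_total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts: the h-index is 4, the g-index is 5.
print(g_index([10, 8, 5, 4, 3]))
```

Because the total is cumulative, one very highly cited paper can raise g well above h, which is the extra weight the slide refers to.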
  28. Limitations of H-Index
     • The h-index ignores the importance of the publications
       – Évariste Galois' h-index is 2, and will remain so forever.
       – Had Albert Einstein died in early 1906, his h-index would be stuck at 4 or 5, despite his high reputation at that date.
     • It ignores the context of citations:
       – Some papers are cited to flesh out the introduction (related work)
       – Some citations are made in a negative context
     • Gratuitous authorship
  29. Education Subject Category…
  30. Eigenfactor.org Scores
     • Eigenfactor score: the higher the better
       – A measure of the overall value provided by all of the articles published in a given journal in a year; accounts for differences in prestige among citing journals. A measure of the journal's total importance to the scientific community.
       – Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson's Journal Citation Reports (JCR) is 100.
     • Article Influence score: the higher the better
       – Article Influence measures the average influence, per article, of the papers in a journal. As such, it is comparable to the Impact Factor.
       – Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports (JCR) database has an article influence of 1.00.
       – Still, it is best to compare within subjects.
     • Cost effectiveness: the lower the better
       – price / eigenfactor [2006 data]
  31. Other Journal Ranking Efforts: SCImago Journal Rank (SJR)
     Similar to eigenfactor methods, but based on citations in Scopus
       – Freely available at scimagojr.com
       – More journals (~13,500)
       – More international diversity
       – Uses the PageRank algorithm (like eigenfactor.org)
       – 3 years of citations; no self-citations
       – But: Scopus only has citations back to ~1995
  32. SCImago
  33. SCImago Journal Indicator Search…
  34. SCImago Journal Search (Agronomy Journal)
  35. A Few Other Journal Ranking Proposals: many would like to use journal usage stats
     • Usage Factors: based on journal usage (COUNTER stats [Counting Online Usage of Networked Electronic Resources]) uksg.org/usagefactors/final
     • Y factor: a combination of the impact factor and the weighted PageRank developed by Google (Bollen et al., 2006)
     • MESUR: MEtrics from Scholarly Usage of Resources
       – Uses citations and COUNTER stats, http://www.mesur.org/MESUR.html
  36. Other Measures for Evaluating Researchers (Tang et al., 2008)
     • Uptrend: nothing catches people's eyes more than a rising star. Uptrend measures are used to quantify how fast a researcher is rising.
     • The measure uses information about each of the author's papers, including the publication date and the impact factor of the conference. We use the least squares method to fit a curve to the papers published in the most recent N years. We then use the curve to predict the researcher's score in the next year, which is defined as the Uptrend score.
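The fit-and-extrapolate step can be sketched in Python. The formula itself is missing from the slide, so the sketch makes two assumptions: the fitted curve is a straight line, and the input is one aggregate score per year (for example, the summed conference impact factors of that year's papers). The function name and the data are hypothetical.

```python
def uptrend_score(yearly_scores):
    """Fit a least-squares line to the last N yearly scores and
    extrapolate it one year ahead.

    Assumptions: straight-line fit; yearly_scores[i] is the aggregate
    score for year i of the window, oldest first.
    """
    n = len(yearly_scores)
    if n < 2:
        raise ValueError("need at least two years of data")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(yearly_scores) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, yearly_scores))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # predicted score for the next year


# Hypothetical steadily rising researcher.
print(uptrend_score([1.0, 2.0, 3.0, 4.0]))
```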
  37. Other Measures for Evaluating Researchers (Tang et al., 2008)
     • Activity: a person's activity is defined simply on the basis of the papers they published in recent years. We consider the importance of each paper and define the activity score as:
  38. Other Measures for Evaluating Researchers (Tang et al., 2008)
     • Diversity: in general, an expert's research may span several different research fields. Diversity is defined to reflect this degree quantitatively. In particular, we first use the author-conference-topic model (Tang et al., 2008) to obtain the research fields of each expert.
  39. Other Measures for Evaluating Researchers (Tang et al., 2008)
     • Sociability: the sociability score is defined basically by how many coauthors an expert has. We define the score as:
     • where #copaper_c denotes the number of papers coauthored by the expert and coauthor c. In the next step, we will further consider location, organization, nationality, and research fields.
  40. Richard Van Noorden (2010)
  41. Bibliometrics Predictive Power
     • Prediction of Nobel laureates:
       – Thomson Reuters Citation Laureates rank among the top 0.1% of researchers in their fields, based on citations of their published papers over the last two decades.
       – Since 2002, 12 of those named Thomson Reuters Citation Laureates have gone on to win Nobel Prizes.
     • Jensen et al. (2009) used bibliometric measures to predict which CNRS researchers would be promoted:
       – the h-index leads to 48% of "correct" promoted scientists
       – the number of citations gives 46%
       – the number of published papers gives only 42%
  42. Research Questions
     • Primary questions:
       – To what extent do bibliometrics reflect scientists' ranking in CS?
       – Which single measure is the best predictor?
       – How should different measures be combined?
     • Secondary questions:
       – Which types of manuscripts should be taken into consideration?
       – Does self-citation really matter?
       – Which citation index is better?
  43. Research Methods
     • Retrospective analysis of scientists' careers:
       – Correlating academic positions with bibliometric values that evolve over time
       – AAAI Fellowship
     • Using data mining techniques to build:
       – A snapshot classifier for ranking scientists by academic position
       – A decision-making model for promoting scientists
       – A classifier for deciding who should be awarded the AAAI Fellowship each year
     • Comparative analysis
  44. Process
  45. ISI Web of Knowledge
     • Coverage
       – Most journals (13,000 journals)
       – Some conferences (192,000 conference proceedings)
       – Almost no books (5,000 books)
       – All patents (23 million patents)
       – 256 subject categories in Science, Social Sciences, and Arts and Humanities, covering the full range of scholarship and research
       – Many citations (716 million). Only citations that fully match are …
     • Accuracy
       – Very few errors
       – Very few missing values
       – No duplications
  46. Google Scholar
     • Coverage
       – The largest
       – Still has limited coverage of pre-1990 publications
       – Criticized for including gray literature in its citation counts (Sanderson, 2008)
     • Accuracy
       – Missing values
       – Wrong values
       – Duplicate entries
  47. Why CS?
     • Variety of sub-fields with different citation patterns (Bioinformatics vs. AI)
     • Different types of important manuscripts (journals, conferences, books, chapters, patents, etc.)
     • Evolving field (senior professors completed their PhDs in other fields)
     • We are personally interested in this field
  48. Task 1: Nominating Committee
  49. Inclusion/Exclusion Criteria
     47 researchers
       – Researchers from Stanford, MIT, Berkeley and Yale
       – Completed their PhD after 1970
       – Researcher name can be disambiguated
       – CV:
         • Promotion years are known
         • No shortcuts in the career
       – Total of 724 "research years"
     • ISI: total number of items: 50K (2,300 written by the targeted researchers)
     • Google Scholar: total number of items: 300K
  50. H-Index Over Time (for 7 professors)
     [Chart: ISI h-index (y-axis, 0 to 18) against years from PhD (x-axis, 0 to 28) for Bejerano, Devadas, Gifford, Goldberg, Hudak, Sudan, and Tenenbaum]
  51. Citations Over Time (for 7 professors)
     [Chart: average ISI total citations (y-axis, 0 to 1000) against years from PhD (x-axis, 0 to 28) for Bejerano, Devadas, Gifford, Goldberg, Hudak, Sudan, and Tenenbaum]
  52. Evaluation
     • Procedure: Leave One Researcher Out
     • Base classifier: Logistic Regression
         ln(odds) = b + w^T x
         p = 1 / (1 + e^{-(b + w^T x)})
     • Publication type (for the cited and the citing manuscripts): All or Journals
     • Self-citations:
       – All
       – Self-Citation 1 (the target researcher is not one of the authors)
       – Self-Citation 2 (no overlap between the original set of authors and the authors of the citing paper)
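A pure-Python sketch of the two ingredients named on this slide: the logistic model and the leave-one-researcher-out split. The record layout (researcher id, feature vector, label) and all names are hypothetical, and training is left out; the sketch only shows how every yearly snapshot of the held-out researcher is kept out of the training set.

```python
import math


def predict_probability(weights, bias, features):
    """p = 1 / (1 + exp(-(b + w^T x))) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def leave_one_researcher_out(records):
    """Yield (held_out_id, train, test) splits, one per researcher.

    records: list of (researcher_id, features, label) tuples, typically
    one tuple per researcher per year (an assumed layout).
    """
    researcher_ids = sorted({record[0] for record in records})
    for held_out in researcher_ids:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test


# Hypothetical snapshots: (researcher, [h-index], position).
records = [
    ("A", [3.0], "Assistant"), ("A", [9.0], "Associate"),
    ("B", [5.0], "Assistant"), ("C", [14.0], "Full"),
]
for held_out, train, test in leave_one_researcher_out(records):
    print(held_out, len(train), len(test))
```

Splitting by researcher rather than by row matters here because consecutive yearly snapshots of the same person are strongly correlated.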
  53. Task 1.1: Ranking Researchers
     • Assign a researcher to one of the following positions, given only a snapshot of her bibliometric measures:
       – Post
       – Assistant
       – Associate
       – Full
     • Note that the classifier is not aware of the scientist's previous position or seniority.
     • Default accuracy = 35%
     [Pie chart of position shares: Full, Assistant, Associate, Post]
  54. The Ranking Task: Results, Top 10 Measures
     Accuracy  Source  Cited type  Citing type  Self-citation level  Measure
     59.95%    ISI     Journal     Journal      1                    g-Index
     59.30%    ISI     Journal     Journal      0                    g-Index
     59.30%    ISI     Journal     Journal      2                    g-Index
     58.65%    ISI     All         Journal      0                    Norm h-index
     58.65%    ISI     All         Journal      1                    Norm h-index
     58.65%    ISI     All         Journal      2                    Norm h-index
     58.00%    ISI     Journal     Journal      1                    Norm h-index
     57.74%    ISI     Journal     Journal      0                    Norm h-index
     57.74%    ISI     Journal     Journal      2                    Norm h-index
     57.48%    Google  Journal     Journal      2                    Rational H Index X
  55. The Ranking Task: Results, Least Predictive Measures
     Accuracy  Source  Cited type  Citing type  Self-citation level  Measure
     37.06%    Google  Journal     *            *                    # Publications
     37.06%    Google  Journal     *            *                    Individual # Publications
     37.19%    Google  Journal     Journal      0                    Schreiber h-index
     38.10%    Google  All         All          1                    Individual h-index
     38.10%    Google  All         All          2                    Individual h-index
     38.10%    ISI     All         All          1                    Schreiber h-index
     38.23%    ISI     All         All          0                    Schreiber h-index
     38.23%    ISI     All         All          2                    Schreiber h-index
     38.75%    ISI     All         Journal      0                    Schreiber h-index
     38.75%    ISI     All         Journal      2                    Schreiber h-index
     * Statistical significance has been found
  56. 56. Not by bibliometrics alone
Predictor: Years from PhD. Accuracy = 73.7%!
| Actual \ Predicted | Full | Associate | Assistant | Post |
|---|---|---|---|---|
| Post | 0 | 0 | 56 | 3 |
| Assistant | 0 | 36 | 167 | 15 |
| Associate | 29 | 145 | 31 | 1 |
| Full | 252 | 31 | 3 | 0 |
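The 73.7% figure can be checked directly from the confusion matrix on this slide: accuracy is the number of correctly predicted researchers divided by the total. The sketch below uses only the counts shown on the slide; the variable names are illustrative.

```python
# Rows: actual rank. Columns: predicted rank, in the slide's order.
labels = ["Full", "Associate", "Assistant", "Post"]
confusion = {
    "Post":      [0, 0, 56, 3],
    "Assistant": [0, 36, 167, 15],
    "Associate": [29, 145, 31, 1],
    "Full":      [252, 31, 3, 0],
}

total = sum(sum(row) for row in confusion.values())
# A prediction is correct when the predicted column matches the actual row.
correct = sum(confusion[label][labels.index(label)] for label in labels)
accuracy = correct / total

# Per-class recall: share of each actual rank that was predicted correctly.
recall = {
    label: confusion[label][labels.index(label)] / sum(confusion[label])
    for label in labels
}

print(f"{correct}/{total} = {accuracy:.1%}")  # 567/769 = 73.7%
```

The per-class recall shows where the errors concentrate: only 3 of the 59 actual Post researchers are classified as Post, while most are predicted as Assistant.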
  57. 57. Task 1.2: Promoting Researchers
• Given the researcher's current position and her bibliometric measures, decide if she should be promoted.
• Measure the absolute deviation in years from the actual promotion time.
  58. 58. Promotion Decision Task – Results
(The Assistant, Associate, Full, and Average columns give the absolute deviation in years from the actual promotion time.)
| Measure | Calculated as | Source | Self-Citations Level | Cited Manuscript Type | Citing Manuscript Type | Assistant | Associate | Full | Average |
|---|---|---|---|---|---|---|---|---|---|
| Rational H-Index 1 | Absolute Value | Google | 1 | All | Journal | 1.26 | 1.58 | 1.88 | 1.51 |
| Total Citations | Change from Last Rank | Google | 0 | Journal | All | 1.26 | 1.68 | 1.88 | 1.55 |
| Total Citations | Change from Last Rank | Google | 2 | Journal | All | 1.26 | 1.68 | 1.88 | 1.55 |
| Total Citations | Change from Last Rank | Google | 1 | Journal | All | 1.26 | 1.71 | 1.88 | 1.56 |
| Norm Individual H-Index | Change from Last Rank | Google | | All | Journal | 1.28 | 1.74 | 1.79 | 1.56 |
| … | … | … | … | … | … | … | … | … | … |
| Individual H Index | Change from Last Rank | Google | 1 | Journal | Journal | 1.30 | 2.03 | 2.38 | 1.80 |
| Contemporary H Index | Absolute Value | Google | 1 | Journal | All | 1.46 | 2.00 | 2.17 | 1.81 |
* No statistical significance has been found.
* In about 2% of the cases, our system did not recommend promoting a researcher although the promotion actually took place.
  59. 59. Not by bibliometrics alone
[Figure: Improvement vs. Rank, plotted for 1. Assistant, 2. Associate, 3. Full; y-axis from -30% to 25%]
• Promoted to Associate: 6 years from PhD
• Promoted to Full: 13 years from PhD
| Measure | Assistant | Associate | Full | Average |
|---|---|---|---|---|
| Rational H-Index 1 | 1.26 | 1.58 | 1.88 | 1.51 |
| Years from PhD | 1.02 | 1.72 | 2.38 | 1.45 |
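The bar heights of the "Improvement vs. Rank" chart are not recoverable from the transcript. One plausible reading, and it is only an assumption, is that improvement is the relative reduction in absolute deviation when Rational H-Index 1 replaces the Years from PhD baseline. Under that assumption the two table rows give values that fall inside the chart's -30% to 25% axis range:

```python
# Absolute deviation in years, copied from the table on this slide.
years_from_phd = {"Assistant": 1.02, "Associate": 1.72, "Full": 2.38}
rational_h = {"Assistant": 1.26, "Associate": 1.58, "Full": 1.88}

# ASSUMPTION: improvement = (baseline - measure) / baseline.
# The slide does not state the formula; this is an illustrative guess.
improvement = {
    rank: (years_from_phd[rank] - rational_h[rank]) / years_from_phd[rank]
    for rank in years_from_phd
}

for rank, value in improvement.items():
    print(f"{rank}: {value:+.1%}")
```

Read this way, the bibliometric measure does worse than seniority for the Assistant rank and better for the Associate and Full ranks.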
  60. 60. Google Scholar vs. ISI Thomson
  61. 61. Google Scholar vs. ISI Thomson
  62. 62. Self-Citations
  63. 63. Which Manuscripts Should be Taken into Consideration?
  64. 64. Which Citing Manuscripts Should be Taken into Consideration?
  65. 65. Conclusions – Take 1
• Seniority is a good predictor of promotion at leading US universities.
• Variation in bibliometric measures among scientists contributes only slightly to promotion timing.
• There is no significant difference between ISI and Google.
• Self-citations are not so important.
• After all, journals are more reliable than other publications.
  66. 66. Task 2: And the AAAI Fellowship Goes To
  67. 67. AAAI Fellowship
Try to determine if and when an AI scientist is qualified to be elected to the AAAI Fellowship.
Data set:
  – 92 researchers who won the award from 1995 to 2009 only
  – 200 randomly selected AI researchers with at least 5 papers in top-tier AI journals/conferences
  – Using ISI data
    • Google Scholar coming soon
  68. 68. Task 2.1 – Leave One Scientist Out
| Criterion | Average Performance |
|---|---|
| Not identifying a fellow (False Negative) | 21% |
| Wrongly identifying a non-fellow (False Positive) | 8.2% |
  69. 69. Using a single measure
Fellows: H-Index
| Criterion | Average Performance |
|---|---|
| Not identifying a fellow (False Negative) | 48% |
| Wrongly identifying a non-fellow (False Positive) | 6.1% |
  70. 70. Task 2.2 – Predicting Next Year Fellows
  71. 71. Task 2.2 – Predicting Coming Fellows
  72. 72. Rules Example
• (TC/A = (65.7085-inf)) and (TP/A = (26.084-inf)) and (Ih = (3.565-inf)) and (CpY = (13.191-inf)) => FellowWon=TRUE (49.0/5.0)
• (Pi = (0.645-inf)) and (AWCR = (1.0555-3.6035]) and (TC/A = (80.875-inf)) => FellowWon=TRUE (29.0/3.0)
• (TP = (7.5-inf)) and (e = (6.595-inf)) and (TP = (47.5-inf)) and (AWCR = (1.0735-3.849]) and (AWCRpA = (2.1705-inf)) and (SIh = (0.5-3.5]) => FellowWon=TRUE (18.0/1.0)
• …
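To make the rule notation concrete, the sketch below encodes the first two rules as interval tests and applies them to made-up researchers. It assumes that an interval such as (65.7085-inf) means "strictly greater than 65.7085" and that (1.0555-3.6035] means "greater than 1.0555 and at most 3.6035". The feature values and the helper name `predicts_fellow` are hypothetical.

```python
import math

# Each rule is a list of (feature, low, high) conditions, read as low < value <= high.
# Thresholds are copied from the first two rules on the slide.
RULES = [
    [("TC/A", 65.7085, math.inf), ("TP/A", 26.084, math.inf),
     ("Ih", 3.565, math.inf), ("CpY", 13.191, math.inf)],
    [("Pi", 0.645, math.inf), ("AWCR", 1.0555, 3.6035),
     ("TC/A", 80.875, math.inf)],
]

def predicts_fellow(features, rules=RULES):
    """Return True if any rule has all of its conditions satisfied."""
    return any(
        all(low < features[name] <= high for name, low, high in rule)
        for rule in rules
    )

# Made-up feature vectors for illustration only.
strong = {"TC/A": 90.0, "TP/A": 30.0, "Ih": 5.0, "CpY": 20.0, "Pi": 0.2, "AWCR": 0.5}
weak = {"TC/A": 10.0, "TP/A": 4.0, "Ih": 1.0, "CpY": 2.0, "Pi": 0.1, "AWCR": 0.5}

print(predicts_fellow(strong))  # True: the first rule fires
print(predicts_fellow(weak))    # False: no rule fires
```

A rule list like this is easy to audit, since every positive prediction can be traced to the specific thresholds that triggered it.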
  73. 73. Task 2.3 – Social Network
• Based on the idea of the Erdős number
• Predict fellowship based on co-authorship with other fellows.
• http://academic.research.microsoft.com/VisualExplorer.aspx#1802181&84132
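An Erdős-style number is the length of the shortest co-authorship path between two people. As a rough sketch of how such a feature could be derived, the code below runs a breadth-first search from a researcher to the nearest fellow. The graph, the names, and the choice of "distance to nearest fellow" as the feature are assumptions for illustration; the slides do not specify the exact features used.

```python
from collections import deque

def distance_to_nearest_fellow(coauthors, fellows, start):
    """Shortest co-authorship distance from start to any fellow, or None."""
    if start in fellows:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for neighbor in coauthors.get(person, ()):
            if neighbor in seen:
                continue
            if neighbor in fellows:
                return dist + 1
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return None  # no co-authorship path to any fellow

# Toy undirected co-authorship graph (made up).
coauthors = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "fay"],
    "fay": ["cat"],
    "dan": [],
}
fellows = {"fay"}

print(distance_to_nearest_fellow(coauthors, fellows, "ann"))  # 3
print(distance_to_nearest_fellow(coauthors, fellows, "dan"))  # None
```

Breadth-first search visits researchers in order of increasing distance, so the first fellow it reaches is guaranteed to be a nearest one.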
  74. 74. Task 2.3
| Model | Not identifying a fellow (False Negative) | Wrongly identifying a non-fellow (False Positive) |
|---|---|---|
| Social network (co-authorship) | 52% | 6.6% |
| + Bibliometric measures (Task 2.1) | 21% | 8.2% |
| = Combined | 16% | 5.9% |
  75. 75. Task 2.3
• (Count >= 5) and (CpP >= 7) and (TP/A >= 6.883) => Fellow=TRUE (51.0/3.0)
• (TP/A >= 22.944) and (Avg <= 3.266667) and (TP <= 40) => Fellow=TRUE (23.0/3.0)
• (Count >= 5) and (e >= 7.071) and (CpP <= 1.618) => Fellow=TRUE (11.0/1.0)
• …
  76. 76. Conclusions – Take 2
• Bibliometric measures can be used to predict fellowship.
• Combining various measures using data mining techniques improves prediction power.
• Co-authorship relations can slightly boost the accuracy.
  77. 77. Very Near Future Work
• Adding the Google Scholar dataset
• Examining the contribution of conferences in predicting the fellowship
• Tell Me Who Cites You, …
  78. 78. Why God Never Received Tenure at Any University
1) He had only one major publication.
2) It was in Hebrew.
3) It had no references.
4) It wasn't published in a refereed journal.
5) Some even doubt he wrote it himself.
6) It may be true that he created the world, but what has he done since then?
7) His cooperative efforts have been quite limited.
8) The scientific community has had a hard time replicating his results.
9) He never applied to the Ethics Board for permission to use human subjects.
10) When an experiment went awry, he tried to cover it up by drowning the subjects.
11) When subjects didn't behave as predicted, he deleted them from the sample.
12) He rarely came to class, just told students to read the book.
13) Some say he had his son teach the class.
14) He expelled his first two students for learning.
15) Although there were only ten requirements, most students failed his tests.
16) His office hours were infrequent and usually held on a mountaintop.
  79. 79. References
• Bollen, J., Rodriguez, M. A., Van de Sompel, H. (2006), Journal status, Scientometrics, 69 (3) : 669–687.
• Christenson, J. A., Sigelman, L. (1985), Accrediting knowledge: Journal stature and citation impact in social science, Soc. Sci. Quart., 66 : 964–975.
• Van Raan, A. F. J. (2006), Performance-related differences of bibliometric statistical properties of research groups: cumulative advantages and hierarchically layered networks, Journal of the American Society for Information Science and Technology, 57 (14) : 1919–1935.
• Epstein, D. (2007), Impact factor manipulation, The Write Stuff, 16 : 133–134.
• Andrade, A., González-Jonte, R., Campanario, J. M. (2009), Journals that increase their impact factor at least fourfold in a few years: The role of journal self-citations, Scientometrics, 80 (2) : 517–530.
• Vinkler, P. (2009), The pi-index: a new indicator for assessing scientific impact, Journal of Information Science, 35 (5) : 602–612.
• Vinkler, P. (2001), An attempt for defining some basic categories of scientometrics and classifying the indicators of evaluative scientometrics, Scientometrics, 50 (3) : 539–544.
• Jacso, P. (2008), Testing the Calculation of a Realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster, Library Trends, 56 (4) : 784–815.
• Merton, R. K. (1968), The Matthew Effect in Science, Science, 159 (3810) : 56–63.
• Beel, J., Gipp, B. (2008), The Potential of Collaborative Document Evaluation for Science, in 11th International Conference on Asian Digital Libraries (ICADL08), Lecture Notes in Computer Science (LNCS), G. Buchanan, M. Masoodian, and S. J. Cunningham, Eds., vol. 5362, Heidelberg (Germany): Springer, pp. 375–378.
• Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z. (2008), ArnetMiner: Extraction and mining of academic social networks, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 990–998, ACM.
• Weinberg, B. H. (1997), The Earliest Hebrew Citation Indexes, Journal of the American Society for Information Science, 48 (4) : 318–330.
• Van Noorden, R. (2010), A profusion of measures, Nature, Vol. 465.
• Egghe, L., Guns, R., Rousseau, R. (2011), Thoughts on Uncitedness: Nobel Laureates and Fields Medalists as Case Studies.
• MacRoberts, M. H., MacRoberts, B. R. (2011), Problems of Citation Analysis: A Study of Uncited and Seldom-Cited Influences.
