Whose articles cite a body of work? Is this a high-impact journal? How might others assess my scholarly impact? Citation analysis is one of the primary methods used to answer these questions.
This document provides an introduction to bibliometrics for researchers. It looks at methods of identifying and interpreting research performance data as a measure of research impact. The intended outcomes are to use citation analysis tools to evaluate research impact, understand the limitations of bibliometrics, and apply publishing strategies to improve citation performance. The format covers an introduction to research evaluation, citation impact, journal impact factors, caveats to bibliometrics, and publishing strategies including open access, and it closes with exercises on finding citation counts and impact factors.
The document discusses various research methods and survey design. It provides information on survey research classifications including exploratory, descriptive, and explanatory surveys. The survey research process is outlined including design, data collection, analysis, and presentation of results. Issues to consider for survey research such as the research question, population, sampling, questions, and biases are presented. Probability and non-probability sampling methods are covered along with examples like simple random sampling, stratified sampling, and cluster sampling. Key steps in the sampling process including defining the population, determining the sample frame and size, and choosing a sampling method or procedure are also summarized.
Bibliometric Analysis Tutorial by Dayu (Tony) Jin
This slide deck presents practical ways to conduct bibliometric analysis. Based on my own experience in bibliometric analysis research, I wrote this step-by-step guide to help students and professionals who will use citation, co-citation, and bibliographic coupling techniques in their research. Visit my service research blog for more updates: http://servicesresearch.blogspot.com/
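As an illustration of the techniques the tutorial names, the sketch below computes co-citation and bibliographic coupling counts from a handful of made-up reference lists; the paper identifiers and counts are hypothetical and not taken from the tutorial itself.

```python
from itertools import combinations
from collections import Counter

# Hypothetical data: each citing paper mapped to the set of papers it references.
references = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"B", "C", "D"},
}

# Co-citation: two cited papers are related by the number of reference
# lists in which they appear together.
co_citation = Counter()
for refs in references.values():
    for pair in combinations(sorted(refs), 2):
        co_citation[pair] += 1

# Bibliographic coupling: two citing papers are related by the number of
# references they share.
coupling = Counter()
for p1, p2 in combinations(sorted(references), 2):
    coupling[(p1, p2)] = len(references[p1] & references[p2])

print(co_citation.most_common(2))  # e.g. [(('A', 'B'), 2), (('B', 'C'), 2)]
print(coupling[("P1", "P2")])      # 2 shared references
```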
The document provides information on various aspects of research methodology. It defines key terms such as research, theory, data types, data collection methods, research design and sampling. It discusses primary and secondary data sources and the advantages and limitations of each. Various data collection techniques for qualitative and quantitative data are also outlined.
Practice with PoP: How to use Publish or Perish EffectivelyAnne-Wil Harzing
Covers four key ways in which Publish or Perish can be used:
1. Search for an individual's citation metrics
2. Do a literature review
3. Prepare your case for tenure or promotion
4. Prepare for a meeting with your "academic hero"
Also covers the whys of citation analysis, different metrics and different databases, and shows how to use PoP's multi-query center.
This document discusses referencing styles including the Harvard and Oxford systems. It explains that direct quotes require quotation marks while paraphrases do not, but both need an in-text citation. The Harvard system uses an in-text citation with the author surname and date, and a reference list provides full details. The Oxford system uses footnotes instead of in-text citations that correspond to a bibliography.
The presentation discusses theses, research papers, review articles and technical reports: organization of theses and reports, formatting issues, citation methods, references, and effective oral presentation of research. It also covers quality indices of research publication: impact factor, immediacy factor, H-index and other citation indices. A verbal consent of Prof. Dr. C. B. Bhatt was obtained (at 4.15pm on 26-11-2016 at Hall A-2, GTU, Chandkheda) to share the presentation online for the benefit of the research scholar community.
The document provides an overview of how to conduct a literature review. It begins by defining a literature review as an interpretation and synthesis of published work on a topic. It then outlines the main reasons for conducting a literature review, including finding a research problem worth studying and contextualizing one's own research. The document discusses when a literature review should be conducted, primarily early on to establish context and confirm the research focus. It provides details on how to conduct a literature review through identifying topics, locating sources, reading, analyzing, and organizing the literature. The document also offers tips on how to present a literature review and concludes by listing additional resources for conducting literature reviews.
The document provides an overview of key aspects of research methodology. It discusses that research is a systematic, careful investigation aimed at establishing facts or principles. Some key characteristics of research outlined are that it must be controlled, rigorous, systematic, valid and verifiable. The research process involves formulating a research problem, designing the study, developing instruments, selecting samples, collecting and analyzing data, and reporting findings. Important steps include reviewing literature, identifying variables, developing hypotheses, writing a proposal, and considering ethical issues.
Topics for the class include multiple regression, dummy variables, interaction effects, hypothesis tests, and model diagnostics. Prerequisites include a general familiarity with Stata, including importing and managing datasets and data exploration, the linear regression model, and the ordinary least squares estimation.
Workshop materials including do files and example data sets are available from http://projects.iq.harvard.edu/rtc/event/regression-stata
This document provides an overview of key concepts in biostatistics and how to use SPSS software for data analysis. It discusses learning objectives for understanding biostatistics, different types of data (nominal, ordinal, interval, ratio) and variables (independent, dependent).
Data Analysis, Presentation and Interpretation of Data by Roqui Malijan
The document defines and describes various types of data analysis techniques:
- Descriptive statistics summarize and describe data through methods like frequency distributions and descriptive graphs.
- Bivariate analysis examines the relationship between two variables.
- Multivariate analysis studies more than two variables simultaneously.
- Comparative analysis examines similarities and differences between alternatives.
- Evaluation assesses subjects using defined criteria to aid decision making.
1) Citation metrics have evolved over time from bibliometrics in the 1960s to more recent metrics like altmetrics and webometrics. They are used to assess the influence of published research.
2) Key citation metrics include the journal impact factor, h-index, and article-level metrics like citation counts and altmetrics. Data sources include Web of Science, Scopus, and Google Scholar.
3) Citation indexing links cited and citing articles, allowing researchers to trace the development of ideas over time. Citation analysis helps understand why authors cite other works.
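To make the idea of citation indexing concrete, here is a minimal sketch of a citation index as a directed graph, walked backward from one paper to the earlier work it builds on. The paper labels and links are invented for illustration.

```python
# Hypothetical citation index: each citing paper -> the papers it cites.
cites = {
    "2015-review": ["2012-extension", "2009-method"],
    "2012-extension": ["2009-method"],
    "2009-method": ["1998-foundation"],
    "1998-foundation": [],
}

def ancestry(paper, index):
    """Return every earlier paper reachable by following citation links."""
    seen, stack = set(), [paper]
    while stack:
        for cited in index.get(stack.pop(), []):
            if cited not in seen:
                seen.add(cited)
                stack.append(cited)
    return seen

print(ancestry("2015-review", cites))
# The three earlier papers the review ultimately builds on: the lineage of ideas.
```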
Peer review is the most widely accepted model for setting a threshold of published scholarly material. With the move to digital publishing, it has come under attack with suggestions that it is 'broken', overloading reviewers and possibly no longer fit for purpose. This presentation discusses the challenges for peer review and some emerging new models. Ultimately, we may need to take a step back to ask what peer review is for and how these aims can best be achieved. Video available at https://www.youtube.com/watch?v=7YMla0Uc5ZE&x-yt-ts=1422327029&x-yt-cl=84838260
This document discusses data management and analysis for monitoring and evaluation. It covers topics such as data capture, data cleaning, data security, and data analysis. The objectives are to understand data management rules and roles, implement a data management system, and strengthen skills in data analysis and interpretation. Data capture methods include paper forms, databases, and personal digital assistants. Data cleaning involves checking for completeness, consistency, plausibility, duplicates, and outliers. Data security requires restricting access, backups, and anonymous storage. Data analysis turns raw data into useful information by answering questions through comparison, statistics, and interpretation.
This document provides an overview of key concepts related to literature reviews including definitions, processes, elements, and citation styles. It defines a literature review as a comprehensive analysis and synthesis of previous scholarly works on a topic used to provide context and identify gaps. The document outlines the main components of conducting a literature review such as searching literature databases, analyzing sources, and identifying themes. It also summarizes several common citation styles used in academic writing including APA, MLA, IEEE, and Chicago/Turabian styles.
This document provides an overview of bibliometrics and discusses various bibliometric indicators and tools. It describes what bibliometrics is, why it is used, and different bibliometric indicators like the impact factor, h-index, SNIP, SJR, and altmetrics. It discusses bibliometric data sources like Web of Science, Scopus, Google Scholar, and provides pros and cons of each. The document concludes that no single metric can provide a complete picture and that metrics should be used to improve research assessment rather than rely on a single number or tool.
This document discusses referencing styles and provides guidance on citing sources. It defines referencing and citing, and distinguishes between references and bibliographies. Reasons for referencing include acknowledging others' work, allowing readers to find sources, avoiding plagiarism, and adding credibility. The document reviews several referencing styles including APA, Chicago, and MLA styles. It provides examples of how to reference different source types such as books, journal articles, and websites. Referencing tools that can help manage citations are also introduced.
1. The document discusses selecting and formulating a research problem. It defines research as a process of observing phenomena repeatedly to collect data and draw conclusions.
2. A research problem is a question a researcher wants to answer or a problem they want to solve. It is the first step in the research process. Without a problem, research cannot proceed.
3. Formulating a research problem involves selecting a broad research topic, reviewing literature and theories, delimiting the topic to something more specific, evaluating the problem's significance and feasibility, and finally stating the problem in declarative or interrogative format.
Using Reference Management Tools: EndNote and Zotero by UCD Library
Presentation by Diarmuid Stokes, College Liaison Librarian, University College Dublin Library, to the Health Sciences Libraries Group (HSLG) 2014 Annual Conference on May 23, 2014 in Dublin, Ireland.
This document provides general guidelines for writing a literature review, including introducing the major research topic, selecting the most relevant research to the topic being studied, covering research related to all variables, and planning the structure of the literature review before writing. Key points are to introduce the topic, focus on the most pertinent studies, and ensure all aspects of the research question are addressed through the included studies. The review is meant to use prior work to establish the importance of the research question.
Impact factor (using impact factor to assess the impact of a journal) by shri mangalambikai
The impact factor (IF) is a measure of the frequency with which the average article in a journal has been cited in a particular year. It is used to measure the importance or rank of a journal by calculating how often its articles are cited.
Impact Factors are useful, but they should not be the only consideration when judging quality. Not all journals are tracked in the JCR database and, as a result, do not have impact factors. New journals must wait until they have a record of citations before even being considered for inclusion. The scientific worth of an individual article has nothing to do with the impact factor of a journal.
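As a rough illustration of how the standard two-year impact factor is computed (the figures below are made up): citations received in year Y to items published in years Y-1 and Y-2, divided by the number of citable items published in those two years.

```python
def two_year_impact_factor(cites_to_pub_year, items_in_pub_year, jcr_year):
    """Citations in jcr_year to the two prior years' items, per citable item."""
    cites = cites_to_pub_year[jcr_year - 1] + cites_to_pub_year[jcr_year - 2]
    items = items_in_pub_year[jcr_year - 1] + items_in_pub_year[jcr_year - 2]
    return cites / items

# Hypothetical journal: citations counted in 2023, keyed by publication year.
cites_received_in_2023 = {2021: 320, 2022: 410}
citable_items_published = {2021: 150, 2022: 160}

print(round(two_year_impact_factor(cites_received_in_2023,
                                   citable_items_published, 2023), 2))
# (410 + 320) / (160 + 150) = 2.35
```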
This document discusses various metrics used for assessing research outputs and impact. It describes journal impact factor, which measures the average number of citations to recent articles published in that journal. It also discusses author-level metrics like the h-index, g-index, and i10-index, which measure an individual researcher's productivity and citation impact. These metrics are useful for tasks like grant allocation, benchmarking, hiring, promotions, and reviewing faculty/departments. However, no single metric should be considered in isolation as results can vary depending on the database or time period used.
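For concreteness, here is a minimal sketch of two of the author-level metrics mentioned above, computed from a hypothetical list of per-paper citation counts: the h-index (the largest h such that h papers each have at least h citations) and the i10-index (the number of papers with at least 10 citations).

```python
def h_index(citations):
    """Largest h such that the author has h papers with >= h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for c in citations if c >= 10)

paper_citations = [44, 25, 12, 9, 6, 6, 3, 1, 0]  # made-up citation counts
print(h_index(paper_citations))    # 6: six papers have at least 6 citations
print(i10_index(paper_citations))  # 3: three papers have 10 or more citations
```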
This document provides an overview of various bibliometric tools and metrics for measuring scientific output and impact. It discusses journal ranking metrics like impact factor, eigenfactor, SNIP, and SJR. It also covers article-level metrics including F1000 factors and citation analysis tools from Google Scholar, Web of Science, and Scopus. Additionally, it introduces author-level metrics such as the h-index and its variants that can be calculated using various databases and tools. Finally, the document briefly discusses altmetrics and ways to track scholarly impact on social media and the open web.
The document discusses using Web of Science and related databases to strengthen research discovery, assessment, and identification of producers of research. It outlines how the databases can be used to discover more relevant papers, assess the impact and performance of articles, authors, journals and institutions, and improve author identification. The document provides examples and screenshots related to searching topics, analyzing citation metrics, and identifying highly cited research.
The document discusses various topics related to scientometrics including bibliometrics, informetrics, cybermetrics, scientometrics, altmetrics, and scientometric tools. It provides definitions and examples of each topic. For scientometric tools, it mentions citation mapping, visualization, bibliographic coupling, co-authorship networks, and co-word mapping. It also discusses the h-index and impact factor as important metrics for measuring research.
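As an illustration of one of the tools mentioned, co-authorship networks, the sketch below links every pair of authors who appear on the same paper and weights each tie by the number of joint papers; the author names and records are hypothetical.

```python
from itertools import combinations
from collections import Counter

# Hypothetical publication records: author lists per paper.
papers = [
    ["Garcia", "Chen", "Okafor"],
    ["Chen", "Okafor"],
    ["Garcia", "Singh"],
]

# Each undirected edge counts how many papers a pair of authors co-wrote.
edges = Counter()
for authors in papers:
    for pair in combinations(sorted(set(authors)), 2):
        edges[pair] += 1

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
# Chen -- Okafor appears twice; network tools such as Gephi or VOSviewer
# visualize the same structure at scale.
```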
Bibliometrics, Journal Impact Factors and Maximising the Cite-ability of Journal Articles by Jamie Bisset
Most recent version of slides from Durham "Bibliometrics, Journal Impact Factors and Maximising the Cite-ability of Journal Articles" session.. Delivered as part of the Durham University Researcher Development Programme.
[Last delivered November 2014]
Further Training available at https://www.dur.ac.uk/library/research/training/
This document provides an overview of various bibliometric products and metrics that can be used to measure research impact, including journal impact factor, h-index, citation counts, and journal/article ranking tools from Journal Citation Reports, Scopus, and Google Scholar. It discusses the purpose and calculations of metrics like impact factor, eigenfactor, and source normalized impact per paper (SNIP). It also covers limitations of bibliometrics and recommends using multiple metrics and tools to evaluate research. Exercises are provided to help understand how to analyze journals, articles, and individual researchers using different bibliometric resources.
The document discusses various citation databases and research metrics used to evaluate scholarly publications and researchers. It describes major citation databases like Web of Science, Scopus, and Google Scholar that compile citations from bibliographies. It also explains common research metrics like the Impact Factor, h-index, g-index, i10 Index, Cite Score, SJR, and SNIP used to measure the influence and impact of publications and researchers. These metrics are calculated based on factors like the number of citations a publication or researcher receives.
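One of the less familiar metrics in that list, the g-index, rewards highly cited papers more than the h-index does: it is the largest g such that the g most-cited papers together have at least g squared citations. A small sketch with invented counts:

```python
def g_index(citations):
    """Largest g such that the top g papers together have >= g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

paper_citations = [44, 25, 12, 9, 6, 6, 3, 1, 0]  # made-up citation counts
print(g_index(paper_citations))  # 9: the top 9 papers hold 106 >= 81 citations
```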
In the competitive landscape of academia, the visibility of your research is crucial. It not only reflects the impact of your work but also contributes to the advancement of your career.
How to prepare a research paper and its evaluation tools by Mohanapriya Suresh
This document provides guidance on preparing and structuring a research paper, including:
1. The general structure of a full research paper includes sections like the title, abstract, introduction, methods, results, discussion, and references.
2. Key elements that should be addressed in each section are described, such as keeping the title concise but informative, summarizing key findings in the abstract, and clearly explaining methodology.
3. Various tools for evaluating research impact are discussed, including journal indexing in databases like Web of Science, Scopus, and Google Scholar, as well as metrics like the h-index, g-index, and i10-index that measure citations and author productivity. Proper formatting of references is also addressed.
Identifying and understanding research impact:
A comprehensive suite of metrics embedded throughout Scopus is designed to help facilitate evaluation and provide a better view of your research interests. Whether you are looking for metrics at the journal, article or author level, Scopus combines its sophisticated analytical capabilities with its unbiased and broad content coverage to help you build valuable insights.
Here we look at:
Author level metrics
Journal metrics
Article level metrics
This slide deck aims to help and guide students on how to start searching the literature for a review through WOS and SCOPUS. The content is excerpted from various sources available on the internet. It is solely meant for educational purposes.
Durham Researcher Development Programme 2015-16: Bibliometric Research Indica... by Jamie Bisset
There is an ever-increasing need to make your research more visible as you establish your career, and to use metrics that measure your research performance when it comes to thinking about promotion and probation.
This session will focus on bibliometric research indicators (such as the Journal Impact Factor and SCImago, and author metrics such as the h-index and g-index) and sources for accessing citation data (Web of Science, Journal Citation Reports and Google Scholar). These may be among the factors to consider when deciding where to submit an article manuscript for publication to maximise the potential academic impact of the research, and they are useful tools to be familiar with if they form part of any research evaluation that you and your authored journal papers may be subject to.
An additional section will also look at tips to consider when writing an article abstract to maximise its discoverability and cite-ability.
Learning Outcomes:
• Understanding of meaning and intended uses of bibliometric research indicators
• Understanding of how some key indicators (JIF, H-index) are calculated
• Ability to make a judgement as to the appropriateness and limitations of such indicators
• Ability to use online datasets to view and calculate key bibliometric measures
• Awareness of some factors which can increase the visibility and discoverability of your own research in bibliographic databases.
Previous participants have said:
"The session has helped provide me with the basic information on Journal Impact and where to find information such as an author's h-index. It will be useful for future journal submission consideration."
"This session was very useful for me to become familiar with the topic."
This review demonstrates that using these websites can provide researchers with valuable sources of data and research, facilitating access to current literature and specialized scientific content. For optimal results, researchers are advised to diversify their sources and use multiple search engines based on need and specialization.
Citation Management Using Mendeley Software by Dave Marcial
This document discusses citation indexes and Mendeley software for citation management. It defines citation indexes as indexes of citations between publications that allow users to establish which later documents cite earlier works. It provides details on major citation indexes like the Science Citation Index and Scopus, as well as free sources like Google Scholar and CiteSeer. The document also discusses bibliometric indicators calculated from citation data, like the h-index, i10-index, and g-index. Finally, it summarizes Mendeley software, describing it as a citation management tool that allows users to organize references and papers into a personal or shared library, and includes social networking features to collaborate with other researchers.
This document provides an overview of digitization practices for scholarly publications and international databases. It discusses registering with the Egyptian Knowledge Bank and obtaining identifiers like ORCID and Researcher IDs. Metrics for evaluating impact like the h-index and journal impact factors are explained. The importance of optimizing discoverability of research outputs through inclusion in databases like Scopus, Web of Science, and Google Scholar is emphasized. The take-home message is that it is important for researchers to digitize their scholarly works and navigate the world of academic publishing.
The document discusses the establishment of a national knowledge bank in Egypt with the following key points:
1. The knowledge bank will have multiple portals for researchers, students, children, and the general public.
2. Registration will require identifying the user's access level and providing basic personal and institutional details.
3. Registered users can access databases of scientific resources, publications, and datasets through their designated portal.
Identifying journals for publication youtube by Dr. Chinchu C
The presentation is about how to be careful while selecting academic journals for publication.
Malayalam YouTube video based on this presentation is available at https://youtu.be/z5_LD7qqzbw
Content:
When to start searching for journals
General and Specialized Journals
Acceptance Rates
Journal Selection Tools
Journal Indexing
Web of Science
Scopus
Medline, PubMed, and PubMed Central
UGC CARE
Journal Metrics
Impact Factor
CiteScore
Checklist for Journal Selection
Predatory Journals
Cloned/Hijacked Journals
Some Useful Places
Similar to Overview of Bibliometrics - IAP Course version 1.1 (20)
Selecting efficient and reliable preservation strategies by Micah Altman
This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, hierarchical modeling, and then use empirically calibrated sensitivity analysis to identify effective strategies.
This discussion, convened by the Dubai Future Foundation, focuses on identifying the significance of the concept of well-being for social science and policy, and the opportunities to measure it at scale.
Matching Uses and Protections for Government Data Releases: Presentation at t... by Micah Altman
In the work included below, and presented at the Simons Institute, we describe work-in progress that aims to align emerging methods of data protections with research uses.
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019 by Micah Altman
Libraries enable patrons to access a wide range of information, but much of the access to this information is now directly managed by publishers. This has led to a significant gap across library values, patrons' perception of privacy, and effective privacy protection for access to digital resources.
In the work included below, and presented at NERCOMP 2019, we review privacy principles based on ALA, IFLA, and NISO policies. We then organize and compare the high-level privacy protections required by the ALA checklist, NISO, and GDPR. This framework of principles and controls is then used to score the privacy policies and practices of major vendors of research library content. We evaluate each element of the vendors' privacy policies, and use instrumented browsers to identify the types of tracking mechanisms used by different vendors. We use this set of privacy scores to support analyses of change over time, and of potential gaps between patron expectations and privacy policies and practices.
Presentation by Philip Cohen on collaborative work with Micah Altman as part of the MIT CREOS research talk series. Presented in fall 2018, in Cambridge, MA.
Contemporary journal peer review is beset by a range of problems. These include (a) long delay times to publication, during which time research is inaccessible; (b) weak incentives to conduct reviews, resulting in high refusal rates as the pace of journal publication increases; (c) quality control problems that produce both errors of commission (accepting erroneous work) and omission (passing over important work, especially null findings); (d) unknown levels of bias, affecting both who is asked to perform peer review and how reviewers treat authors; and (e) opacity in the process that impedes error correction and more systematic learning, and enables conflicts of interest to pass undetected. Proposed alternative practices attempt to address these concerns -- especially open peer review, and post-publication peer review. However, systemic solutions will require revisiting the functions of peer review in its institutional context.
The document discusses peer review reform and proposes a new system called IOTA (I Owe the Academy Review). IOTA would allow academics to donate review tokens that represent a pledge to conduct peer reviews. These tokens could then be granted to various peer review initiatives in exchange for conducting experiments and sharing results openly. The goal is to better match reviewers with projects promoting high-quality scholarship, while generating more empirical evidence on effective peer review methods. Several example scenarios for how IOTA could work with different types of journals or research pools are provided.
Redistricting in the US -- An Overview by Micah Altman
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
This document summarizes a presentation about the future of electoral redistricting. It discusses how open data, transparent criteria, public participation, and open-source software can improve redistricting. Key points include:
- Open data is needed for evaluating proposals, modifying plans, and auditing the process. Keeping data open requires coordination.
- Criteria should be based on social science, validated, and allow for reproducibility and robustness against small changes.
- Public participation exists on a continuum from information seeking to proposing alternatives. It can increase engagement, legitimacy and accountability.
- Independent institutions are important to support criteria, funding, transparency and public participation in the process.
- Open-source software can enable
A History of the Internet: Scott Bradner’s Program on Information Science Talk by Micah Altman
Scott Bradner is a Berkman Center affiliate who worked for 50 years at Harvard in the areas of computer programming, system management, networking, IT security, and identity management. Scott Bradner was involved in the design, operation and use of data networks at Harvard University from the early days of the ARPANET and served in many leadership roles in the IETF. He presented the talk recorded below, entitled A History of the Internet, as part of the Program on Information Science Brown Bag Series:
Bradner abstracted his talk as follows:
In a way the Russians caused the Internet. This talk will describe how that happened (hint it was not actually the Bomb) and follow the path that has led to the current Internet of (unpatchable) Things (the IoT) and the Surveillance Economy.
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS - Program ... by Micah Altman
The web is now firmly established as the primary communication and publication platform for sharing and accessing social and cultural materials. This networked world has created both opportunities and pitfalls for libraries and archives in their mission to preserve and provide ongoing access to knowledge. How can the affordances of the web be leveraged to drastically extend the plurality of representation in the archive? What challenges are imposed by the intrinsic ephemerality and mutability of online information? What methodological reorientations are demanded by the scale and dynamism of machine-generated cultural artifacts? This talk will explore the interplay of the web, contemporary historical records, and the programs, technologies, and approaches by which libraries and archives are working to extend their mission to preserve and provide access to the evidence of human activity in a world distinguished by the ubiquity of born-digital materials.
Information Science Brown Bag talks, hosted by the Program on Information Science, consists of regular discussions and brainstorming sessions on all aspects of information science and uses of information science and technology to assess and solve institutional, social and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info... by Micah Altman
Cassidy Sugimoto is Associate Professor in the School of Informatics and Computing, Indiana University Bloomington, who researches within the domain of scholarly communication and scientometrics, examining the formal and informal ways in which knowledge producers consume and disseminate scholarship. She presented this talk, entitled Labor And Reward In Science: Do Women Have An Equal Voice In Scholarly Communication? A Brown Bag With Cassidy Sugimoto, as part of the Program on Information Science Brown Bag Series.
Despite progress, gender disparities in science persist. Women remain underrepresented in the scientific workforce and under-rewarded for their contributions. This talk will examine multiple layers of gender disparities in science, triangulating data from scientometrics, surveys, and social media to provide a broader perspective on the gendered nature of scientific communication. The extent of gender disparities and the ways in which new media are changing these patterns will be discussed. The talk will end with a discussion of interventions, with a particular focus on the roles of libraries, publishers, and other actors in the scholarly ecosystem.
Utilizing VR and AR in the Library Space by Micah Altman
Matt Bernhardt is a web developer in the MIT libraries and a collaborator in our program. He presented this talk, entitled Reality Bytes - Utilizing VR and AR in The Library Space, as part of Program on Information Science Brown Bag Series.
Terms like "virtual reality" and "augmented reality" have existed for a long time. In recent years, thanks to products like Google Cardboard and games like Pokemon Go, an increasing number of people have gained first-hand experience with these once-exotic technologies. The MIT Libraries are no exception to this trend. The Program on Information Science has conducted enough experimentation that we would like to share what we have learned, and solicit ideas for further investigation.
For slides and comments see: http://informatics.mit.edu/blog
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots by Micah Altman
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA... by Micah Altman
Access to high-quality, relevant information is absolutely foundational for a quality education. Yet, so many schools across the developing world lack fundamental resources, like textbooks, libraries, electricity and Internet connectivity. The SolarSPELL (Solar Powered Educational Learning Library) is designed specifically to address these infrastructural challenges, by bringing relevant, digital educational content to offline, off-grid locations. SolarSPELL is a portable, ruggedized, solar-powered digital library that broadcasts a webpage with open-access educational content over an offline WiFi hotspot, content that is curated for a particular audience in a specified locality—in this case, for schoolchildren and teachers in remote locations. It is a hands-on, iteratively developed project that has involved undergraduate students in all facets and at every stage of development. This talk will examine the design, development, and deployment of a for-the-field technology that looks simple but has a quite complex background.
Laura Hosman is Assistant Professor at Arizona State University, holding a joint appointment in the School for the Future of Innovation in Society and in The Polytechnic School. Her work is action-oriented and focuses on the role for information and communications technology (ICT) in developing countries. Presently, she focuses on ICT-in-education projects, and brings her passion for experiential learning to the classroom by leading real-world-focused, project-based courses that have seen student-built technology deployed in schools in Haiti, Vanuatu, Micronesia, Samoa, and Tonga.
Information Science Brown Bag talks, hosted by the Program on Information Science, consists of regular discussions and brainstorming sessions on all aspects of information science and uses of information science and technology to assess and solve institutional, social and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
Making Decisions in a World Awash in Data: We’re going to need a different bo... by Micah Altman
In his abstract, Scriffignano summarizes as follows:
I explore some of the ways in which the massive availability of data is changing and the types of questions we must ask in the context of making business decisions. Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise. At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways. Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best. A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available. Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.” Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!). This session will cover three main themes: the new normal (how the data around us continues to change), how we are reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves). Ultimately, what we learn is governed as much by the data available as by the questions we ask. This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.
Software Repositories for Research -- An Environmental Scan by Micah Altman
This document provides a summary of the state of software curation based on an environmental scan of research software repositories and related practices. The summary finds:
1) There are no comprehensive indices of software archives and orders of magnitude fewer software archives than data archives. Institutional repositories offer little functionality for software archiving.
2) Very few funders have policies addressing software curation. There is little available advice for researchers who wish to curate, cite, and preserve software.
3) Substantial reproducibility failures continue to be reported due to a lack of software preservation. In summary, software curation looks a lot like data curation did a decade ago, with no universal standards for citing and archiving software.
The Open Access Network: Rebecca Kennison’s Talk for the MIT Program on Infor... by Micah Altman
Rebecca Kennison, who is the Principal of K|N Consultants, the co-founder of the Open Access Network, and was the founding director of the Center for Digital Research and Scholarship, gave this talk on Come Together Right Now: An Introduction To The Open Access Network as part of the Program on Information Science Brown Bag Series.
Gary Price, MIT Program on Information Science by Micah Altman
This document discusses maximizing the use of open web resources in libraries. It argues that libraries should better utilize free and openly available web content for research and users. However, curating and selecting quality resources from the vast amount on the open web presents challenges including the volume of content, lack of metadata, scalability, and ephemeral nature of some resources. The document outlines potential workflows for discovering, ingesting, reviewing, archiving, and sharing open web resources and suggests tools that can help with curation tasks. It also discusses the types of materials that could be curated from the open web like reports, datasets, digital collections, and videos.
2. Overview of Citation Analysis
Micah Altman
Director of Research
MIT Libraries
Prepared for
IAPril
MIT
April 2014
3. DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
4. Collaborators & Co-Conspirators
• Thanks To…
– Sean Thomas, Program Manager for Scholarly Repository Services and the Product Manager of DSpace@MIT
– Michael Noga
– Peter Cohn
– Courtney Crummett
5. Related Work
• K. Smith-Yoshimura, et al., 2014, Registering Researchers in Authority Files, OCLC Research.
• Liz Allen, Jo Scott, Amy Brand, Marjorie M.K. Hlava, Micah Altman, 2014, Beyond authorship: recognising the contributions to research; Nature.
• Data Synthesis Task Group, 2014, Joint Principles for Data Citation.
• CODATA Data Citation Task Group, 2013, Out of Cite, Out of Mind: The Current State of Practice, Policy and Technology for Data Citation. Data Science Journal. 2013;12:1–75.
Slides and reprints available from:
informatics.mit.edu
6. The MIT libraries provide support for all researchers at MIT:
• Research consulting, including:
bibliographic information management; literature searches; subject-specific
consultation
• Data management, including:
data management plan consulting; data archiving; metadata creation
• Data acquisition and analysis, including:
database licensing; statistical software training; GIS consulting, analysis & data
collection
• Scholarly publishing:
open access publication & licensing
libraries.mit.edu
7. Roadmap
* Background *
* Metrics *
* Data *
* Tools *
* Data Processing *
* Hacking *
* Resources *
9. What are bibliometrics?
(simple definition)
Bibliometrics are measures
of scholarly outputs.
10. Scholarly output affects the reputation, ranking, and funding of the discipline, the institution, and the individual scholar
We initially use bibliometric analysis to look at
the top institutions, by publications and
citation count for the past ten years…
Universities are ranked by several
indicators of academic or
research performance, including…
highly cited researchers…
Citations… are the best understood and most
widely accepted measure of research strength.
11. Then
Clarke, Beverly L. "Multiple authorship trends in scientific papers." Science 143.3608 (1964): 822-824.
14. What are bibliometrics?
(Extended Definition)
• Analysis of
characteristics of/relationships among
research/scholarly
outputs/publications
– Analysis includes:
lists, descriptive statistics, visualization, inference
– Outputs include:
grants, articles, books, databases, software,
patents
15. Which questions are bibliometrics being
used to answer?
Some examples:
• What are the most influential journals in a
particular field?
• How influential is this scholar?
• Where is interdisciplinary research occurring?
• Which groups of people effectively collaborate?
• Which institutions are using funding most
productively?
16. Overview of Citation Analysis
Data
(Leading Databases)
(Subject-Specific)
(MIT Internal)
(Selection)
17. Google Scholar
Data Sources
• Unspecified coverage, but…
• Wide coverage of books, preprints, conference proceedings, non-English work, working papers, patents, institutional repositories
Built-in Metrics
• Journal H-Index
• Author Profiles
– Total & Five-Year Counts
– i10-index and h-index
– Yearly citations
• Limited filtering
scholar.google.com
18. Data
• Frequently updated/current
• Covers journal articles
published after 1995
• Wide disciplinary coverage
• Includes theses and patents,
and citations from these
• Includes some institutional
repositories
• Commercial
Metrics
• Citation lists & counts
• Author impact & articles
– Statistics
– Metrics
– Graphs
• Journal impact
– Statistics
– Metrics
– Graphs
scopus.com
Scopus
19. Data
• Journal coverage after 1899
• Many conference proceedings
since 1990
• Many books since 2005
• Limited coverage of non-English works
• Doesn’t index institutional
repositories and e-print
servers
• Commercial
Metrics
• Citation lists & counts
• Author impact & articles
– Statistics
– Metrics
– Graphs
• Journal impact
– Statistics
– Metrics
– Graphs
apps.webofknowledge.com
Web of Science
20. Major Subject Specific Catalogs With Citation Metrics
• SciFinder:
chemical abstracts
scifinder.cas.org
• PsycInfo:
psychological literature
www.apa.org/pubs/databases/psycinfo/
• Business Source Complete:
business articles
www.ebscohost.com/academic/business-source-complete
• arXiv:
physics, mathematics, nonlinear sciences,
computer science, quantitative biology,
quantitative finance, statistics (Integrates
w/NASA-ADS and INSPIRE)
arxiv.org
• mathSciNet
Mathematical Reviews. Computes
collaboration distances.
www.ams.org/mathscinet/
• IEEE Digital Library
content published by the IEEE including
citing references
• USPTO:
find patents that are cited by/cite others
uspto.gov/patft/
• ACM Digital Library:
Full text and citations of ACM articles and proceedings
dl.acm.org
VERA: owens.mit.edu/sfx_local/az/mit_db
21. APIs for Scholarly Resources
What are APIs?
• Application programming interfaces (APIs) are tools used to expose raw data, query interfaces, or other functions to other software applications (a minimal query sketch follows the table below)
• Typically more flexible than interactive interfaces
Challenges
• Requires programming
• Requires data manipulation and
reorganization
• Variety of interfaces, coverage,
results and terms of service
libguides.mit.edu/apis
API – What it does
• arXiv API: Gives programmatic access to all of the arXiv data, search, and linking facilities.
• BioMed Central API: Retrieves: 1) BMC latest articles; 2) BMC editors' picks; 3) data on article subscription and access; 4) bibliographic search data.
• DVN (Dataverse Network) API for Data Sharing: Allows programmatic access to data and metadata in the Dataverse Network, which includes the Harvard Dataverse Network, MIT Libraries-purchased data, and data deposited in other Dataverse Network repositories. Two modules exist: Metadata/Search and Data Access.
• Digital Public Library of America (DPLA) API: Allows programmatic access to metadata in DPLA collections, including partner data from Harvard, New York Public Library, ARTstor, and others.
• IEEE Xplore XML Search API: Allows IEEE customers and third parties such as federated search vendors to query the IEEE Xplore content repository and retrieve results for manipulation and presentation on local web interfaces.
• JSTOR Data for Research: Not a true API, but allows computational analysis and selection of JSTOR's scholarly journal and primary resource collections. Includes tools for faceted searching and filtering, text analysis, topic modeling, data extraction, and visualization.
• Nature Blogs API: Blog tracking and indexing service; tracks Nature blogs and other third-party science blogs.
• Nature OpenSearch API: Bibliographic search service for Nature content.
• NLM APIs: NLM offers 21 different APIs for accessing various NLM databases.
• ORCID API: Queries and searches the ORCID researcher identifier system and obtains researcher profile data.
• PLoS Article-Level Metrics API: Retrieves article-level metrics (including usage statistics, citation counts, and social networking activity) for articles published in PLOS journals and articles added to PLOS Hubs: Biodiversity.
• PLoS Search API: Allows PLoS content to be queried using the 23 terms in the PLoS search, for integration into web, desktop, or mobile applications.
• PubMed E-Utilities API: Set of 8 server-side programs for searching 38 NCBI Entrez databases of biomedical literature and data.
• Scopus Integration: The Scopus Document Search API displays search results on a website. The Scopus Cited-By Count API displays the cited-by count for a publication as an image on a web site.
• Springer Images API: Provides images and related text for over 300,000 free images available on Springer Images.
• Springer Metadata API: Provides metadata for over 5 million online documents (e.g. journal articles, book chapters, protocols).
• Springer Open Access API: Provides metadata, full-text content, and images for over 80,000 open access articles from BioMed Central and SpringerOpen journals.
• STAT!Ref OpenSearch API: Bibliographic search service for displaying syndicated results on a website.
• Web of Science Web Services: Bibliographic search service. Allows automatic, real-time querying of records. Primarily for populating an institutional repository.
• World Bank Indicators API: Provides access to nine World Bank statistical databases.
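Any of the HTTP-based APIs above can be exercised with a few lines of code. The following is a minimal sketch using only the standard library and NCBI's documented E-Utilities esearch endpoint and parameters (db, term, retmode, retmax); the search term itself is purely illustrative:

import json
import urllib.parse
import urllib.request

# NCBI E-Utilities "esearch": returns the IDs of records matching a query
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",                        # which Entrez database to search
    "term": "citation analysis[Title]",    # illustrative query
    "retmode": "json",                     # ask for JSON rather than XML
    "retmax": 5,                           # number of record IDs to return
}

with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
    result = json.load(resp)["esearchresult"]

print("Total matching records:", result["count"])
print("First PubMed IDs:", result["idlist"])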
22. Full (Text) Monte APIs
Why use Full Text?
• Named-entity extraction for relationship analysis: e.g. acknowledged people, funders, grants, institutions, software, datasets
• Named-entity extraction for subject analysis: e.g. locations, reagents, genes, chemicals
• Topic extraction and clustering
Primary Sources for Full Text
• CrossRef TDM API: tdmsupport.crossref.org/researchers/
• arXiv bulk data: arxiv.org/help/bulk_data_s3
• PLOS API: api.plos.org
• PubMed API:
– www.ncbi.nlm.nih.gov/pmc/tools/oai/
– www.ncbi.nlm.nih.gov/books/NBK25501/
• OpenAIRE: api.openaire.eu
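As a second hedged sketch, the PLOS API listed above (api.plos.org) can be queried for article metadata. This assumes the service accepts Solr-style parameters (q, fl, rows, wt=json) and returns a Solr-shaped response; adjust field names if the live API differs:

import json
import urllib.parse
import urllib.request

params = {
    "q": 'abstract:"citation analysis"',          # fielded query (illustrative)
    "fl": "id,title,journal,publication_date",    # fields to return
    "rows": 5,
    "wt": "json",
}
url = "https://api.plos.org/search?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    docs = json.load(resp)["response"]["docs"]

for doc in docs:
    print(doc.get("publication_date"), "-", doc.get("title"))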
23. Using APIs
Choosing tools
• Recommended: Python or R
• Many resources, such as PubMed, Dataverse, and arXiv, are accessible through the OAI-PMH protocol
• More in the Tools and Resources sections
Example: Harvesting arXiv with pyoai
from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry
from lxml import etree

URL = 'http://export.arxiv.org/oai2'

registry = MetadataRegistry()

class Reader(object):
    def __call__(self, element):
        # Serialize the raw Dublin Core metadata element as a UTF-8 string
        return etree.tostring(element, pretty_print=True, encoding='UTF8')

registry.registerReader('oai_dc', Reader())
client = Client(URL, registry)

# Each record is a (header, metadata, about) tuple
for count, record in enumerate(client.listRecords(metadataPrefix='oai_dc')):
    header = record[0]
    metadata = record[1] or ''
    print(header.identifier())
    print(metadata)
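In practice a full harvest is rarely needed. Continuing the snippet above, OAI-PMH supports bounded, set-scoped harvests; pyoai is assumed here to expose the protocol's from/until/set arguments as from_, until, and set keywords (check the library's documentation if the names differ):

from datetime import datetime

# Harvest only January 2015 records from one arXiv set (e.g. 'cs'), then stop early
for count, (header, metadata, about) in enumerate(
        client.listRecords(metadataPrefix='oai_dc',
                           from_=datetime(2015, 1, 1),
                           until=datetime(2015, 1, 31),
                           set='cs')):
    print(header.identifier())
    if count >= 10:   # just a quick test: stop after a handful of records
        break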
24. MIT Internal Data
Institute Data
(Restricted Use)
• IS&T DataWarehouse
Data from administrative systems. E.g.
MIT people, organizations, grants and
awards
ist.mit.edu/warehouse
• Office of the Provost – Institutional
Research
Provides analytical and research
support to the Provost, academic
departments, research laboratories and
centers.
web.mit.edu/ir/
Libraries Data
• DSpace@MIT
lists of publications in Dspace
by author/department
dspace.mit.edu
• Barton
lists of MIT theses by author/advisor
library.mit.edu
25. Comparing Databases
Coverage
• Years
• Disciplines
• Publishers/sources
• Venue – journals/conferences/working papers/IR/personal web sites
• Documentation of coverage
• Completeness
Characteristics
• Internal vs. external
• Free vs. fee-based
• API vs. interactive
• Open data vs. restrictively licensed
• Structured vs. unstructured
• Full text vs. metadata
26. Selecting a Database
• Free, quick, and useful → Google Scholar
• Extract data for further simple analysis → Scholarometer (Google Scholar extract), Scopus, WOS
• More complete coverage → use multiple databases
• Specialized subject / single article → disciplinary database / API
• Extract data for network analysis or text mining → API
Free & Easy → $$ and/or programmatic
28. Article Metrics: Overview
What are article-level metrics?
• Measures on specific
published articles
• Typically used in
construction of literature
reviews; or as building
blocks for other measures
Common measures
• Citations list
• Citation counts
• References
• Captures/bookmarks
• Downloads
• Mentions
• Likes
• Views
• Readers
sparc.arl.org/sites/default/files/sparc-alm-primer.pdf
29. Article Metrics: Using Google Scholar
Steps
1. Go to scholar.google.com
2. Search (Full Text + Metadata)
– Unstructured keyword search
OR
– “Advanced” fielded search
3. Sort
– by relevance
OR
– by date
4. Filter
– By Date range AND/OR
– By Corpus (case law, patents)
Results
• Number of citations to the article indexed in Google Scholar
• List of citing articles
• Article text (sometimes)
32. Article Metrics: Database Comparison
• Google Scholar / Scopus / WOS — Coverage: wide variety. Measures: citation count, citation list.
• PLOS (PLOS articles only) — Coverage: PLOS articles. Measures: citation count, citation list, views, downloads, mentions, bookmarks, comments.
• PlumX — Coverage: wide variety. Measures: citation count, citation list, views, downloads, mentions, bookmarks, comments.
33. ‘Impact’ Factors: Overview
What are impact factors?
• Descriptive statistics
• Usually based on citations
• Commonly treated as a
proxy for the level of
influence of an article,
person, or journal
Common measures
• ISI Journal Impact Factor:
The frequency with which the “average
article” has been cited in a particular year. It
is based on the most recent two years of
citations. It is only supplied for journals
indexed by ISI in the Web of Science.
• Article Citation Count:
Total number of citations received from other
articles to target article.
• H-Index:
The maximum number of articles h such that
each has received at least h citations
libraries.mit.edu/scholarly/publishing/impact-factors/
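The h-index definition above translates directly into code. A minimal sketch (the citation counts are invented):

def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank        # the top `rank` papers all have >= rank citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3, 0]))   # -> 4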
37. Author Impact: Database Comparison
• Google Scholar — Select any author: only w/profiles. Export data: no. Exclude articles: no. Metrics: h-index, i10, num. cites. Visualization: minimal. Longitudinal: minimal.
• Scholar + Scholarometer — Select any author: yes. Export data: yes. Exclude articles: yes. Metrics: h-index, i10, num. cites. Visualization: minimal. Longitudinal: minimal.
• Scopus — Select any author: yes. Export data: yes. Exclude articles: yes. Metrics: h-index, … Visualization: yes. Longitudinal: yes.
• Web of Science — Select any author: yes. Export data: yes. Exclude articles: yes. Metrics: h-index. Visualization: yes. Longitudinal: yes.
38. Journal Impact: Using Online Services
Google Scholar
1. Go to scholar.google.com
2. Click on METRICS
3. Google rank and journal h5-index displayed
4. Filter by country & field
Scopus
• Go to scopus.com
• Click on Journal Analyzer
• Select journal
• Select statistics
Web of Science
1. Go to admin-apps.webofknowledge.com/JCR/
2. Select field and year + SUBMIT
3. Select subject + SUBMIT
42. Journal Impact: Database Comparison
• Google Scholar — Journals covered: top 100 ranked in each language. Metrics: h5-median. Visualization: no. Longitudinal analysis: no. Discipline rankings: no.
• Scopus — Journals covered: mostly English-language. Metrics: many. Visualization: yes. Longitudinal analysis: yes. Discipline rankings: no.
• Web of Science — Journals covered: many (selected) journals. Metrics: impact factor, many others. Visualization: yes. Longitudinal analysis: yes. Discipline rankings: yes.
43. Network Analysis
What is network analysis?
• Study of objects and
interactions modeled as an
induced network (or graph)
• Units of observation form
nodes
• Relationships form edges
Common measures
• Community detection
– Modularity
– Clustering
– Clique
• Centrality
– Betweenness
– Degree
– Closeness
• Diameter
• Visualization
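These network measures are easy to try on a toy citation graph. A hedged sketch using NetworkX (networkx.github.io, listed later under command-line tools); the five-node graph is invented:

import networkx as nx
from networkx.algorithms import community

# Directed citation edges: "A" cites "B", and so on
G = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "E")])

print(nx.degree_centrality(G))         # normalized number of connections per node
print(nx.betweenness_centrality(G))    # how often a node sits on shortest paths
print(nx.closeness_centrality(G))      # inverse average distance to other nodes
print(nx.diameter(G.to_undirected()))  # greatest distance between any two nodes

# Simple community detection on the undirected version of the graph
print(list(community.greedy_modularity_communities(G.to_undirected())))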
45. Network Analysis: Example – CitNetExplorer
1. Use WOS to locate records
2. Add records to “marked list”
3. Click “marked list”
4. Check “cited references”
5. Save to other file formats
6. Select Windows tab-delimited
7. Open in CitNetExplorer
% cut -d"," -f 1-11 citations.CSV > areastudies2003.csv
R> areastudies.df <- read.table(file="citations.CSV", row.names=NULL,
       sep=",", quote="", stringsAsFactors=FALSE, header=TRUE)
R> authorList <- strsplit(areastudies.df$author, perl=TRUE, split="\t")
R> plot(table(sapply(authorList, length)))
CoAuthorship Analysis Example – Using R and JSTOR – Part 2
createCoauthorlist <- function(pl) {
  coauthors <- list()
  updateCoauthor <- function(co, paperAuthors) {
    tmp <- unlist(coauthors[co])
    tmp <- union(tmp, unlist(paperAuthors))
    coauthors[[co]] <<- tmp
  }
  sapply(pl, function(x) sapply(x, function(y) updateCoauthor(y, x)))
  return(coauthors)
}
CoAuthorship Analysis Example – Using R and JSTOR – Part 3
R> coa <- createCoauthorlist(authorList)
R> plot(table(sapply(coa, length)))
Note: results are biased downward if a sample of records is used!
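For readers who prefer Python, the same coauthor-set construction can be sketched with the standard library; author_list stands in for the per-paper author lists built earlier, and unlike the R version each author's own name is excluded from their coauthor set:

from collections import Counter, defaultdict

author_list = [["Smith", "Lee"], ["Smith", "Garcia", "Chen"], ["Lee"]]   # made-up data

coauthors = defaultdict(set)
for paper_authors in author_list:
    for author in paper_authors:
        coauthors[author].update(a for a in paper_authors if a != author)

# Distribution of the number of distinct coauthors per author
print(Counter(len(partners) for partners in coauthors.values()))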
50. Limitations
Limitations of data
• Citation differs systematically from
sharing, reading, or ‘use’
• Relationships signaled by citation are heterogeneous: citations may indicate evidentiary support, definitions, disagreement, kudos, …
• Cited objects are heterogeneous – e.g. journals include letters, comments, reviews, and original research
• Databases may have limited or
inconsistent coverage of publishers,
fields, years, or types of publications
(e.g. conference proceedings), types of
objects (databases, software, books,
articles)
• Some types of objects are often used
without being cited
Limitations of measures
• Most measures are vulnerable
to self-citation and other sorts
of manipulation
• Most measures are descriptive
estimates – they are not
forecasting or causal
inferences
• Few studies of the external
validity of measures
• Few studies on error and bias
in estimators
52. Built-in Tools
• Database portals have built-in tools:
Google Scholar; Scholarometer; Web of Science …
• Typical restrictions of built-in tools
– Single database
– Number of records
– Usually single-author/single journal metrics
– Lacks statistical forecasting/causal models
– Limited data-cleaning options
– Simple visualizations
53. External Tools
Feature sets
• Data retrieval
• Data processing
(next section)
• Core statistics
• Visualization
• Exploratory network
analysis
• Network modeling
Choosing a tool
• Open vs. closed source
• Free vs. commercial
• GUI vs. CLI
• Scalability
• Single Platform/Multi-
Platform
• Feature Set
• Maintenance/support
54. Publish or Perish
• Automatic data retrieval
– MS Academic Search
– Google Scholar
• Standard single-author
metrics
– Total number of papers and
total number of citations
– Average citations per paper,
citations per author, papers
per author, and citations per
year
– Hirsch's h-index and related
parameters and variations
• Data export to CSV
www.harzing.com/pop.htm
55. Scholarometer
Data
• Google Scholar
• Crowd-sourced tags (disciplines) – available through API
• Data export to CSV
Metrics
• Single/combined author
citation count/h-index rank
• Discipline rank
• Author network
visualization
• Discipline network
visualization
scholarometer.indiana.edu
56. Pajek
Analysis
• Network visualization
• Supports complex networks:
multi-relational,
longitudinal, 2-mode
• Layout control
• Clustering
• Community detection
pajek.imfm.si
Source: www.public.asu.edu/~majansse/pubs/SupplementIHDP.htm
57. CitNetExplorer
Features
• Citation/bibliometric specific
tool
• Web of Science import.
• Pajek export.
• Large networks (millions of publications)
• Simple network visualizations
• Network measures: connected
components, clusters, core
publications …
citnetexplorer.nl
58. CiteSpace
Features
• Citation/bibliometric tool
• Import from WOS, arXiv, NSF, ADS, PubMed
• Export to CSV, GraphML, Pajek
• Time slicing
• Network measures: connected
components, clusters, core
publications …
• Topic clustering
cluster.cis.drexel.edu/~cchen/citespace
59. SciMat
Features
• Workflow support
• Network visualization
• Data processing and
cleanup
• Longitudinal analysis
• Metrics: h-index
sci2s.ugr.es/scimat/
61. Sci2Tool
Analysis and Visualization
• Temporal – burst detection
• Geospatial
• Topical
• Networks – trees and
graphs
Additional Benefits
• Parsers for citation data
• Bibliometric analysis tools
• Portable output files
• Direct connections to R and
Gephi
http://sci2.cns.iu.edu
62. Command-Line Tools
Using Python
• Scipy:
scientific data processing, statistics, visualization
scipy.org
• NLTK:
text processing and analysis
nltk.org
• NetworkX:
network measures (descriptive)
networkx.github.io
• Bibtools:
parse WOS data and identify communities of co-citation
www.sebastian-grauwin.com/?page_id=492
• PythonOAI:
retrieve bibliographic metadata from OAI sources,
such as arXiv
pypi.python.org/pypi/pyoai/
Using R
• tm:
simple text processing and analysis
cran.r-project.org/web/packages/tm/
• StatNet:
network measures (descriptive); social network
analysis (forecasting, causal); visualization
statnet.org
• Citan:
citation analysis
cran.r-project.org/web/packages/CITAN
• Rplos:
retrieve citation data from PLOS
http://cran.r-project.org/web/packages/rplos/
• Rmendeley
retrieve citation data from Mendeley
http://ropensci.org/packages/rmendeley.html
• RISmed
retrieve data from NCBI
http://cran.r-project.org/web/packages/RISmed/index.html
• OAIHarvester
retrieve data from OAI-PMH Sources
cran.r-project.org/web/packages/OAIHarvester/
Web integration for interactive visualization: d3js.org
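As a small illustration of the text-processing end of this toolbox, a hedged NLTK sketch (the titles are made up; wordpunct_tokenize is used because it needs no downloaded models):

import nltk
from nltk.tokenize import wordpunct_tokenize

titles = [
    "Citation analysis of open access journals",
    "Network analysis of citation data",
]

# Lowercase word tokens, keeping alphabetic tokens only
tokens = [w.lower() for t in titles for w in wordpunct_tokenize(t) if w.isalpha()]
print(nltk.FreqDist(tokens).most_common(5))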
63. Characteristics of Tools
• Built-in vs. external
• Free vs. fee-based
• Command line vs. interactive
• Open source vs. closed source
• Domain
– Data extraction, retrieval, integration
– Data cleaning and manipulation
– Network visualization
– Advanced measures
– Statistical analysis
64. Choosing tools
• Simple standard impact → built-in database tools; Publish or Perish; Scholarometer
• Messy data → OpenRefine + …
• Network analysis measures
– Network measures → Sci2, SciMat, Pajek
– Visualizations → Gephi, Pajek, CitNetExplorer, SciMat
• Need to estimate complex statistical (predictive, causal) models → R
• Need maximum software flexibility, integration with other software → Python
Quick Start
Power Tools
65. Overview of Citation Analysis
Data Processing
(reorganizing data)
(cleaning data)
(matching names)
66. Open Refine
• Spreadsheet/database combination
– Ease of use of spreadsheets
– Reporting and manipulative power of databases
• Filters, facets, and clustering
– Allow granular overview of what’s in your data
– Easily see occurrence distribution of values
– Easily make global corrections
• Supports both row-level and record-level
(multi-row) operations
openrefine.org
67. Open Refine – Reorganize Data
Reorganizing Data
• Splitting/joining multi-
valued cells
• Transposing rows/columns
• Supports logic-based
transformation
– Google Refine Expression
Language (GREL)
– Clojure
– Jython
openrefine.org
68. Open Refine – Cleaning Data
Cleaning Data
• Duplicate detection
• Common data
transformations
– Trimming whitespace
– Normalizing text case
• Cluster/edit for matching
and normalization
Additional Benefits
• Perform mass edits
efficiently
• Revision history allows for
roll-back to earlier state
• Transformations recorded
as JSON
– Portable for future data sets
• Browser-based
openrefine.org
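Beyond the built-in transformations, OpenRefine also accepts custom cell transforms. A hedged example written as a Jython expression (Edit cells → Transform…, language "Python / Jython"): OpenRefine passes the current cell in as value, and the expression returns the cleaned value. Note this is an expression for the tool, not a standalone script:

# Trim whitespace and normalize case in the selected column
if value is None:
    return None
return value.strip().title()   # e.g. "  harvard UNIVERSITY " -> "Harvard University"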
69. Open Refine – Matching Names
Matching names
• Create filters to navigate
larger datasets
• Create facets to see all
unique values/occurrences
• Auto-detect variant entries
• Cluster/edit for matching
and normalization
• Reconciliation services
against external data for
normalization/aggregation
openrefine.org
71. Matching Names – Author Identifiers
What are Author Identifiers?
• Author identifiers give you a way to reliably and unambiguously connect your name(s) with your work throughout your career, including your papers, data, biographical information, etc. This can be helpful in a number of ways:
• Provides a means to distinguish
between you and other authors with
identical or similar names.
• Links together all of your works even if
you have used different names over the
course of your career.
• Makes it easy for others (grant funders,
other researchers etc.) to find your
research output.
• Ensures that your work is clearly
attributed to you.
Getting started with ORCID...
• ORCID (Open Researcher and Contributor ID) is a non-proprietary, non-profit, community-based registry of researcher identifiers.
• Links authors to their datasets and other
works in addition to articles.
• Authors can control what information in their
ORCID profile they share. Only the ORCID ID
is automatically shared. (See their privacy
policy.)
• It is easy to import research output from other sources (including ResearcherID, Scopus Author ID, and the DataCite Metadata Store) to your ORCID profile. (See ORCID's import works page.)
• Many organizations and publishers have
created integrations with ORCID including
Nature Publishing Group, Elsevier, and the
American Physical Society.
• Free, private, 30-second registration:
orcid.org/register
libguides.mit.edu/content.php?pid=573578&sid=4729602
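Because ORCID also exposes a public API (listed in the API table earlier), profile data can be pulled programmatically. A minimal sketch, assuming the public v3.0 endpoint at pub.orcid.org and the v3.0 record field names; the iD used is ORCID's own sample identifier:

import json
import urllib.request

orcid_id = "0000-0002-1825-0097"   # ORCID's documented example iD
req = urllib.request.Request(
    "https://pub.orcid.org/v3.0/" + orcid_id + "/record",
    headers={"Accept": "application/json"},   # ask for JSON rather than XML
)
with urllib.request.urlopen(req) as resp:
    record = json.load(resp)

name = record["person"]["name"]
print(name["given-names"]["value"], name["family-name"]["value"])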
73. Sharing, Creativity, Collaboration,
Clarity Associated with Citation Impact
• Collaboration/team science increases impact
• Open access associated with substantially higher citations
• Self-citation in moderation is associated with reinforced impact
• Sharing data is associated with higher citation rates
• Publishing regularly is associated with much higher impact
• Citation measures only one type of use
– you can collect evidence and measure others
• Use clear titles and meaningful keywords and abstracts
• Mainstream social media, especially Twitter, can indicate broader use
• Creativity matters
74. Not-so-positive findings
[Image: Dan Shechtman's lab notebook providing initial evidence of quasicrystals]
• Null results are less likely to be submitted and published → submit all your results
• Publication bias leads to overestimates of effects/significance in many fields
• Many data sharing and replication policies are not followed → share even when you are not forced to
• Good science may not pass peer review → be persistent
• Much research is not replicable → make yours replicable
• Many publications are not cited
• Multidisciplinary work is less cited
• Edited volumes are not well cited → think carefully about publication venue and the significance of research
• Retraction rates in scientific journals have substantially increased
• Author order is overemphasized in evaluation → discuss authorship early; use other ways of describing contributions and distributing credit
• Delays in peer review and publishing are frequent, and important → track your submissions, and politely but actively manage delays
• Not enough time spent on research → develop a research habit, and build research into your schedule
75. From 10 Simple Rules …
Graduate Students
• Share your scientific success with the world
Postdoctoral Positions
• Negotiate first authorship before you start.
Getting Published
• If you do not write well in the English
language, take lessons early
• Become a reviewer early in your career.
• Decide early on where to try to publish
your paper.
• Quality (of journals) is everything.
Building Reputation
• Think Before You Act
• Do not ignore criticism
• Do not ignore people
• Diligently check everything you publish
• Always declare conflicts of interest
• Do your share for the community
• Do not commit to tasks you cannot
complete
• Do not write poor reviews
• Do not write references for people who do
not deserve it
• Never plagiarize, or doctor your data
Bourne, Philip E. "Ten simple rules for getting published." PLoS computational biology 1, no. 5 (2005): e57.; Gu, Jenny, and
Philip E. Bourne. "Ten simple rules for graduate students." PLoS computational biology 3.11 (2007): e229.; Bourne, Philip E.,
and Virginia Barbour. "Ten simple rules for building and maintaining a scientific reputation." PLoS computational biology 7,
no. 6 (2011): e1002108. Bourne, Philip E., and Iddo Friedberg. "Ten simple rules for selecting a postdoctoral position." PLoS
computational biology 2, no. 11 (2006): e121.
76. Self Experimentation*: 10 Simple Steps
Identify yourself – register for:
1. An identifier – ORCID
2. Information hubs: ORCID; LinkedIn; your own domain name forwarded to LinkedIn; SlideShare
3. Communication channels: Twitter, LinkedIn
Describe yourself
4. Write and share a 1-paragraph bio
5. Describe your research program in 2 paragraphs
6. Create a CV
[Post these on your LinkedIn and ORCID profiles]
Share
7. Share (on Twitter & LinkedIn) news about something you did or published; an upcoming event in which you will participate; interesting news and publications in your field
8. Make writing, data, publications, and software available as open access (through your institutional repository, SlideShare, FigShare, Dataverse)
Monitor
…check and record these things regularly, but not too frequently (once a month) – and no need to react or adjust immediately
9. Set up tracking – Google Scholar, Google Alerts
10. Find your Klout score, h-index
*Question: How do you tell an extroverted researcher? Answer: When she talks, she looks down at your shoes.
78. Recommended Reading
• Data Processing - General
– Getting Started:
programminghistorian.org/lessons/cleaning-data-with-openrefine
– References:
Verborgh, Ruben, and Max De Wilde. Using OpenRefine. Packt Publishing Ltd,
2013.
– Tutorials:
github.com/OpenRefine/OpenRefine/wiki/External-Resources
• Data Processing – Dealing with Names
– Getting Started -- author identifiers guide:
libguides.mit.edu/content.php?pid=573578&sid=4729602
– References:
Winkler 2012, Name Matching and Record Linkages, U.S. Census. http://www.census.gov/srd/papers/pdf/rr93-8.pdf
79. Recommended Reading (Continued)
• Bibliometric Analysis
– Tutorials:
Anne-Wil Harzing, 2011, The Publish or Perish Book, Part 3: Doing bibliometric research with Google Scholar, Tarma Software Press
Wouter de Nooy, et al., 2011, Exploratory Social Network Analysis with Pajek, 2nd Edition, Cambridge University Press
author identifiers guide:
libguides.mit.edu/content.php?pid=573578&sid=4729602
article level metrics:
sparc.arl.org/sites/default/files/sparc-alm-primer.pdf
– References:
Eric D. Kolaczyk, 2009, Statistical Analysis of Network Data: Methods and
Models, Springer.
80. Available Databases & API’s
• Scholarly APIs:
libguides.mit.edu/apis
• Google Scholar:
scholar.google.com
• Scopus:
scopus.com
• Web of Science:
admin-apps.webofknowledge.com
• Author identifiers:
libguides.mit.edu/content.php?pid=573578&sid=4729602
• List of MIT-licensed Databases: owens.mit.edu/sfx_local/az/mit_db
• Altmetrics:
– PLOS article metrics article-level-metrics.plos.org
– Plum Analytics plumanalytics.com
– ImpactStory impactstory.org
82. Glossary of Metrics
• Author H-Index:
The maximum number of articles h such that each has received at least h citations
• Centrality
A measure of the importance of some node in the network based on a selected abstract model of
influence/flow across network. Centrality measures include degree centrality (number of connections);
closeness centrality (distance of node to other nodes in network); betweenness centrality (proportion of
information that must pass through the node to go from one part of the network to another)
• (ISI Journal) Impact Factor:
The frequency with which the “average article” has been cited in a particular year. It is based on the most
recent two years of citations. It is only supplied for journals indexed by ISI in the Web of Science.
• Clustering:
Method that partition n observations into k clusters based on the characteristics of the object. Clusters are
defined either by a set of heuristics for forming the cluster, or according to a solution concept that the
clusters will satisfy.
One common algorithm, K-Means assigns each observation to a fixed-K number of clusters such that each
observation belongs to the cluster that has a mean value closest to that of the observation
• Network community structure measures:
The detection of highly interconnected groups of nodes within a network. Methods include hierarchical clustering, information maximization, modularity, and clique detection.
• Network Diameter:
The greatest distance between any two nodes in the network.
• Page Rank:
A family of iteratively calculated, recursive impact factors in which citations from other journals are weighted by the impact of those journals.
(Network diameter, formally: max over node pairs (i, j) of the shortest-path distance(i, j).)
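The Page Rank entry above can be made concrete with a small power-iteration sketch; the three-journal citation matrix and the damping factor are invented for illustration:

import numpy as np

# C[i, j] = number of citations from journal i to journal j (made-up counts)
C = np.array([[0, 3, 1],
              [2, 0, 4],
              [1, 1, 0]], dtype=float)

P = C / C.sum(axis=1, keepdims=True)   # row-normalize: share of i's citations going to j
d = 0.85                               # damping factor, as in classic PageRank
score = np.ones(3) / 3                 # start from uniform scores

for _ in range(100):                   # power iteration until (approximate) convergence
    score = (1 - d) / 3 + d * (P.T @ score)

print(score / score.sum())             # normalized journal impact scores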
This work, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
First, some background and the problem statement. These are three university rankings issued annually, rankings that are of particular importance to research libraries. Some universities even incorporate raising their ranking in their strategic plans. All three use citations as a factor in determining the rankings, which skews the results towards universities focusing on the sciences rather than the humanities. It also puts added weight to authors of journal articles, which are usually not represented in national authority files.
- Who is that dashing fellow?
Hmm… same corpus, but different estimates of impact… note different methods of classification
Customizable – can exclude articles from estimate
Note export format will require cleanup on “Note” field to calculate other measures
Google Scholar is quite limited, both in the journals covered by its metrics and in the metrics available.
Self-citation
Acts as impact reinforcement, not illegitimate [van Raan 2008]
Collaboration – collaborations are responsible for a disproportionate portion of impact [Wuchty, et al. 2007]
Regularity – strong association between being part of the very-high-impact group and publishing at least 3 times/year [Ioannidis, et al. 2014]
Open access associated with higher citations:
Generally associated with higher citations [Eysenbach 2006; Norris, et al. 2008]
For a possible interaction with journal impact, see [Koler-Povh, et al. 2014]
Open access publishing of ETDs does not obstruct publication as a print book (based on publisher surveys) [Seamans 2013]
Sharing data
See CODATA 2013 for a review
Citation measures only one type of use
Downloads are imperfectly correlated with citations, are typically much larger, and vary by discipline [Bollen et al. 2005; Gorraiz et al. 2013]
Applied research impact correlates only weakly with journal impact [Sutherland, et al. 2011]
Clear titles
Papers with compound titles are more highly cited [Fatemah, et al. 2014]
Pleasant titles and mild humor are OK, but avoid primarily humorous titles [Sagi & Yechiam 2008]
Clear keywords and abstract
Papers with keywords distinct from titles are more highly cited [Fatemah, et al. 2014]
Mainstream social media
See Bornmann 2014a, b on the value of Twitter for measuring complementary impact
Creativity – Dewett & Denisi 2004
Null results are less likely to be submitted & published [Franco, et al. 2014; Hopewell et al. 2009; and see CODATA 2012 for a review]
Publication bias leads to overestimates of effects/significance in many fields [Fanelli 2012]
Many data sharing and replication policies are not followed [see CODATA 2012 for a review]
Good science may not pass peer review [Peters & Ceci 1982; Gans & Shepherd 1994; Ginther, et al. 2011]
Much research is not replicable [see CODATA 2012 for a review]
Many publications are not cited at all [Hamilton 1991; Samuels 2011]
Edited volumes are not well cited – the evidence is preliminary but is reinforced by broad anecdotal evidence that edited volumes are typically down-weighted in tenure review [Bishop 2012]
Retraction rates in scientific journals have substantially increased [Brembs et al. 2013 – check CODATA]
Author ordering weighs too heavily in evaluation – be conscious of it [Liran & Yariv 2006]
Multidisciplinary work is cited less [Levitt & Thelwall 2008]
Editorial delays are frequent, and important [Greenburg 2004; Ioannidis 1998]
Not enough time is spent on research [> 75% of associate profs spend < 13 hours/week on research and writing, Hurtado 2012]