CiteSeerX: Mining Scholarly Big Data
1. CiteSeerX: Mining Scholarly Big Data
Invited talk at MITRE Corporation, Hampton, VA, April 2019
Jian Wu
Assistant Professor of Computer Science
Old Dominion University
Search “jian wu odu” on Google; the first result is my page.
2. About myself
• PhD: 2004 - 2011
– Astronomy & Astrophysics
– Sloan Digital Sky Survey (SDSS)
– Hubble Space Telescope (HST)
• Postdoc: 2011 - 2017
– With Dr. C. Lee Giles at Penn State
– Tech Leader (CiteSeerX)
• Assistant Teaching Professor: 2017 - 2018
• Assistant Professor (2018 – present)
– Web Science Digital Library Group (Dr. Nelson and Dr. Weigle)
3. Outline
• Why Scholarly Big Data
– Big picture, key questions, and approaches
• Highlighted Work
– Document type classification
– Citation parsing
– Entity matching
– Domain knowledge entity extraction
• Ongoing and Future Research
4. What is Scholarly Big Data?
[Figure credit: Steve Bryson (NASA), David Kenwright (NASA), Michael Cox (NASA), David Ellsworth (NASA), Robert Haines (MIT), Communications of the ACM (1999)]
• Scholarly Big Data (SBD), a.k.a. Big Scholarly Data
• Coined in the keynote speech by Dr. C. Lee Giles at the 22nd ACM Conference on Information & Knowledge Management (CIKM ’13)
• “Scholarly Big Data” also appears in the 2013 KDD Cup report
5. How BIG is SBD?
• Khabsa & Giles (2014) PLOS: 114M documents (estimated)
• Xia & Wang (2017) IEEE TBD: 100M documents (estimated)
6. Where To Find SBD?
Data types OAG*
Google
Scholar*
Web of
Science Medline CiteSeerX DBLP
Documents 209 M 100 M 45 M 22 M 10 M 5 M
Metadata ✓ X ✓ ✓ ✓ ✓
Citations ✓ X ✓ X ✓ X
URLs ✓ X X X ✓ ✓
Full text X X X X ✓ X
Disambiguated
Authors
X X X X ✓ X
• OAG: Open Academic Graph (2018-11 release)
• Google Scholar: estimated [Khabsa & Giles 2014 PLOS]
✓ Available X Not available
7. CiteSeerX Facts
• 10+ million full text English documents and metadata.
• 1 billion hits and 180 million downloads annually.
• Googling "CiteSeerX OR CiteSeer" returns 10 million results.
• 3 million individual users worldwide, 1/3 from the USA.
• Metadata with 32 million authors and 240 million citation mentions.
• Citation graph with 71 million nodes and 183 million edges.
• OAI metadata accessed 30 million times annually.
• URLs of crawled and indexed documents with duplicates: 40 million.
8. Why Do We Care About SBD?
• The exponential growth of scientific publications since the end of WWI (November 1918) [Larsen and von Ins (2010) Scientometrics]
• Search and ranking: quick and accurate document search over hundreds of millions of documents
• Recommendation: stay tuned for new and impactful discoveries and inventions
• Science of science: understand the trend of science
10. Key Question and Approaches
• Key question: how to make it easier to retrieve relevant and important information out of scholarly big data?
[Diagram: Mining Scholarly Big Data sits at the intersection of Data Mining (heuristics, machine learning, deep learning), Natural Language Processing (parsing, tagging, language modeling, semantics/word embeddings), Information Retrieval (indexing, searching), and Big Data systems (databases, cloud computing, MapReduce).]
11. [Diagram: PDFs are first classified into academic papers vs. non-academic documents. Information extraction (IE) then pulls textual elements (title, authors, venue, year, abstract, citations, body text) and non-textual elements (figures/tables, algorithms, math expressions, chemical formulae). Disambiguation, deduplication, and keyphrase extraction yield typed entities and relations, which feed a knowledge base linked to local and external DBs (data linking, semantics).]
Most high-impact academic papers are published as PDFs in English.
12. Research Highlights in Mining SBD
1. Document type classification
2. Citation parsing
3. Entity matching
4. Domain knowledge entity extraction
13. 1. Document type classification
• Task: academic vs. non-academic
• Traditional approach: rule-based (~80% F1-measure; see the sketch below)
– look for “references”, “bibliography”, etc. in the text
• Challenges:
– articles use different headings for reference sections, e.g., “Notes”
– the word “references” also appears in other document types, e.g., resumes
• Machine learning approach: (>90% F1-measure)
– Random Forest + structural features
• Extension: multiple type classification
– Papers, theses, CVs, slides, books
[Caragea et al. WSDM-WSCBD’14; Caragea et al. AAAI ’16]
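The rule-based baseline is easy to picture. Below is a minimal Python sketch of such a heuristic; the heading list and the matching rule are illustrative assumptions, not the exact rules used in the cited work:

```python
import re

# Illustrative reference-section headings; a real rule set is longer.
REFERENCE_HEADINGS = re.compile(
    r"^\s*(references|bibliography|works cited|notes)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def is_academic_rule_based(text: str) -> bool:
    """Label a document academic if any line looks like a
    reference-section heading. This is why the rule-based approach
    tops out around 80% F1: resumes also contain "References" lines,
    and papers may use headings not on the list."""
    return bool(REFERENCE_HEADINGS.search(text))
```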
14. Structural Features
• File specific features
– size (kilobytes), #pages, etc.
• Text specific features
– #characters, #words, #lines, etc.
• Section specific features
– section names (e.g., “abstract”, “references”),
positions, etc.
• Containment features
– specific phrases (e.g., “this book”, “this chapter”), etc.
[Patel et al. 2019 in preparation]
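As a rough illustration of how the four feature families above could feed a Random Forest, here is a hedged scikit-learn sketch; the field names on `doc` and the concrete features are assumptions for illustration, not the exact feature set of the cited work:

```python
from sklearn.ensemble import RandomForestClassifier

def structural_features(doc):
    """Map a parsed document to the four feature families above.
    `doc` is assumed to expose size, text, and section names."""
    text = doc["text"]
    sections = [s.lower() for s in doc["section_names"]]
    return [
        doc["size_kb"],                      # file specific
        doc["num_pages"],
        len(text),                           # text specific
        len(text.split()),
        text.count("\n"),
        int("abstract" in sections),         # section specific
        int("references" in sections),
        int("this book" in text.lower()),    # containment
        int("this chapter" in text.lower()),
    ]

# X: one feature vector per document; y: labels such as
# "paper", "thesis", "cv", "slides", "book".
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# clf.fit([structural_features(d) for d in train_docs], y_train)
```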
15. 2. Citation Parsing
@article{bea1997evaluation,
  title     = {Evaluation of storm loadings on and capacities of offshore platforms},
  author    = {Bea, RG and Mortazavi, MM and Loch, KJ},
  journal   = {Journal of waterway, port, coastal, and ocean engineering},
  volume    = {123},
  number    = {2},
  pages     = {73--81},
  year      = {1997},
  publisher = {American Society of Civil Engineers}
}
16. Why do citation parsing?
• Automated indexing: navigate to cited papers
• Document conflation: link citation mentions and paper metadata
• Construct the citation graph (right figure)
[Figure: citation graph generated from CiteSeerX data by Giselle Zeno (2014)]
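Once citation strings are parsed and matched to paper records, building a graph like the one pictured is straightforward. A minimal sketch with networkx; the edge list is an assumed input produced by the matching step:

```python
import networkx as nx

# Each pair (citing_id, cited_id) comes from matching parsed
# citation strings against indexed paper metadata.
edges = [("paperA", "paperB"), ("paperA", "paperC"), ("paperB", "paperC")]

G = nx.DiGraph()
G.add_edges_from(edges)

# In-degree approximates citation count; PageRank gives an
# importance score over the whole graph.
citation_counts = dict(G.in_degree())
ranks = nx.pagerank(G)
```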
17. Tools – ParsCit vs. Neural ParsCit
• ParsCit: sequential labeling with Conditional Random Field
Bea, R. G., Mortazavi, M. M., and Loch, K. J., “Evaluation of Storm Loadings and Capacities of Offshore Platforms,” Journal of Waterway, Port, Coastal and Ocean Engineering, Vol. 123, No. 2, ASCE, March/April 1997.
(In the slide, tokens are labeled A = author, T = title, J = journal.)
The label of a token depends on features of the current token and nearby tokens, AND on the labels of nearby tokens.
[Councill, Giles and Kan, LREC’08, Wu et al. IAAI ’14]
• Neural ParsCit: Character-level word embedding + CRF
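To make the CRF formulation concrete, here is a small sketch using the sklearn-crfsuite package (not ParsCit's own toolchain); the feature set and the toy training pair are illustrative:

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Features of the current token and its neighbors; the CRF itself
    additionally conditions on the labels of nearby tokens."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "ends_with_comma": tok.endswith(","),
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats

# One toy training sequence: the author field of the example above.
tokens = "Bea, R. G., Mortazavi, M. M., and Loch, K. J.,".split()
labels = ["author"] * len(tokens)

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # trivially predicts 'author' for every token
```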
18. Neural Network + CRF
[Diagram: each token of “… in Proc of …” receives a character-level encoding concatenated with a word-level encoding before sequence labeling.]
[Prasad et al. IJDL 2018]
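A compact PyTorch sketch of the character-plus-word encoding idea; layer sizes are illustrative, and the CRF decoding layer is omitted for brevity (only per-token emission scores are produced):

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Sketch of the Neural ParsCit idea: each word is represented by
    a character-level BiLSTM encoding concatenated with a word
    embedding; a word-level BiLSTM then produces per-token emission
    scores. A CRF layer (omitted here) would decode the best label
    sequence from these scores."""

    def __init__(self, n_chars, n_words, n_labels,
                 char_dim=25, word_dim=100, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim,
                                 bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden, n_labels)

    def forward(self, word_ids, char_ids):
        # char_ids: (n_tokens, max_word_len); one row of chars per word
        c = self.char_emb(char_ids)
        _, (h, _) = self.char_lstm(c)        # h: (2, n_tokens, char_dim)
        char_repr = torch.cat([h[0], h[1]], dim=-1)
        w = self.word_emb(word_ids)          # (n_tokens, word_dim)
        x = torch.cat([w, char_repr], dim=-1).unsqueeze(0)
        out, _ = self.word_lstm(x)
        return self.emit(out.squeeze(0))     # (n_tokens, n_labels)
```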
19. 3. Entity Matching
• Match data records across multiple databases
– Challenges: primary keys are not available in most cases
• Previous work: search-based (~74% F1-measure)
– Document representation: titles (empirical)
– Based on n-gram queries and Jaccard similarity of titles (see the sketch below)
[Diagram: search-based approach — CiteSeerX titles are decomposed into n-grams, queried against indexed DBLP metadata, and the returned candidates are scored by title similarity.]
[Caragea et al. ECIR’14]
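The search-based matcher boils down to candidate retrieval plus a title-similarity test. A minimal sketch (character trigrams here; the exact n-gram granularity and threshold in the cited work may differ):

```python
def ngrams(title, n=3):
    """Character n-grams of a whitespace-normalized, lowercased title."""
    t = " ".join(title.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 0))}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two n-gram sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Candidates come back from an n-gram query against the indexed DBLP
# metadata; keep those above a similarity threshold (value illustrative).
sim = jaccard(
    ngrams("Evaluation of storm loadings on and capacities of offshore platforms"),
    ngrams("Evaluation of Storm Loadings and Capacities of Offshore Platforms"),
)
is_match = sim > 0.7
```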
20. Entity Matching with Machine Learning
• Document representation
– metadata (title, authors, year, abstracts) + citations (a.k.a. references)
• ML + search
[Diagram: noisy CiteSeerX data is matched against external data in two stages: (1) matching by header metadata against indexed external metadata, then (2) matching by citations against an indexed external citation graph.]
[Sefid et al. IAAI’19]
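A hedged sketch of what the two-stage pairing could look like as features for a binary match/non-match classifier; the field names and the feature set are illustrative assumptions, not the exact design of Sefid et al.:

```python
def title_sim(a, b):
    """Word-overlap Jaccard between two titles."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa or wb) else 0.0

def pair_features(paper, cand):
    """Similarity features for one (CiteSeerX paper, external record)
    candidate pair."""
    return [
        # Stage 1: header metadata signals.
        title_sim(paper["title"], cand["title"]),
        int(paper["year"] == cand["year"]),
        len(set(paper["authors"]) & set(cand["authors"])),
        # Stage 2: overlap of reference lists, resolved against the
        # indexed external citation graph.
        len(set(paper["cited_ids"]) & set(cand["cited_ids"])),
    ]

# A binary classifier (e.g., logistic regression) trained on labeled
# pairs then decides match vs. non-match for each candidate.
```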
21. Entity Matching Evaluation
• Ground truth:

| External data | Positive matching pairs |
|---|---|
| IEEE Xplore | 51 (metadata only) |
| DBLP | 292 (metadata only) |
| Web of Science | 345 (with citations) |
| Combined | 688 |

• Outperforms the search-based method by 14% in precision!
• Best performance using the Web of Science ground truth:
– match by metadata: 92.2% F1-measure
– match by metadata + citations: 99.2% F1-measure
[Wu et al. K-CAP ’17, Sefid et al. IAAI ’19]
22. Entity Matching Application
• Applications
– Data cleansing: cleanse metadata and the citation graph
– 50% of CiteSeerX papers’ metadata is cleansable using the Web of Science, Medline, or DBLP databases
[Wu et al. Big Data’18]
correct title: A New Metric for Banking Integration in Europe
incorrect title: A New Metric for Banking Integration in Europe 1
correct authors: Jian Wu, Allen C. Ge, C. Lee Giles
incorrect authors: Jian Wu1, Allen C. Ge1, C. Lee Giles1,2 IST, Penn State University
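The kind of damage shown above (trailing footnote markers glued onto titles and author names) can often be undone with simple rules before or after record matching; two illustrative regex cleanups (hypothetical rules, not the cleansing pipeline of the cited work):

```python
import re

def strip_author_markers(name: str) -> str:
    """Drop trailing footnote digits that extraction attaches to
    author names, e.g. 'C. Lee Giles1,2' -> 'C. Lee Giles'."""
    return re.sub(r"\d+(?:,\d+)*$", "", name).strip()

def strip_title_footnote(title: str) -> str:
    """Drop a lone trailing digit, e.g. '... in Europe 1'."""
    return re.sub(r"\s+\d+$", "", title)

print(strip_author_markers("Jian Wu1"))  # Jian Wu
print(strip_title_footnote("A New Metric for Banking Integration in Europe 1"))
```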
23. 4. Domain Knowledge Entity Extraction
• A Domain Knowledge Entity is a phrase representing domain knowledge in an academic document.
• Noun phrases
• NOT just keyphrases, though keyphrases CAN BE domain knowledge entities
[Wu et al. SIGMOD-SBD’16; Wu et al. JCDL’17]
Example sentence: Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.
Stanford NER tags “Evolutionary Algorithms” as ORGANIZATION, while the desired domain knowledge entities are “Evolutionary Algorithms”, “stochastic optimization methods”, and “natural evolution”.
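Since domain knowledge entities are noun phrases, a natural first step is candidate generation with an off-the-shelf chunker rather than an NER model. A sketch with spaCy (assumes the en_core_web_sm model is installed; this is candidate generation only, not the extraction method of the cited work):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Evolutionary Algorithms are the stochastic optimization "
        "methods, simulating the behavior of natural evolution.")

doc = nlp(text)
# Noun chunks are candidate domain knowledge entities; a downstream
# model must still decide which candidates actually carry domain
# knowledge.
candidates = [(chunk.text, chunk.start_char, chunk.end_char)
              for chunk in doc.noun_chunks]
```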
24. Training/Testing Datasets
• SemEval 2017 Task 10, dual-labeled
– 400 documents, 350 training, 50 testing
• Each document is a passage from a journal article in ScienceDirect in Computer Science, Physics, or Materials Science
• Challenge: must extract the exact phrase span (positions of characters)
Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.
– Evolutionary Algorithms: 1-23
– stochastic optimization methods: 32-63
– natural evolution: 92-109
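Because scoring requires exact character spans, mapping gold phrases back to offsets has to be precise. A small helper (0-indexed with exclusive end, one common convention; offset conventions vary, and the slide's numbers use a slightly different indexing for the first span):

```python
def char_span(passage: str, phrase: str):
    """Return (start, end) character offsets of phrase in passage,
    0-indexed with exclusive end, or None if absent."""
    start = passage.find(phrase)
    return (start, start + len(phrase)) if start >= 0 else None

passage = ("Evolutionary Algorithms are the stochastic optimization "
           "methods, simulating the behavior of natural evolution.")
print(char_span(passage, "Evolutionary Algorithms"))          # (0, 23)
print(char_span(passage, "stochastic optimization methods"))  # (32, 63)
print(char_span(passage, "natural evolution"))                # (92, 109)
```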
27. Ongoing Work
• Subject Category Classification
– Motivation: support facet search of scholarly big data
[Figure: facet search on Amazon]
28. Problem Formalization
• Multiclass Classification
• Final goal: 252 Subject Categories (Web of Science Schema)
• Preliminary study: 6 subject categories:
| Physics | Chemistry | Biology | Materials Science | Computer Science | Others |
|---|---|---|---|---|---|
| 1.10M | 1.09M | 456k | 260k | 169k | 150k |
[Chart: MLP vs. classic ML classifiers (LR, RF, MNB, SVM, MLP) — Micro-F1 on the left axis (0.74–0.84) and test time in seconds on the right axis (0–10).]
[Wu et al. BigData’18]
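A hedged sketch of the preliminary setup: TF-IDF text features into an MLP via scikit-learn (hyperparameters are illustrative; the cited work's models and features may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Documents are represented by their text (e.g., title + abstract);
# labels are one of the six subject categories above.
model = make_pipeline(
    TfidfVectorizer(max_features=50000, sublinear_tf=True),
    MLPClassifier(hidden_layer_sizes=(256,), max_iter=20),
)
# model.fit(train_texts, train_labels)
# micro_f1 = f1_score(test_labels, model.predict(test_texts),
#                     average="micro")
```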
29. Collaborators
• NSF CRI: Towards sustainable support of scholarly big data
– Co-PI, $770K, PI: C. Lee Giles (Penn State)
• Keyphrase Extraction - Cornelia Caragea (UIC)
• ETD Mining - Edward A. Fox (Virginia Tech)
• Math IR - Richard Zanibbi (RIT)
• Citation Parsing – Min-Yen Kan (NUS)
Editor's Notes
In Williams & Giles ’14 DocEng: only 20 matching pairs were used