CiteSeerX:
Mining Scholarly Big Data
Invited talk at MITRE Corporation, Hampton, VA, April, 2019
Jian Wu
Assistant Professor of Computer Science
Old Dominion University
Search “jian wu odu” on Google; the first result is my page
About myself
• PhD: 2004 - 2011
– Astronomy & Astrophysics
– Sloan Digital Sky Survey (SDSS)
– Hubble Space Telescope (HST)
• Postdoc: 2011 - 2017
– With Dr. C. Lee Giles at Penn State
– Tech Leader (CiteSeerX)
• Assistant Teaching Professor: 2017 - 2018
• Assistant Professor (2018 – )
– Web Science Digital Library Group (Dr. Nelson and Dr. Weigle)
Outline
• Why Scholarly Big Data
– Big picture, key questions, and approaches
• Highlighted Work
– Document type classification
– Citation parsing
– Entity matching
– Domain knowledge entity extraction
• Ongoing and Future Research
What is Scholarly Big Data
Steve Bryson (NASA), David Kenwright (NASA), Michael Cox (NASA), David Ellsworth (NASA), Robert Haines (MIT). Communications of the ACM (1999)
• Scholarly Big Data (SBD), a.k.a. Big Scholarly Data
• Coined in the keynote speech by Dr. C. Lee Giles at the 22nd ACM Conference on Information & Knowledge Management (CIKM ’13)
• “Scholarly Big Data” appears in the 2013 KDD Cup report
How BIG is SBD?
Khabsa & Giles (2014) PLOS
Xia & Wang (2017) IEEE TBD
114M (estimated)
100M (estimated)
Where To Find SBD?
Data types            | OAG*  | Google Scholar* | Web of Science | Medline | CiteSeerX | DBLP
Documents             | 209 M | 100 M           | 45 M           | 22 M    | 10 M      | 5 M
Metadata              | ✓     | X               | ✓              | ✓       | ✓         | ✓
Citations             | ✓     | X               | ✓              | X       | ✓         | X
URLs                  | ✓     | X               | X              | X       | ✓         | ✓
Full text             | X     | X               | X              | X       | ✓         | X
Disambiguated authors | X     | X               | X              | X       | ✓         | X
✓ Available, X Not available
• OAG: Open Academic Graph (2018-11 release)
• Google Scholar: estimated [Khabsa & Giles 2014 PLOS]
CiteSeerX Facts
• 10+ million full text English documents and metadata.
• 1 billion hits and 180 million downloads annually.
• Googling "CiteSeerX OR CiteSeer" returns 10 million results.
• 3 million individual users worldwide, 1/3 from the USA.
• Metadata with 32 million authors and 240 million citation mentions.
• Citation graph with 71 million nodes and 183 million edges.
• OAI metadata accessed 30 million times annually.
• URLs of crawled and indexed documents with duplicates: 40 million.
Why Do We Care About SBD?
Larsen and von Ins (2010) Scientometrics
• The exponential growth of scientific publications since the end of WWI (November 1918)
• Search and ranking: quick and accurate document search over hundreds of millions of documents
• Recommendation: stay tuned for new and impactful discoveries and inventions
• Science of science: understand the trends of science
SBD Used in “Science of Science”
Key Question and Approaches
• Key question: How to make it easier to retrieve relevant
and important information out of scholarly big data?
[Figure: Mining Scholarly Big Data. Labels: Data Mining; Big Data; Heuristic; Machine Learning; Deep Learning; Parsing, Tagging; Language Modeling; Semantics (Word Embedding); Database; Indexing; Searching; Cloud Computing; MapReduce; System; Natural Language Processing; Information Retrieval]
[Figure: processing of PDFs. Labels: Classification (Academic papers, Non-Academic); IE; Textual; Non-textual; Title; Authors; Abstract; Venue; Year; Citation; Body text; Figure/Table; Algorithm; Math expression; Chemical formulae; Disambiguation; Deduplication; Keyphrases; Typed Entities; Relations; Knowledge Base; Local DB; External DB; Data Linking; Semantics]
Most high impact academic papers are published in PDF in English
Research Highlights in Mining SBD
1. Document type classification
2. Citation parsing
3. Entity matching
4. Domain knowledge entity extraction
1. Document type classification
• Task: academic vs. non-academic
• Traditional approach: rule-based (~80% F1-measure)
– look for “references”, “bibliography”, etc. in text
• Challenges:
– articles use different headings for reference sections, e.g., “Notes”
– “references” are used in other documents, e.g., “resumes”
• Machine learning approach: (>90% F1-measure)
– Random Forest + structural features
• Extension: multiple type classification
– Papers, theses, CVs, slides, books
[Caragea et al. WSDM-WSCBD’14; Caragea et al. AAAI ’16]
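The rule-based baseline above can be sketched as a simple heading check. This is a minimal illustration, not the CiteSeerX implementation: the function name `looks_academic`, the keyword tuple, and the three sample texts are all assumptions made up for the example.

```python
import re

# Hypothetical keyword list; the slide names only "references" and "bibliography".
REFERENCE_HEADINGS = ("references", "bibliography")


def looks_academic(text: str) -> bool:
    """Return True if any line of the text is a bare reference-section heading."""
    for line in text.splitlines():
        # Drop numbering and punctuation, e.g. "7. References" or "REFERENCES:".
        heading = re.sub(r"[^a-z]", "", line.lower())
        if heading in REFERENCE_HEADINGS:
            return True
    return False


# Made-up sample documents that mirror the two challenges listed above.
paper = "Abstract\nWe study X.\n7. References\n[1] A. Author. Title. 2010."
notes_paper = "Abstract\nWe study X.\nNotes\n1. A. Author, Title (2010)."
resume = "Jane Doe\nExperience\nReferences\nAvailable on request."

print(looks_academic(paper))        # heading found
print(looks_academic(notes_paper))  # missed: the section is headed "Notes"
print(looks_academic(resume))       # false positive: resumes also list references
```

The second and third calls show why a keyword rule alone plateaus and why additional features are needed.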
Structural Features
• File specific features
– size (kilobytes), #pages, etc.
• Text specific features
– #characters, #words, #lines, etc.
• Section specific features
– section names (e.g., “abstract”, “references”), positions, etc.
• Containment features
– specific phrases (e.g., “this book”, “this chapter”), etc.
[Patel et al. 2019 in preparation]
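The four feature groups can be sketched as one function that turns a document into a flat feature dictionary, which a classifier such as the Random Forest mentioned earlier could consume. The function name, the feature names, and the choice of relative line position for sections are assumptions for illustration; the section names and phrases are the examples given on the slide.

```python
# Hypothetical lists, taken from the examples on the slide.
SECTION_NAMES = ("abstract", "references")
CONTAINMENT_PHRASES = ("this book", "this chapter")


def structural_features(text: str, size_kb: float, num_pages: int) -> dict:
    """Build a flat feature dictionary for one document."""
    lower = text.lower()
    lines = text.splitlines()
    features = {
        # File specific features (supplied by the caller, e.g. from the PDF file)
        "size_kb": size_kb,
        "num_pages": num_pages,
        # Text specific features
        "num_chars": len(text),
        "num_words": len(text.split()),
        "num_lines": len(lines),
    }
    # Section specific features: relative position of the first line that
    # equals the section name, or -1.0 when the section is absent.
    for name in SECTION_NAMES:
        position = -1.0
        for i, line in enumerate(lines):
            if line.strip().lower() == name:
                position = i / max(len(lines) - 1, 1)
                break
        features["pos_" + name] = position
    # Containment features: 1 if the phrase occurs anywhere in the text.
    for phrase in CONTAINMENT_PHRASES:
        features["has_" + phrase.replace(" ", "_")] = int(phrase in lower)
    return features


doc = "Abstract\nIn this chapter we survey X.\nReferences\n[1] ..."
feats = structural_features(doc, size_kb=12.5, num_pages=1)
print(feats)
```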
2. Citation Parsing
@article{bea1997evaluation,
  title={Evaluation of storm loadings on and capacities of offshore platforms},
  author={Bea, RG and Mortazavi, MM and Loch, KJ},
  journal={Journal of waterway, port, coastal, and ocean engineering},
  volume={123},
  number={2},
  pages={73--81},
  year={1997},
  publisher={American Society of Civil Engineers}
}
Why do citation parsing?
• Automated indexing
– Navigate to cited papers
• Document conflation
– Link citation mentions and paper metadata
• Construct citation graph
[Figure: citation graph generated from CiteSeerX data by Giselle Zeno (2014)]
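Once citations are parsed and linked to papers, constructing the citation graph amounts to collecting directed edges. The sketch below is a minimal illustration with made-up paper identifiers; `build_citation_graph` is a hypothetical helper, not a CiteSeerX API.

```python
def build_citation_graph(citation_pairs):
    """Build adjacency sets from (citing_id, cited_id) pairs."""
    graph = {}
    for citing, cited in citation_pairs:
        graph.setdefault(citing, set()).add(cited)
        # Papers that are only cited still count as nodes.
        graph.setdefault(cited, set())
    return graph


# Made-up identifiers; the repeated pair collapses into a single edge.
pairs = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p1", "p2")]
graph = build_citation_graph(pairs)

num_nodes = len(graph)
num_edges = sum(len(targets) for targets in graph.values())
times_cited_p3 = sum(1 for targets in graph.values() if "p3" in targets)
print(num_nodes, num_edges, times_cited_p3)
```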
Tools – ParsCit vs. Neural ParsCit
• ParsCit: sequential labeling with Conditional Random Field (CRF)
Bea, R. G., Mortazavi, M. M., and Loch, K. J., “Evaluation of Storm Loadings and Capacities of Offshore Platforms,” Journal of Waterway. Port. Coastal and Ocean Engineering. Vol. 123, No. 2, ASCE, March/April 1997.
[Figure: per-token labels A, T, and J (presumably author, title, and journal) shown beneath the citation string]
The label of a token depends on features of the current token and nearby tokens, AND on the labels of nearby tokens.
[Councill, Giles and Kan, LREC ’08; Wu et al. IAAI ’14]
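The feature side of this sequence labeling can be sketched as a function that describes each token together with its neighbors. This is only an illustration: the feature names are hypothetical and are not ParsCit's actual feature set, and the dependence on neighboring labels is left to the CRF itself, which is not shown here.

```python
import re


def token_features(tokens, i):
    """Features for token i, including a one-token window on each side."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        # Author initials such as "R." or "G.,"
        "is_initial": bool(re.fullmatch(r"[A-Z]\.,?", tok)),
        "has_digit": any(c.isdigit() for c in tok),
        # Four-digit years such as "1997."
        "is_year": bool(re.fullmatch(r"(19|20)\d{2}[.,]?", tok)),
        "ends_with_comma": tok.endswith(","),
    }
    # Nearby tokens give the model context for the current label.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


# Shortened version of the citation string on the slide.
citation = "Bea, R. G., Mortazavi, M. M., and Loch, K. J., Vol. 123, No. 2, ASCE, March/April 1997."
tokens = citation.split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[1])
```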
• Neural ParsCit: character-level word embedding + CRF
[Figure: Neural Network + CRF, with character-level encoding and word-level encoding, on the example “… in Proc of …”]
[Prasad et al. IJDL 2018]
3. Entity Matching
• Match data records across multiple databases
– Challenges: primary keys are not available in most cases
• Previous work: search-based (~74% F1-measure)
– Document representation: titles (empirical)
– Based on n-gram query and Jaccard similarity of titles
[Figure: search-based approach. Labels: CiteSeerX; Titles; N-grams; Indexed DBLP metadata; Candidates; Similarity]
[Caragea et al. ECIR’14]
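The title-based matching step can be sketched with n-grams and Jaccard similarity. Several details here are assumptions: the slide does not say whether the n-grams are word or character based (word bigrams are used below), the 0.7 threshold is arbitrary, the second candidate title is invented, and the candidate list stands in for results returned by a search index.

```python
def word_ngrams(title, n=2):
    """Set of lowercase word n-grams; short titles fall back to one tuple."""
    tokens = title.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query_title, candidate_titles, threshold=0.7, n=2):
    """Return (title, score) of the most similar candidate, or None."""
    query = word_ngrams(query_title, n)
    scored = [(jaccard(query, word_ngrams(c, n)), c) for c in candidate_titles]
    if not scored:
        return None
    score, title = max(scored)
    return (title, score) if score >= threshold else None


noisy_title = "A New Metric for Banking Integration in Europe 1"
candidates = [
    "A New Metric for Banking Integration in Europe",
    "Banking Integration and Growth in Europe",  # made-up distractor
]
match = best_match(noisy_title, candidates)
print(match)
```

The noisy title shares 7 of the 8 distinct bigrams with the first candidate, so the stray trailing "1" does not prevent the match.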
Entity Matching with Machine Learning
• Document representation
– metadata (title, authors, year, abstract) + citations (a.k.a. references)
• ML + search
[Figure: two matching steps, (1) matching by header metadata and (2) matching by citations. Labels: noisy data; indexed external metadata; indexed external citation graph; external data]
[Sefid et al. IAAI ’19]
Entity Matching Evaluation
• Ground truth
External data  | Positive matching pairs
IEEE Xplore    | 51 (metadata only)
DBLP           | 292 (metadata only)
Web of Science | 345 (with citations)
Combined       | 688
• Outperforms the search-based method by 14% in precision!
• Best performance using Web of Science ground truth:
– match by metadata: 92.2% F1-measure
– match by metadata + citation: 99.2% F1-measure
[Wu et al. K-CAP ’17, Sefid et al. IAAI ’19]
Entity Matching Application
• Applications
– Data cleansing: cleanse metadata and citation graph
– 50% of CiteSeerX papers’ metadata is cleansable using the Web of Science, Medline, or DBLP databases
[Wu et al. Big Data ’18]
correct title: A New Metric for Banking Integration in Europe
incorrect title: A New Metric for Banking Integration in Europe 1
correct authors: Jian Wu, Allen C. Ge, C. Lee Giles
incorrect authors: Jian Wu1, Allen C. Ge1, C. Lee Giles1,2 IST, Penn State University
4. Domain Knowledge Entity Extraction
• A Domain Knowledge Entity is a phrase representing domain knowledge in an academic document.
• Noun phrases
• NOT just keyphrases, though keyphrases CAN BE domain knowledge entities
[Wu et al. SIGMOD-SBD’16; Wu et al. JCDL’17]
Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.
Stanford NER tag: ORGANIZATION
Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.
Training/Testing Datasets
• SemEval 2017 Task 10, dual-labeled
– 400 documents, 350 training, 50 testing
• Each document is a passage from a journal article in ScienceDirect in Computer Science, Physics, and Materials Science
• Challenge: must extract the exact phrase span (positions of characters)
Evolutionary Algorithms: 1-23
stochastic optimization methods: 32-63
natural evolution: 92-109
Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.
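The span requirement can be made concrete by computing character offsets for the three example phrases. The sketch assumes 0-based start offsets with an exclusive end; under that convention the second and third spans come out as listed on the slide (32-63 and 92-109), while the first comes out as 0-23 where the slide shows 1-23. The helper name `find_spans` is hypothetical.

```python
def find_spans(passage, phrases):
    """Return (start, end) offsets of each phrase: 0-based start, exclusive end."""
    spans = []
    for phrase in phrases:
        start = passage.find(phrase)
        if start == -1:
            raise ValueError("phrase not found: " + phrase)
        spans.append((start, start + len(phrase)))
    return spans


passage = (
    "Evolutionary Algorithms are the stochastic optimization methods, "
    "simulating the behavior of natural evolution."
)
phrases = [
    "Evolutionary Algorithms",
    "stochastic optimization methods",
    "natural evolution",
]
spans = find_spans(passage, phrases)
print(spans)  # [(0, 23), (32, 63), (92, 109)]
```

An extractor that returns "optimization methods" instead of "stochastic optimization methods" would produce a different span and would not count as an exact match.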
Extractor architecture
[Figure: extractor architecture. Labels: Test passage; Preprocessing; Noun Phrase (NP) Chunking; SVM Classifier; CRF Model; Training passages; External text corpora; NP_N; NP_S; NP_U; Rule-based Filters; Entities; sequential; non-sequential]
Entity Extractor Performance
Approach                                                | Precision | Recall | F1
NP-chunking                                             | 26%       | 54%    | 35%
NP-chunking + SVM classifier                            | 42%       | 40%    | 41%
CRF                                                     | 46%       | 33%    | 39%
CRF + NP-chunking + SVM classifier                      | 39%       | 56%    | 46%
CRF + NP-chunking + SVM classifier + rule-based filters | 46%       | 56%    | 50%
Winner (without B-LSTM)                                 | -         | -      | 50%
Winner (with B-LSTM)                                    | -         | -      | 54%
[Wu et al. JCDL’17; Ammar et al. 2017]
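The F1 column is the harmonic mean of precision and recall. The sketch below recomputes it for three rows of the table; because the slide reports precision and recall as rounded percentages, a recomputed value can differ from the reported one by about a point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "NP-chunking": (0.26, 0.54),
    "NP-chunking + SVM classifier": (0.42, 0.40),
    "CRF + NP-chunking + SVM classifier": (0.39, 0.56),
}
for name, (p, r) in rows.items():
    print(name, round(f1_score(p, r), 2))
```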
Ongoing Work
• Subject Category Classification
– Motivation: support facet search of scholarly big data
[Figure: facet search on Amazon]
Problem Formalization
• Multiclass Classification
• Final goal: 252 Subject Categories (Web of Science Schema)
• Preliminary study: 6 subject categories:
Physics | Chemistry | Biology | Materials Science | Computer Science | Others
1.10M   | 1.09M     | 456k    | 260k              | 169k             | 150k
[Figure: MLP vs. classic ML classifiers. Classifiers: LR, RF, MNB, SVM, MLP; axes: Micro-F1 and test time (sec)]
[Wu et al. BigData’18]
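Micro-F1, the metric in the comparison above, pools true positives, false positives, and false negatives over all classes before computing F1. The sketch below is a plain-Python illustration with made-up labels; it is not the evaluation code behind the reported results. For single-label multiclass data like this, micro-F1 equals accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all labels seen in y_true or y_pred."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Made-up predictions over the subject categories named on the slide.
y_true = ["Physics", "Chemistry", "Biology", "Physics", "Others"]
y_pred = ["Physics", "Physics", "Biology", "Physics", "Computer Science"]
score = micro_f1(y_true, y_pred)
print(score)
```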
Collaborators
• NSF CRI: Towards sustainable support of scholarly big data
– Co-PI, $770K, PI: C. Lee Giles (Penn State)
• Keyphrase Extraction - Cornelia Caragea (UIC)
• ETD Mining - Edward A. Fox (VTech)
• Math IR - Richard Zanibbi (RIT)
• Citation Parsing – Min-Yen Kan (NUS)
29

More Related Content

Similar to CiteSeerX: Mining Scholarly Big Data

Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisStuart Wrigley
 
Enriching Literature Reviews with Text Mining Tools Case: Group Support Systems
Enriching Literature Reviews with Text Mining ToolsCase: Group Support SystemsEnriching Literature Reviews with Text Mining ToolsCase: Group Support Systems
Enriching Literature Reviews with Text Mining Tools Case: Group Support Systemsfpmconnect
 
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014Susanna-Assunta Sansone
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Lucidworks
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...Angelo Salatino
 
A Knowledge Discovery Framework for Planetary Defense
A Knowledge Discovery Framework for Planetary DefenseA Knowledge Discovery Framework for Planetary Defense
A Knowledge Discovery Framework for Planetary DefenseYongyao Jiang
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsJiaheng Lu
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representationsMarco Quartulli
 
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...Artificial Intelligence Institute at UofSC
 
Trailblazing in the Wilderness of Data Management
Trailblazing in the Wilderness of Data ManagementTrailblazing in the Wilderness of Data Management
Trailblazing in the Wilderness of Data ManagementStephanie Wright
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...Yongyao Jiang
 
The paper trail:steps towards a reference model for the metadata ecology
The paper trail:steps towards a reference model for the metadata ecologyThe paper trail:steps towards a reference model for the metadata ecology
The paper trail:steps towards a reference model for the metadata ecologyR. John Robertson
 
DataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyRichard Zijdeman
 
empirical-SLR.pptx
empirical-SLR.pptxempirical-SLR.pptx
empirical-SLR.pptxJitha Kannan
 

Similar to CiteSeerX: Mining Scholarly Big Data (20)

Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log Analysis
 
Enriching Literature Reviews with Text Mining Tools Case: Group Support Systems
Enriching Literature Reviews with Text Mining ToolsCase: Group Support SystemsEnriching Literature Reviews with Text Mining ToolsCase: Group Support Systems
Enriching Literature Reviews with Text Mining Tools Case: Group Support Systems
 
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
A Knowledge Discovery Framework for Planetary Defense
A Knowledge Discovery Framework for Planetary DefenseA Knowledge Discovery Framework for Planetary Defense
A Knowledge Discovery Framework for Planetary Defense
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing Paradigms
 
Realizing Semantic Web - Light Weight semantics and beyond
Realizing Semantic Web - Light Weight semantics and beyondRealizing Semantic Web - Light Weight semantics and beyond
Realizing Semantic Web - Light Weight semantics and beyond
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
 
Trailblazing in the Wilderness of Data Management
Trailblazing in the Wilderness of Data ManagementTrailblazing in the Wilderness of Data Management
Trailblazing in the Wilderness of Data Management
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
 
The paper trail:steps towards a reference model for the metadata ecology
The paper trail:steps towards a reference model for the metadata ecologyThe paper trail:steps towards a reference model for the metadata ecology
The paper trail:steps towards a reference model for the metadata ecology
 
DataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE Education Module 08: Data Citation
DataONE Education Module 08: Data Citation
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.key
 
empirical-SLR.pptx
empirical-SLR.pptxempirical-SLR.pptx
empirical-SLR.pptx
 

Recently uploaded

Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 

Recently uploaded (20)

Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 

CiteSeerX: Mining Scholarly Big Data

  • 1. CiteSeerX: Mining Scholarly Big Data Invited talk at MITRE Corporation, Hampton, VA, April, 2019 Jian Wu Assistant Professor of Computer Science Old Dominion University Search “jian wu odu” on Google, the first result is my page
  • 2. About myself • PhD: 2004 - 2011 – Astronomy & Astrophysics – Sloan Digital Sky Survey (SDSS) – Hubble Space Telescope (HST) • Postdoc: 2011 - 2017 – With Dr. C. Lee Giles at Penn State – Tech Leader (CiteSeerX) • Assistant Teaching Professor: 2017 - 2018 • Assistant Professor (2018 – ) – Web Science Digital Library Group (Dr. Nelson and Dr. Weigle) 2
  • 3. Outline • Why Scholarly Big Data – Big picture, key questions, and approaches • Highlighted Work – Document type classification – Citation parsing – Entity matching – Domain knowledge entity extraction • Ongoing and Future Research 3
  • 4. What is Scholarly Big Data 4 Steve Bryson (NASA) David Kenwright (NASA) Michael Cox (NASA) David EIIsworth (NASA) Robert Haines (MIT) Communications of ACM (1999) • Scholary Big Data (SBD) a.k.a. Big Scholarly Data • Coined in the keynote speech by Dr. C. Lee Giles at the 22nd ACM Conference on Information & Knowledge Management (CIKM ’13) • “Scholarly Big Data" appears in the 2013 KDD Cup report
  • 5. How BIG is SBD? Khabsa & Giles (2014) PLOS Xia & Wang (2017) IEEE TBD 114M (estimated) 100M (estimated)
  • 6. Where To Find SBD? Data types OAG* Google Scholar* Web of Science Medline CiteSeerX DBLP Documents 209 M 100 M 45 M 22 M 10 M 5 M Metadata ✓ X ✓ ✓ ✓ ✓ Citations ✓ X ✓ X ✓ X URLs ✓ X X X ✓ ✓ Full text X X X X ✓ X Disambiguated Authors X X X X ✓ X • OAG: Open Academic Graph (2018-11 release) • Google Scholar: estimated [Khabsa & Giles 2014 PLOS] ✓ Available X Not available 6
  • 7. CiteSeerX Facts • 10+ million full text English documents and metadata. • 1 billion hits and 180 million downloads annually. • Googling "CiteSeerX OR CiteSeer" returns 10 million results. • 3 million individual users world wide, 1/3 from the USA. • Metadata with 32 million authors and 240 million citation mentions. • Citation graph with 71 million nodes and 183 million edges. • OAI metadata accessed 30 million times annually. • URLs of crawled and indexed documents with duplicates: 40 million. 7
  • 8. Why Do We Care About SBD? Larsen and von Ins (2010) Scientometrics • The exponential growth of scientific publications since the end of WWI • Search and ranking: Quick and accurate document search on hundreds of millions of documents • Recommendation: stay tuned for new and impactful discoveries and inventions • Science of science: Understand the the trend of science November 1918
  • 9. SBD Used in “Science of Science”
  • 10. Key Question and Approaches • Key question: How to make it easier to retrieve relevant and important information out of scholarly big data? 10 Data Mining Big Data Heuristic Machine Learning Deep Learning Parsing, Tagging Language Modeling Semantics (Word Embedding) Database Indexing Searching Cloud Computing MapReduce System Mining Scholarly Big Data Natural Language Processing Information Retrieval
  • 11. Academic papers Non- Academic Classification Textual Non-textual Title CitationVenue Year AbstractAuthors Figure/ Table Algorithm Math expression IE Body text DisambiguationDeduplication Keyphrases Typed Entities Relations Knowledge Base Local DBExternal DB Data Linking Semantics PDFs 11 Chemical formulae Most high impact academic papers are published in PDF in English
  • 12. Research Highlights in Mining SBD 1. Document type classification 2. Citation parsing 3. Entity matching 4. Domain knowledge entity extraction
  • 13. 1. Document type classification • Task: academic vs. non-academic • Traditional approach: rule-based (~80% F1-measure) – look for “references”, “bibliography”, etc. in text • Challenges: – articles use different headings for reference sections, e.g., “Notes” – “references” are used in other documents, e.g., “resumes” • Machine learning approach: (>90% F1-measure) – Random Forest + structural features • Extension: multiple type classification – Papers, theses, CVs, slides, books [Caragea et al. WSDM-WSCBD’14; Caragea et al. AAAI ’16]
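The rule-based baseline on this slide can be sketched in a few lines of Python. This is only an illustration, not the CiteSeerX implementation: the function name `looks_academic`, the keyword tuple, and the heading-normalization regex are assumptions; the slide names only the cue words “references” and “bibliography”.

```python
import re

# Cue words named on the slide; a real rule set would likely be longer (assumption).
REFERENCE_HEADINGS = ("references", "bibliography")

def looks_academic(text: str) -> bool:
    """Return True if some line is a bare reference-section heading."""
    for line in text.splitlines():
        # Drop optional numeric section numbering such as "7." or "7)".
        heading = re.sub(r"^\d+[.)]?\s+", "", line.strip())
        heading = heading.lower().rstrip(":")
        if heading in REFERENCE_HEADINGS:
            return True
    return False

paper = "A Title\nAbstract\nBody text\n7. References\n[1] Doe 2001"
resume = "Jane Doe\nExperience\nReferences\nAvailable upon request"
article_with_notes = "An Article\nBody text\nNotes\n1. Smith 1999"
```

The three sample strings mirror the challenges listed on the slide: the resume is accepted because it also has a “References” heading, and the article that calls its reference section “Notes” is rejected.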
  • 14. Structural Features • File specific features – size (kilobytes), #pages, etc. • Text specific features – #characters, #words, #lines, etc. • Section specific features – section names (e.g., “abstract”, “references”), positions, etc. • Containment features – specific phrases (e.g., “this book”, “this chapter”), etc. [Patel et al. 2019 in preparation]
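A minimal sketch of how the four feature groups above might be computed from extracted text. The function name, the feature keys, and the choice to encode section position as a relative line offset are assumptions for illustration; the slides do not give the actual feature definitions.

```python
def structural_features(text: str, file_size_bytes: int, num_pages: int) -> dict:
    """Compute illustrative file, text, section, and containment features."""
    lines = text.splitlines()
    headings = [ln.strip().lower() for ln in lines]
    lowered = text.lower()

    features = {
        # File-specific features
        "size_kb": file_size_bytes / 1024,
        "num_pages": num_pages,
        # Text-specific features
        "num_chars": len(text),
        "num_words": len(text.split()),
        "num_lines": len(lines),
    }
    # Section-specific features: presence and relative position of a heading line.
    for name in ("abstract", "references"):
        present = name in headings
        features[f"has_{name}"] = present
        features[f"pos_{name}"] = headings.index(name) / len(lines) if present else -1.0
    # Containment features: phrases typical of non-paper documents.
    for phrase in ("this book", "this chapter"):
        features["contains_" + phrase.replace(" ", "_")] = phrase in lowered
    return features

sample = "A Title\nAbstract\nIn this chapter we study X.\nReferences\n[1] Doe 2001"
feats = structural_features(sample, file_size_bytes=2048, num_pages=1)
```

A dictionary like `feats` could then be vectorized and passed to a classifier such as the Random Forest mentioned on the previous slide.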
  • 15. 2. Citation Parsing
      @article{bea1997evaluation,
        title={Evaluation of storm loadings on and capacities of offshore platforms},
        author={Bea, RG and Mortazavi, MM and Loch, KJ},
        journal={Journal of waterway, port, coastal, and ocean engineering},
        volume={123},
        number={2},
        pages={73--81},
        year={1997},
        publisher={American Society of Civil Engineers}
      }
  • 16. Why do citation parsing? • Automated indexing – Navigate to cited papers • Document conflation – Link citation mentions and paper metadata • Construct citation graph [Figure: citation graph generated from CiteSeerX data by Giselle Zeno (2014)]
  • 17. Tools – ParsCit vs. Neural ParsCit • ParsCit: sequence labeling with a Conditional Random Field (CRF). Example citation string: Bea, R. G., Mortazavi, M. M., and Loch, K. J., “Evaluation of Storm Loadings and Capacities of Offshore Platforms,” Journal of Waterway. Port. Coastal and Ocean Engineering. Vol. 123, No. 2, ASCE, March/April 1997. [Figure: each token carries a label, shown as A, T, or J] The label of a token depends on features of the current token and nearby tokens, AND on the labels of nearby tokens. [Councill, Giles and Kan, LREC’08; Wu et al. IAAI ’14] • Neural ParsCit: character-level word embedding + CRF
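To make “features of the current token and nearby tokens” concrete, here is a minimal sketch of a per-token feature function of the kind that is typically fed to a CRF. The specific features and the name `token_features` are assumptions, not ParsCit's actual feature set; the dependency on neighboring labels is handled by the CRF model itself and is not shown here.

```python
def token_features(tokens, i, window=1):
    """Features for tokens[i], plus lowercase forms of neighbors within `window`."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "ends_with_period": tok.endswith("."),
        "ends_with_comma": tok.endswith(","),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the citation string get a padding symbol.
        feats[f"lower[{offset:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats

tokens = "Bea, R. G., Mortazavi, M. M., and Loch, K. J.,".split()
first = token_features(tokens, 0)
```

Calling `token_features` on every position yields one feature dictionary per token, which is the usual input shape for CRF toolkits.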
  • 18. Neural Network + CRF [Figure labels: example tokens “… in Proc of …”; character-level encoding; word-level encoding] [Prasad et al. IJDL 2018]
  • 19. 3. Entity Matching • Match data records across multiple databases – Challenges: primary keys are not available in most cases • Previous work: search-based (~74% F1-measure) – Document representation: titles (empirical) – Based on n-gram query and Jaccard similarity of titles [Figure, search-based approach; labels: CiteSeerX Titles; N-grams; Indexed DBLP metadata; Candidates; Similarity] [Caragea et al. ECIR’14]
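The title-similarity step of the search-based approach can be sketched as follows. This is a simplified illustration under stated assumptions: word bigrams are used as the n-grams, candidates are scanned linearly instead of being retrieved from a search index, and the 0.5 threshold and the function names are hypothetical.

```python
import re

def word_ngrams(title, n=2):
    """Return the set of lowercase word n-grams in a title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(query_title, candidate_titles, threshold=0.5):
    """Return (title, score) for the most similar candidate, or None below threshold."""
    query = word_ngrams(query_title)
    scored = [(jaccard(query, word_ngrams(c)), c) for c in candidate_titles]
    if not scored:
        return None
    score, title = max(scored)
    return (title, score) if score >= threshold else None

candidates = [
    "Evaluation of storm loadings on and capacities of offshore platforms",
    "A New Metric for Banking Integration in Europe",
]
match = best_match("Evaluation of Storm Loadings and Capacities of Offshore Platforms", candidates)
```

Here the query title differs from the first candidate by one word, so 7 of the 10 distinct bigrams are shared and the pair is accepted; a title with no shared bigrams returns `None`.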
  • 20. Entity Matching with Machine Learning • Document representation – metadata (title, authors, year, abstracts) + citations (a.k.a. references) • ML + search [Figure labels: (noisy data); (external data); (indexed external metadata); (indexed external citation graph); 1: matching by header metadata; 2: matching by citations] [Sefid et al. IAAI’19]
  • 21. Entity Matching Evaluation • Ground truth:
      External data     Positive matching pairs
      IEEE Xplore       51 (metadata only)
      DBLP              292 (metadata only)
      Web of Science    345 (with citations)
      Combined          688
    • Outperforms search-based method by 14% in precision! • Best performance using Web of Science ground truth: – match by metadata: 92.2% F1-measure – match by metadata + citation: 99.2% F1-measure [Wu et al. K-CAP ’17, Sefid et al. IAAI ’19]
  • 22. Entity Matching Application • Applications – Data cleansing: cleanse metadata and citation graph – Metadata of 50% of CiteSeerX papers is cleansable using the Web of Science, Medline, or DBLP databases [Wu et al. Big Data’18] • Example – correct title: A New Metric for Banking Integration in Europe – incorrect title: A New Metric for Banking Integration in Europe 1 – correct authors: Jian Wu, Allen C. Ge, C. Lee Giles – incorrect authors: Jian Wu1, Allen C. Ge1, C. Lee Giles1,2 IST, Penn State University
  • 23. 4. Domain Knowledge Entity Extraction • A Domain Knowledge Entity is a phrase representing domain knowledge in an academic document. • Noun phrases • NOT just keyphrases, though keyphrases CAN BE domain knowledge entities [Wu et al. SIGMOD-SBD’16; Wu et al. JCDL’17] • Example passage: “Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.” (Stanford NER tag shown on the slide: ORGANIZATION)
  • 24. Training/Testing Datasets • SemEval 2017 Task 10, dual-labeled – 400 documents, 350 training, 50 testing • Each document is a passage from a journal article in ScienceDirect in Computer Science, Physics, and Materials Science • Challenge: must extract the exact phrase span (positions of characters) • Example passage: “Evolutionary Algorithms are the stochastic optimization methods, simulating the behavior of natural evolution.” – Evolutionary Algorithms: 1-23 – stochastic optimization methods: 32-63 – natural evolution: 92-109
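The span requirement can be illustrated with a short sketch that locates a phrase in the example passage. The function name `find_span` and the offset convention (0-indexed start, exclusive end) are assumptions; under that convention the sketch reproduces the second and third spans on the slide, while the first phrase comes out as 0-23.

```python
def find_span(passage: str, phrase: str):
    """Return (start, end) character offsets of the first occurrence, or None."""
    start = passage.find(phrase)
    if start == -1:
        return None
    # 0-indexed start, exclusive end (assumed convention).
    return (start, start + len(phrase))

passage = ("Evolutionary Algorithms are the stochastic optimization methods, "
           "simulating the behavior of natural evolution.")

spans = {
    phrase: find_span(passage, phrase)
    for phrase in ("Evolutionary Algorithms",
                   "stochastic optimization methods",
                   "natural evolution")
}
```

An extractor that returns “optimization methods” instead of “stochastic optimization methods” would produce a different span and count as a miss under exact-span evaluation, which is what makes the task hard.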
  • 25. Extractor architecture [Diagram labels: Training passages; External text corpora; Test passage; Preprocessing; Noun Phrase (NP) Chunking; SVM Classifier; CRF Model; NP_N; NP_S; NP_U; Rule-based Filters; Entities; sequential; non-sequential]
  • 26. Entity Extractor Performance
      Approach                                                    Precision   Recall   F1
      NP-chunking                                                 26%         54%      35%
      NP-chunking + SVM classifier                                42%         40%      41%
      CRF                                                         46%         33%      39%
      CRF + NP-chunking + SVM classifier                          39%         56%      46%
      CRF + NP-chunking + SVM classifier + rule-based filters     46%         56%      50%
      Winner (without B-LSTM)                                     -           -        50%
      Winner (with B-LSTM)                                        -           -        54%
      [Wu et al. JCDL’17; Ammar et al. 2017]
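For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The helper below is a generic sketch (the name `f1_score` is ours, not from the talk). Because the table reports rounded percentages, values recomputed from the rounded precision and recall can differ from the reported F1 by about a point in some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# NP-chunking row: precision 26%, recall 54%
np_chunking_f1 = f1_score(0.26, 0.54)
# CRF + NP-chunking + SVM classifier row: precision 39%, recall 56%
combined_f1 = f1_score(0.39, 0.56)
```

The NP-chunking row shows why F1 is used: its high recall (54%) is offset by low precision (26%), and the harmonic mean stays close to the weaker of the two.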
  • 27. Ongoing Work • Subject Category Classification – Motivation: support facet search of scholarly big data [Figure: facet search on Amazon]
  • 28. Problem Formalization • Multiclass Classification • Final goal: 252 Subject Categories (Web of Science Schema) • Preliminary study: 6 subject categories, with counts in the order shown on the slide: Physics (1.10M), Chemistry (1.09M), Biology (456k), Materials Science (260k), Computer Science (169k), Others (150k) [Figure: “MLP vs. Classic ML Classifiers”, showing Micro-F1 and test time (sec) for LR, RF, MNB, SVM, and MLP] [Wu et al. BigData’18]
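The figure on this slide reports Micro-F1. As a reminder of what that metric pools, here is a minimal sketch for the single-label multiclass case; the function name and the toy labels are illustrative only. When every document has exactly one true and one predicted category, each error counts as one false positive and one false negative, so micro-F1 equals plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multiclass predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # Each wrong prediction is a false positive for the predicted class
    # and a false negative for the true class.
    fp = fn = len(y_true) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["Physics", "Chemistry", "Biology", "Physics"]
y_pred = ["Physics", "Biology", "Biology", "Chemistry"]
score = micro_f1(y_true, y_pred)
```

With imbalanced categories like those in the preliminary study, micro-averaging weights every document equally, so large categories dominate the score.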
  • 29. Collaborators • NSF CRI: Towards sustainable support of scholarly big data – Co-PI, $770K, PI: C. Lee Giles (Penn State) • Keyphrase Extraction - Cornelia Caragea (UIC) • ETD Mining - Edward A. Fox (VTech) • Math IR - Richard Zanibbi (RIT) • Citation Parsing – Min-Yen Kan (NUS)

Editor's Notes

  1. In Williams & Giles ’14 (DocEng), only 20 matching pairs were used. Use examples and illustrations.