In this paper, we suggest how mathematical formulae could be cited and define a Formula Concept Retrieval challenge with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). While the former aims at the definition and exploration of a Formula Concept that names bundled equivalent representations of a formula, the latter is designed to match a given formula to a previously assigned concept ID.
Towards Formula Concept Discovery and Recognition
1. Towards Formula Concept Discovery and Recognition
Philipp Scharpf (1), Moritz Schubotz (2,4), Howard Cohl (3), and Bela Gipp (1,4)
(1) University of Konstanz
(2) FIZ Karlsruhe
(3) National Institute of Standards and Technology
(4) University of Wuppertal
Short paper, BIRNDL@SIGIR 2019
2. Outline
• Motivation
• Formula Concept Definition
• Formula Concept Retrieval Challenge
• Our Approach
• Our Results
• Future Work
29.07.2019 Scharpf, Schubotz, Cohl, Gipp – Towards Formula Concept Discovery and Recognition 2
4. Formula Concept
Collection of equivalent formulae (same mathematical statement)
with different representations
5. Formula Concept Retrieval Challenge
Goal:
Map all of the various representations of a formula
to a unique and open concept ID, e.g., Wikidata item Q868967
6. Formula Concept Retrieval Challenge
Subtasks:
• Formula Concept Discovery (FCD): a method to find common equivalent representations and a name candidate for a given formula
• Formula Concept Recognition (FCR): an approach to recognize formulae in documents as instances of a previously and openly defined Formula Concept
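As a rough illustration of the FCR subtask, recognition can be pictured as a lookup from a normalized formula string to a concept ID. This is only a sketch under assumptions: the normalization rule, the example LaTeX strings, and the placeholder ID `Q_EXAMPLE` are hypothetical and are not part of the challenge definition.

```python
import re

def normalize(latex: str) -> str:
    # Crude normalization (an assumption): drop the sizing commands
    # "\left" and "\right", then remove all whitespace.
    latex = re.sub(r"\\(?:left|right)(?![A-Za-z])", "", latex)
    return re.sub(r"\s+", "", latex)

# Hypothetical concept table: several representations share one concept ID.
CONCEPTS = {
    normalize(r"E = m c^2"): "Q_EXAMPLE",
    normalize(r"E = c^2 m"): "Q_EXAMPLE",
    normalize(r"m = \frac{E}{c^2}"): "Q_EXAMPLE",
}

def recognize(latex: str):
    # Return the concept ID for a known representation, otherwise None.
    return CONCEPTS.get(normalize(latex))
```

An exact-match table like this only recognizes representations that were listed beforehand, which is why the discovery subtask (finding the equivalent representations in the first place) is needed as well.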
7. Our Approach: Data Selection
• NTCIR-11 arXiv dataset: ~60 M formulae, 104,000 documents
• astro-ph subject class: 680 docs
• Formula length range 10-30 chars, discard stubs*: 3495 formulae
• 50 samples
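The selection steps above can be sketched as a length filter followed by sampling. This is a minimal sketch, not the authors' pipeline: it assumes length is counted in characters of the formula string with inclusive bounds, that the 50 samples are drawn at random, and that "stubs" are simply formulae below the minimum length (the slide does not define them). The function name and parameters are hypothetical.

```python
import random

def select_formulae(formulae, min_len=10, max_len=30, n_samples=50, seed=0):
    # Keep only formulae whose string length lies in [min_len, max_len].
    kept = [f for f in formulae if min_len <= len(f) <= max_len]
    # Draw a reproducible random sample (assumed sampling strategy).
    rng = random.Random(seed)
    sample = rng.sample(kept, min(n_samples, len(kept)))
    return kept, sample
```

Calling `select_formulae(all_formulae)` returns both the filtered list and the sample, so the two counts on the slide (3495 formulae, 50 samples) correspond to `len(kept)` and `len(sample)`.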
8. Our Approach: Data Processing
Targets: equivalent formula representations, formula name candidates
• Math features: MathML <mi> / <mo> tags
• Semantic features: surrounding text, word window ± 500 chars
• Encodings: TfidfVectorizer, Word2Vec
• Retrieval: k-nearest neighbors, top 5
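The math branch of this processing can be sketched as follows: collect the `<mi>` / `<mo>` token texts from MathML, weight them with tf-idf, and return the top-5 nearest neighbors by cosine similarity. This is a standard-library stand-in under assumptions, not the authors' code: the slide names scikit-learn's `TfidfVectorizer`, whose smoothed weighting differs from the plain `tf * log(N / df)` used here, and the Word2Vec encoding is not shown. All function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def math_tokens(mathml: str):
    # Collect the text of <mi> and <mo> elements in document order.
    tokens = []
    for el in ET.fromstring(mathml).iter():
        tag = el.tag.rsplit("}", 1)[-1]  # strip a '{namespace}' prefix if present
        if tag in ("mi", "mo") and el.text and el.text.strip():
            tokens.append(el.text.strip())
    return tokens

def tfidf_vectors(docs):
    # Plain tf-idf: term count times log(N / document frequency).
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    # Cosine similarity of two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest(query_idx, vectors, k=5):
    # Indices of the k most similar vectors, excluding the query itself.
    scores = [(cosine(vectors[query_idx], v), i)
              for i, v in enumerate(vectors) if i != query_idx]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))[:k]]
```

The same `tfidf_vectors` and `nearest` functions could be applied to word tokens from the surrounding text window to obtain the semantic variant.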
9. Our Results: 50 Samples
…
10. Our Results: Equivalent Representations
…
math2vec : 70%
semantics tf-idf : 15%
semantics2vec : 11%
math tf-idf : 4%
Overall, for 34/50 = 68% of the sample formulae,
we could retrieve equivalent representations
11. Our Results: Name Candidates
…
We achieve a recall of 36/50 = 72% for the formula name recommendations
For 41/50 = 82% of the retrieved name candidates, there was a Wikidata QID available
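The reported percentages follow directly from the counts over the 50 samples; the short check below recomputes them. The helper name `share` is hypothetical, and the dictionary merely restates the figures from the results slides.

```python
def share(hits: int, total: int) -> int:
    # Percentage of hits among total, rounded to the nearest integer.
    return round(100 * hits / total)

# Counts as stated on the results slides (out of 50 sample formulae).
results = {
    "equivalent representations retrieved": share(34, 50),
    "name recommendation recall": share(36, 50),
    "name candidates with Wikidata QID": share(41, 50),
}

# Per-method shares of the retrieved equivalent representations, in percent.
method_shares = {"math2vec": 70, "semantics tf-idf": 15,
                 "semantics2vec": 11, "math tf-idf": 4}
```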
12. Future Work
• Manual examination of formula instances to further develop
the definition of a Formula Concept
• Formula Concepts Database
• Formula Concept Discovery by Recognition
• Generation of augmented formula representations using
Computer Algebra Systems (CAS)
• Formula name and identifier annotation recommender system
13. Future Work
Wikipedia & arXiv
14. Einstein's Field Equations (Wikipedia): 11
15. Einstein's Field Equations (arXiv): 77
17. Future Work
AnnoMathTeX
18. Work In Progress: AnnoMathTeX