JCDL 2018 slides for the full paper "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context". Presented by André Greiner-Petter.
Find the associated paper here: https://www.gipp.com/wp-content/papercite-data/pdf/schubotz2018.pdf
Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context
1. Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context
Moritz Schubotz¹, André Greiner-Petter*¹, Philipp Scharpf*¹, Norman Meuschke¹, Howard S. Cohl², Bela Gipp¹
June 5, 2018
¹University of Konstanz, Germany
²National Institute of Standards and Technology, USA
*sponsored by SIGIR Student Travel Grant
7. Formats of Mathematical Formulae
Riemann Zeta Function
Rendered Version:
ζ(s) = 0 ⇒ ℜ s = 1/2 ∨ ℑ s = 0
LaTeX (18 tokens with max depth of 2):
\zeta(s) = 0 \Rightarrow \Re s = \frac12 \lor \Im s=0
Mathematica (16 tokens with max depth of 5):
Implies[
  Equal[Zeta[s], 0],
  Or[
    Equal[Re[s], Rational[1, 2]],
    Equal[Im[s], 0]
  ]
]
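The token counts and nesting depths quoted on this slide can be checked with a short script. This is a minimal sketch, not code from the paper: the nested-tuple encoding of the Mathematica expression, the regular-expression tokenizer for LaTeX, and the names `TREE`, `count_tokens`, `max_depth`, and `latex_tokens` are assumptions made for illustration. Leaves are counted as depth 1.

```python
import re

# Hypothetical encoding of the Mathematica expression:
# an inner node is (head, [children]); a leaf is a plain string.
TREE = ("Implies", [
    ("Equal", [("Zeta", ["s"]), "0"]),
    ("Or", [
        ("Equal", [("Re", ["s"]), ("Rational", ["1", "2"])]),
        ("Equal", [("Im", ["s"]), "0"]),
    ]),
])

LATEX = r"\zeta(s) = 0 \Rightarrow \Re s = \frac12 \lor \Im s=0"


def count_tokens(node):
    """Count every head and every leaf in the expression tree."""
    if isinstance(node, str):
        return 1
    _, children = node
    return 1 + sum(count_tokens(child) for child in children)


def max_depth(node):
    """Depth of the deepest leaf, with the root at depth 1."""
    if isinstance(node, str):
        return 1
    _, children = node
    return 1 + max(max_depth(child) for child in children)


def latex_tokens(source):
    """Split LaTeX into control sequences and single non-space characters."""
    return re.findall(r"\\[A-Za-z]+|\S", source)


if __name__ == "__main__":
    print(count_tokens(TREE), max_depth(TREE))  # Mathematica: tokens, depth
    print(len(latex_tokens(LATEX)))             # LaTeX: tokens
```

Under these counting conventions the script reproduces the 16 tokens and depth 5 for the Mathematica form and the 18 tokens for the LaTeX string. The LaTeX form is nearly flat, while the Mathematica form makes the operator hierarchy explicit.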
13. Formats of Mathematical Formulae - MathML
A Combined Format
Find another format that (1) provides presentation and semantic information, (2) is easy to parse, and (3) is extendible.
⇒ Mathematical Markup Language 3.0.
Part of the MathML for ζ(s) = 0 ⇒ ℜ s = 1/2 ∨ ℑ s = 0:
<math><semantics><mrow>...
<mo id="5" xref="20">=</mo>
<mn id="5" xref="21">0</mn>
<mo id="7" xref="19">⇒</mo>...</mrow>
<annotation-xml encoding="MathML-Content">
<apply><implies id="19" xref="7"/>
<apply><eq id="20" xref="5"/>
<apply><csymbol id="21" xref="1">ζ</csymbol>...
</annotation-xml></semantics></math>
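The `id`/`xref` attributes are what link each presentation element to its content counterpart. The sketch below shows one way to follow those links with Python's standard `xml.etree.ElementTree`. It uses a small, well-formed a+b sample written for this illustration (the excerpt on the slide is elided and cannot be parsed as-is); the ids `p1`..`p3` and `c1`..`c3` and the function name `cross_references` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical parallel-markup sample for a + b (not taken from the slides).
SAMPLE = """
<math>
  <semantics>
    <mrow>
      <mi id="p1" xref="c1">a</mi>
      <mo id="p2" xref="c2">+</mo>
      <mi id="p3" xref="c3">b</mi>
    </mrow>
    <annotation-xml encoding="MathML-Content">
      <apply>
        <plus id="c2" xref="p2"/>
        <ci id="c1" xref="p1">a</ci>
        <ci id="c3" xref="p3">b</ci>
      </apply>
    </annotation-xml>
  </semantics>
</math>
"""


def cross_references(mathml):
    """Return (presentation tag, text, content tag) for each linked element."""
    root = ET.fromstring(mathml)
    # Index every element that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    presentation = root.find("semantics/mrow")
    pairs = []
    for el in presentation.iter():
        target = by_id.get(el.get("xref"))
        if target is not None:
            pairs.append((el.tag, (el.text or "").strip(), target.tag))
    return pairs


if __name__ == "__main__":
    for row in cross_references(SAMPLE):
        print(row)
```

For the sample, the `+` operator in the presentation tree resolves to the `plus` element in the content tree, which is the kind of lookup a converter or evaluator needs when comparing the two subtrees.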
14. Contributions
Our Contributions
We present the following three main contributions:
1. MathMLben: a benchmark for MathML,
2. an evaluation of state-of-the-art translation tools,
3. a new approach that considers textual context.
19. Create MML Benchmark Dataset mathmlben.wmflabs.org
[Figure: workflow diagram. Recoverable labels: Documents; Formula (a+b); Semantic Formula (a+b); Random Selection; Manual Refinements; Multiple Cycles; Gold Standard MML; LaTeXML; Annotated MML; MML Comparison; Converter; POM-Tagger; Dictionaries; Mathematical Language Processor; Identifier & Definiens; Tree Refinements; Switch; XML; example entry MathMLben#422f2d908725a379336f2c6083c5b6edf69157ca.]
30. Create MML Benchmark Dataset mathmlben.wmflabs.org
Annotated MathML
Content MathML does not provide enough semantic information. We overcome this issue by manually annotating content MathML with Wikidata IDs. We used special TeX macros to
1) annotate identifiers with Wikidata IDs, and
2) manipulate the expression tree.
Original LaTeX: W(2, k)
LaTeXML Input: \wf{Q7913892}{W}(2, \w{Q12503}{k})
MathML Output:
<apply id="p1.1.m1.1.13.1.1.cmml" xref="p1.1.m1.1.13.1.2">
  <csymbol cd="wikidata" id="..." xref="...">Q7913892</csymbol>
  <cn type="integer" id="..." xref="...">2</cn>
  <csymbol cd="wikidata" id="..." xref="...">Q12503</csymbol>
</apply>
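The round trip on this slide can be illustrated in a few lines. The sketch below is an assumption-laden illustration, not the authors' tooling: it assumes the annotation macros have the two-argument shape `\w{QID}{symbol}` and `\wf{QID}{symbol}` (backslashes restored from the slide text), and it uses a copy of the MathML output with the elided `id`/`xref` attributes simply left out. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

ANNOTATED = r"\wf{Q7913892}{W}(2, \w{Q12503}{k})"

# Content MathML from the slide, without the elided id/xref attributes.
CONTENT = """
<apply>
  <csymbol cd="wikidata">Q7913892</csymbol>
  <cn type="integer">2</cn>
  <csymbol cd="wikidata">Q12503</csymbol>
</apply>
"""

# Assumed macro shape: \w or \wf, a Wikidata QID, then the rendered symbol.
MACRO = re.compile(r"\\wf?\{(Q\d+)\}\{([^{}]*)\}")


def strip_annotations(latex):
    """Recover the plain LaTeX by keeping only the rendered symbol."""
    return MACRO.sub(r"\2", latex)


def annotated_ids(latex):
    """List (symbol, QID) pairs found in the annotated LaTeX."""
    return [(m.group(2), m.group(1)) for m in MACRO.finditer(latex)]


def wikidata_ids(content_mathml):
    """Collect QIDs from csymbol elements in the wikidata dictionary."""
    root = ET.fromstring(content_mathml)
    return [el.text.strip() for el in root.iter("csymbol")
            if el.get("cd") == "wikidata"]


if __name__ == "__main__":
    print(strip_annotations(ANNOTATED))
    print(annotated_ids(ANNOTATED))
    print(wikidata_ids(CONTENT))
```

The QIDs extracted from the LaTeX input and from the MathML output agree, which is the property the annotation scheme relies on.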
31. Create MML Benchmark Dataset mathmlben.wmflabs.org
MathMLben Collection
We annotated 305 formulae in total.
• 1 to 100: randomly sampled from Wikipedia; used for the "National Institute of Informatics Testbeds and Community for Information access Research Project" (NTCIR) 11.
• 101 to 200: randomly sampled from the sources of the NIST Digital Library of Mathematical Functions (contains 9,897 labeled formulae).
• 201 to 305: 70% from the NTCIR arXiv and 30% from the NTCIR-12 Wikipedia datasets.
All data is available at https://mathmlben.wmflabs.org/.
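The composition of the collection can be written down as a small lookup table. This is only a sketch of the ID ranges listed above; the `Block` type and the `source_of` and `total` helpers are hypothetical names, and the sketch does not know which individual entries in the 201 to 305 block come from arXiv and which from Wikipedia.

```python
from typing import NamedTuple


class Block(NamedTuple):
    first: int
    last: int
    source: str


# ID ranges as listed on the slide.
BLOCKS = [
    Block(1, 100, "Wikipedia (NTCIR-11)"),
    Block(101, 200, "NIST DLMF"),
    Block(201, 305, "NTCIR arXiv (70%) / NTCIR-12 Wikipedia (30%)"),
]


def source_of(formula_id: int) -> str:
    """Return the source collection for a benchmark entry ID."""
    for block in BLOCKS:
        if block.first <= formula_id <= block.last:
            return block.source
    raise ValueError(f"no benchmark entry with ID {formula_id}")


def total() -> int:
    """Number of entries covered by all blocks."""
    return sum(block.last - block.first + 1 for block in BLOCKS)


if __name__ == "__main__":
    print(total())
    print(source_of(101))
```

The three ranges add up to the 305 annotated formulae, and entry 101 falls in the DLMF block.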
41. Benchmarking Conversion Tools - Accuracy
Tested Conversion Tools
1. LaTeXML: Perl tool used to create the DLMF
2. LaTeX2MathML: small Python project
3. Mathoid: service that can also generate SVG and PNG
4. SnuggleTeX: Java library developed at the University of Edinburgh
5. MathToWeb: Java web application
6. TeXZilla: JavaScript web application
7. Mathematical: Ruby application that can generate SVG/PNG
8. CAS: Computer Algebra System that is capable of parsing LaTeX
9. Part-Of-Math (POM) Tagger: grammar-based LaTeX parser that was used to perform translations from LaTeX to CAS.
43. Benchmarking Conversion Tools - Accuracy
[Figure: "Average of Structural Distances & Successfully Parsed Expressions". Axes: Successfully Parsed Expressions (0 to 300) and Tree Edit Distance (0 to 80). Series: Average Distance of Presentation Subtree; Average Distance of Content Subtree; Successfully Parsed LaTeX Expressions. Data labels in extraction order: 305, 305, 295, 305, 288, 229, 290, 305, 305.]
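The chart reports tree edit distances between each tool's output and the gold standard. To give a feel for what such a distance measures, the sketch below implements a simplified top-down (Selkow-style) edit distance on ordered labeled trees. This is a stand-in chosen for brevity and is not necessarily the algorithm behind the chart; the `(label, [children])` tree encoding, the unit costs, and the example trees are assumptions.

```python
def size(tree):
    """Number of nodes in a (label, [children]) tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def distance(a, b):
    """Top-down edit distance: relabel costs 1, and inserting or
    deleting a child costs the size of its whole subtree."""
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    m, n = len(xs), len(ys)
    # dp[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                   # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                   # insert subtree
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # match children
            )
    return relabel + dp[m][n]


# Hypothetical content trees: gold standard a + b versus two tool outputs.
GOLD = ("apply", [("plus", []), ("ci:a", []), ("ci:b", [])])
WRONG_OPERATOR = ("apply", [("times", []), ("ci:a", []), ("ci:b", [])])
MISSING_OPERAND = ("apply", [("plus", []), ("ci:a", [])])

if __name__ == "__main__":
    print(distance(GOLD, GOLD))
    print(distance(GOLD, WRONG_OPERATOR))
    print(distance(GOLD, MISSING_OPERAND))
```

A distance of 0 means the tool reproduced the gold tree exactly; each wrong label or missing node adds to the score, so lower bars in the chart indicate output closer to the gold standard.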
46. Approach to Improve Conversion Tools
[Figure: workflow diagram. Recoverable labels: Documents; Formula (a+b); Semantic Formula (a+b); Random Selection; Manual Refinements; Multiple Cycles; Gold Standard MML; LaTeXML; Annotated MML; MML Comparison; Converter; POM-Tagger; Dictionaries; Mathematical Language Processor; Identifier & Definiens; Tree Refinements; Switch; XML; example entry MathMLben#422f2d908725a379336f2c6083c5b6edf69157ca.]
50. Approach to Improve Conversion Tools
Example for item 101 mathmlben.wmflabs.org/101
A function f(x, y) is continuous at a point (a, b) if
  \lim_{(x,y) \to (a,b)} f(x, y) = f(a, b), (1)
that is, for every arbitrarily small positive constant ε there exists δ (> 0) such that
  |f(a + \alpha, b + \beta) - f(a, b)| < \epsilon, (2)
for all α and β that satisfy |α|, |β| < δ.
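The point of this example is that the surrounding sentence already says what the identifiers mean. As a toy illustration of that idea only, the sketch below pairs each single-letter identifier with the noun that directly precedes it. This nearest-preceding-noun heuristic, the tiny `NOUNS` dictionary, and the function name are all assumptions; they are far simpler than the Mathematical Language Processor named in the diagram and are not the authors' method.

```python
import re

SENTENCE = "A function f(x, y) is continuous at a point (a, b)"

# Hypothetical mini-dictionary of nouns that can describe an identifier.
NOUNS = {"function", "point", "constant"}


def definiens_candidates(text):
    """Map single-letter identifiers to the noun directly before them."""
    # Runs of letters only, so punctuation and brackets act as separators.
    words = re.findall(r"[^\W\d_]+", text)
    candidates = {}
    for previous, current in zip(words, words[1:]):
        if previous.lower() in NOUNS and len(current) == 1:
            # Keep the first description found for each identifier.
            candidates.setdefault(current, previous.lower())
    return candidates


if __name__ == "__main__":
    print(definiens_candidates(SENTENCE))
```

For the sample sentence the heuristic links f to "function" and a to "point". A converter could use such hints to decide, for instance, that f(x, y) is a function application rather than a multiplication.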