Mathematical Language Processing via Tree Embeddings

Jack Wang, Andrew Lan, Richard Baraniuk
June 15, 2021

Slide Transcript

  1. Mathematical Language Processing via Tree Embeddings. Jack Wang, Andrew Lan, Richard Baraniuk. June 15, 2021.
  2. Mathematical Language Is Everywhere: textbooks, academic papers, Wikipedia articles. It is difficult to extract and synthesize information from such massive content. How do we efficiently find relevant mathematical content?
  3. The Mathematical Content Retrieval Problem. It is difficult to extract and synthesize information from massive content. Desired: an efficient, automated system to aid indexing, searching, and organizing mathematical content. We focus on formula retrieval: search for and retrieve similar equations, given a query equation.
  4. The Mathematical Content Retrieval Problem. Current search engines lack the ability to effectively search for mathematical content (example: a machine learning query).
  5. The Mathematical Content Retrieval Problem. Given a query equation from a machine learning textbook, the search results contain only the specific characters that match the input query, but NOT the entire equation.
  6. The Mathematical Content Retrieval Problem. Desired retrieval: results that match the entire query equation.
  7. Our Solution: Formula Representation via Tree Embeddings. A novel framework that learns a good representation of mathematical formulae, based on the encoder-decoder architecture: a novel encoding scheme (equation as a tree) and a novel decoding scheme (generate the equation as a tree). Pipeline: formula, encoder, formula embedding, decoder, reconstructed formula; minimize the reconstruction loss. (A minimal formula-as-tree sketch follows this transcript.)
  8. Our Solution, Part 1: Equation Encoding. Explicitly capture the semantic and syntactic information in an equation with a tree encoder (GRU).
  9. Our Solution, Part 1: Equation Encoding. The encoder (GRU) produces the formula embedding that we will use in the formula retrieval task.
  10. Our Solution, Part 1: Equation Encoding. After the encoding step: decode to recover the input formula tree, using the formula embedding, and apply tree beam search to improve reconstruction quality.
  11. Formula Retrieval Experiment. 18 query formulae; train (and search) on 770k equations. Compute the embeddings of all equations and queries, compute the cosine similarity between all equations and each query, and, for each query, choose the top 25 most relevant equations. Human evaluation: compute the percentage of relevant equations for each query. (A retrieval sketch follows this transcript.)
  12. Formula Retrieval Experiment (example queries).
  13. Formula Retrieval: Main Results. Our method outperforms the data-driven baseline.
  14. Formula Retrieval: Main Results. Our method achieves state-of-the-art performance when combined with Approach0.
  15. Formula Retrieval: Examples. Our method retrieves structurally and semantically more similar formulae.
  16. Learnt Formula Representation: t-SNE Example. Our method learns good representations of different formulae.
  17. Summary. A framework to process equations via tree embeddings: a novel encoder, decoder, and beam search; state-of-the-art formula retrieval performance; applications to textbook math content search and beyond. Future work: joint math and text processing; deployment and a pilot study at OpenStax; open-ended math solution feedback. Zhang et al., "Math Operation Embeddings for Open-ended Solution Analysis and Feedback," to appear at EDM '21: https://arxiv.org/abs/2104.12047
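
The sketch below illustrates the formula-as-tree idea from slide 7: every formula can be parsed into an operator tree, and a tree encoder consumes that structure rather than a flat character string. This is a minimal, hypothetical illustration, not the authors' implementation; the `Node` class and traversal are stand-ins for the paper's actual tree encoder.

```python
# Minimal sketch of representing a formula as an operator tree (hypothetical
# names, not the authors' code). A tree-structured GRU encoder would recurse
# over this structure and merge child hidden states; here we just build the
# tree for x^2 + 1 and flatten it in pre-order.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                          # operator or operand, e.g. "+", "x", "2"
    children: List["Node"] = field(default_factory=list)

def preorder(node: Node) -> List[str]:
    """Flatten the tree in pre-order, root before children."""
    out = [node.label]
    for child in node.children:
        out.extend(preorder(child))
    return out

# x^2 + 1  ->  (+ (^ x 2) 1)
tree = Node("+", [Node("^", [Node("x"), Node("2")]), Node("1")])
print(preorder(tree))                   # ['+', '^', 'x', '2', '1']
```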
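
Slide 11's retrieval step reduces to nearest-neighbor search under cosine similarity. The snippet below is a toy sketch of that step, assuming formula embeddings are plain NumPy vectors; the random data stands in for the 770k-equation corpus.

```python
# Hedged sketch of the retrieval step from slide 11: rank corpus embeddings
# by cosine similarity to a query embedding and keep the top 25. Shapes and
# names are illustrative assumptions, not the authors' code.

import numpy as np

def top_k_by_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 25):
    """query: (d,) vector; corpus: (n, d) matrix; returns the indices of the
    k most similar corpus rows and the full similarity vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                        # (n,) cosine similarities
    return np.argsort(-sims)[:k], sims

# Toy usage with random embeddings standing in for real formula embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
top_idx, sims = top_k_by_cosine(query, corpus, k=25)
print(top_idx[:5], sims[top_idx[:5]])
```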

Editor's Notes

  • Hello, my name is Jack Wang, and today I am going to present my project on mathematical language processing.
  • The question we focus on here is: how do we efficiently find relevant mathematical content?
  • In this talk, I will primarily focus on the problem of formula retrieval as a representative problem. Namely, given an equation, we would like to find the most relevant ones. You can think of this as a search engine such as Google, but one devoted to mathematical formulae.

    The ability to search for formulae is useful for a number of education-related applications. For example, a student might want to search for relevant assessment questions given a query question, or for relevant content in a textbook given a query formula.
  • Here is a concrete hypothetical example. Say you have a machine learning textbook and you are searching for relevant formulae given a query formula.

    Current search engines lack the ability to effectively search for formulae.
  • If you look at the retrieval results, you will find that they contain specific components that match the query, but not the entire formula.

    This observation suggests that we need a method that better captures the semantics of a math formula, so that a search engine can return the most relevant ones.
  • For example, this retrieval result is a good match to the query.
  • In this project, we present a solution from a representation learning perspective. The starting point is that we want to learn a good representation of math formulae, such that we can use this representation for the formula retrieval task.

    Our solution is a novel framework that processes math formulae in the form of trees. This is because every formula can inherently be represented as a tree structure, and by explicitly learning their tree representations, our framework retains the inherent properties of formulae and therefore improves retrieval performance.

    More specifically, the framework contains three key components. The first component is a tree encoder, which encodes the formula, in its tree format, into a vector representation, or embedding. The second component is a generator, which reconstructs the input formula tree. The third, a tree beam search procedure, I will describe shortly. The entire pipeline is optimized end-to-end by minimizing the reconstruction error between the input formula tree and the reconstructed formula tree.
  • As I mentioned earlier, this step allows us to explicitly capture the semantic and syntactic information in an equation.
  • This embedding is what we will use for the formula retrieval task.
  • To complete the pipeline, after the encoding step we use a decoder that reconstructs the input formula in its tree format. To improve reconstruction quality, we also develop a beam search algorithm specifically for tree-structured data. I'll skip the technical details, but you can find them in the paper.
  • We validate our framework on a formula retrieval task. In this task, we have 18 query formulae.
  • Here are some examples of queries. You can see that they are diverse in appearance and subject domain.
  • First, we observe that our method outperforms the other data-driven baseline on both metrics.
  • We therefore develop a new method that combines the strengths of both our method and Approach0. This combined method achieves state-of-the-art performance on the formula retrieval task.
  • We can see that our method retrieves equations that are semantically and structurally more similar to the query, whereas the tangentCFT baseline fails to do so in some cases.
  • I also want to visualize the learnt formula representations. Here, we choose a small number of formulae from different math topics and plot their two-dimensional t-SNE embeddings. (A toy t-SNE sketch follows these notes.)

    We can see that these embeddings form nice clusters, which indicates that our model learns meaningful representations of these formulae.
  • And finally, we can apply our method to analyze students' step-wise answers to open-ended math questions. We have a paper that will appear at the Educational Data Mining conference later this month; the arXiv version is already out. If you are interested, you are welcome to check out the paper and attend our talk at EDM to learn more. Thanks.
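
As a toy illustration of the t-SNE visualization discussed above, the sketch below projects a set of embeddings to 2-D with scikit-learn and colors points by topic. The embeddings and topic labels here are random stand-ins, so no clusters should be expected; with the real learnt formula embeddings, points from the same math topic would group together.

```python
# Sketch of the 2-D t-SNE plot from slide 16 (toy data, not the paper's
# embeddings): project embeddings with scikit-learn's TSNE and color by topic.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))  # placeholder formula embeddings
topics = rng.integers(0, 5, size=200)    # placeholder topic labels

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=topics, cmap="tab10", s=10)
plt.title("t-SNE of formula embeddings (toy data)")
plt.show()
```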
