- 1. Mathematical Language Processing via Tree Embeddings. Jack Wang, Andrew Lan, Richard Baraniuk. June 15, 2021
- 2. Mathematical Language Is Everywhere: textbooks, academic papers, Wikipedia articles. It is difficult to extract and synthesize information from such massive content. How can we efficiently find relevant mathematical content?
- 3. The Mathematical Content Retrieval Problem. It is difficult to extract and synthesize information from massive content. Desired: an efficient, automated system to aid indexing, searching, and organizing mathematical content. We focus on formula retrieval: search for and retrieve similar equations, given a query equation.
- 4. The Mathematical Content Retrieval Problem. Current search engines lack the ability to search effectively for mathematical content. Machine learning
- 5. The Mathematical Content Retrieval Problem. Current search engines lack the ability to search effectively for mathematical content. Example: a query equation in a machine learning textbook. Search results contain only specific characters that match the input query, NOT the entire equation.
- 6. The Mathematical Content Retrieval Problem. Desired retrieval
- 7. Our Solution: Formula Representation via Tree Embeddings. A novel framework that learns a good representation of mathematical formulae, based on the encoder-decoder architecture. ● A novel encoding scheme: equations as trees. ● A novel decoding scheme: generate equations as trees. Diagram: formula, encoder, formula embedding, decoder, reconstructed formula. Training minimizes the reconstruction loss.
- 8. Our Solution, part #1: Equation Encoding. Explicitly capture the semantic and syntactic information in an equation. Encoder (GRU)
- 9. Our Solution, part #1: Equation Encoding. Encoder (GRU). The formula embedding that we will use in the formula retrieval task.
- 10. Our Solution, part #1: Equation Encoding. Encoder (GRU). After the encoding step: decode to recover the input formula tree, using the formula embedding; use tree beam search to improve reconstruction quality.
- 11. Formula Retrieval Experiment. 18 query formulae. Train (and search) on 770k equations. Compute the embeddings of all equations and queries. Compute the cosine similarity between each query and every equation. For each query, choose the top 25 most relevant equations. Human evaluation: compute the percentage of relevant equations for each query.
- 12. Formula Retrieval Experiment
- 13. Formula Retrieval: Main Results. Our method outperforms the data-driven baseline.
- 14. Formula Retrieval: Main Results. Our method achieves state-of-the-art performance when combined with Approach0.
- 15. Formula Retrieval: Examples. Our method retrieves formulae that are structurally and semantically more similar to the query.
- 16. Learnt Formula Representation: t-SNE Example. Our method learns good representations of different formulae.
- 17. Summary. Framework to process equations via tree embeddings: novel encoder + decoder + beam search; state-of-the-art formula retrieval performance; application to textbook math content search and beyond. Future work: joint math and text processing; deployment and pilot study at OpenStax; open-ended math solution feedback. Zhang et al. Math Operation Embeddings for Open-ended Solution Analysis and Feedback. To appear @EDM’21. https://arxiv.org/abs/2104.12047
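
The retrieval procedure on slide 11 (embed everything, rank by cosine similarity, keep the top 25, then score the share of relevant results) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `cosine_similarity`, `retrieve_top_k`, and `percent_relevant`, the toy two-dimensional embeddings, and the equation IDs are all hypothetical. In the talk, the embeddings come from the tree encoder, and relevance is judged by human evaluators.

```python
import math
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 25,
) -> List[Tuple[str, float]]:
    """Rank every corpus embedding by similarity to the query and keep the best k."""
    scored = [(eq_id, cosine_similarity(query, emb)) for eq_id, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def percent_relevant(retrieved_ids: Iterable[str], relevant_ids: Set[str]) -> float:
    """Percentage of retrieved equations that were judged relevant."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    hits = sum(1 for eq_id in retrieved if eq_id in relevant_ids)
    return 100.0 * hits / len(retrieved)


# Toy data (hypothetical): three "equations" with made-up 2-D embeddings.
toy_corpus = {"eq_a": [1.0, 0.0], "eq_b": [0.0, 1.0], "eq_c": [1.0, 1.0]}
toy_query = [1.0, 0.1]

top = retrieve_top_k(toy_query, toy_corpus, k=2)
top_ids = [eq_id for eq_id, _ in top]
# Pretend a human judged only eq_a to be relevant to the query.
score = percent_relevant(top_ids, {"eq_a"})
print(top_ids, score)
```

With 770k equations, a vectorized or approximate nearest-neighbor search would likely replace the explicit loop. The slides do not say which was used.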