Mathematical Language Processing via Tree Embeddings

Jack Wang, Andrew Lan, Richard Baraniuk
June 15, 2021

Slide Transcript

  1. Mathematical Language Processing via Tree Embeddings. Jack Wang, Andrew Lan, Richard Baraniuk. June 15, 2021.
  2. Mathematical Language Is Everywhere: textbooks, academic papers, Wikipedia articles. It is difficult to extract and synthesize information from such massive content. How do we efficiently find relevant mathematical content?
  3. The Mathematical Content Retrieval Problem. It is difficult to extract and synthesize information from massive content. Desired: an efficient, automated system to aid indexing, searching, and organizing mathematical content. We focus on formula retrieval: search for and retrieve similar equations, given a query equation.
  4. The Mathematical Content Retrieval Problem. Current search engines lack the ability to effectively search for mathematical content (example: a machine learning query).
  5. The Mathematical Content Retrieval Problem. Given a query equation from a machine learning textbook, the search results contain only the specific characters that match the input query, but NOT the entire equation.
  6. The Mathematical Content Retrieval Problem. Desired retrieval: results that match the entire query equation.
  7. Our Solution: Formula Representation via Tree Embeddings. A novel framework that learns a good representation of mathematical formulae, based on the encoder-decoder architecture: a novel encoding scheme (equation as a tree) and a novel decoding scheme (generate the equation as a tree). Pipeline: formula, encoder, formula embedding, decoder, reconstructed formula; minimize the reconstruction loss. (A minimal formula-as-tree sketch follows this transcript.)
  8. Our Solution, Part 1: Equation Encoding. Explicitly capture the semantic and syntactic information in an equation with a tree encoder (GRU).
  9. Our Solution, Part 1: Equation Encoding. The encoder (GRU) produces the formula embedding that we will use in the formula retrieval task.
  10. Our Solution, Part 1: Equation Encoding. After the encoding step: decode to recover the input formula tree, using the formula embedding, and apply tree beam search to improve reconstruction quality.
  11. Formula Retrieval Experiment. 18 query formulae; train (and search) on 770k equations. Compute the embeddings of all equations and queries, compute the cosine similarity between all equations and each query, and, for each query, choose the top 25 most relevant equations. Human evaluation: compute the percentage of relevant equations for each query. (A retrieval sketch follows this transcript.)
  12. Formula Retrieval Experiment (example queries).
  13. Formula Retrieval: Main Results. Our method outperforms the data-driven baseline.
  14. Formula Retrieval: Main Results. Our method achieves state-of-the-art performance when combined with Approach0.
  15. Formula Retrieval: Examples. Our method retrieves structurally and semantically more similar formulae.
  16. Learnt Formula Representation: t-SNE Example. Our method learns good representations of different formulae.
  17. Summary. A framework to process equations via tree embeddings: a novel encoder, decoder, and beam search; state-of-the-art formula retrieval performance; applications to textbook math content search and beyond. Future work: joint math and text processing; deployment and a pilot study at OpenStax; open-ended math solution feedback. Zhang et al., "Math Operation Embeddings for Open-ended Solution Analysis and Feedback," to appear at EDM '21: https://arxiv.org/abs/2104.12047
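
The sketch below illustrates the formula-as-tree idea from slide 7: every formula can be parsed into an operator tree, and a tree encoder consumes that structure rather than a flat character string. This is a minimal, hypothetical illustration, not the authors' implementation; the `Node` class and traversal are stand-ins for the paper's actual tree encoder.

```python
# Minimal sketch of representing a formula as an operator tree (hypothetical
# names, not the authors' code). A tree-structured GRU encoder would recurse
# over this structure and merge child hidden states; here we just build the
# tree for x^2 + 1 and flatten it in pre-order.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                          # operator or operand, e.g. "+", "x", "2"
    children: List["Node"] = field(default_factory=list)

def preorder(node: Node) -> List[str]:
    """Flatten the tree in pre-order, root before children."""
    out = [node.label]
    for child in node.children:
        out.extend(preorder(child))
    return out

# x^2 + 1  ->  (+ (^ x 2) 1)
tree = Node("+", [Node("^", [Node("x"), Node("2")]), Node("1")])
print(preorder(tree))                   # ['+', '^', 'x', '2', '1']
```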
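
Slide 11's retrieval step reduces to nearest-neighbor search under cosine similarity. The snippet below is a toy sketch of that step, assuming formula embeddings are plain NumPy vectors; the random data stands in for the 770k-equation corpus.

```python
# Hedged sketch of the retrieval step from slide 11: rank corpus embeddings
# by cosine similarity to a query embedding and keep the top 25. Shapes and
# names are illustrative assumptions, not the authors' code.

import numpy as np

def top_k_by_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 25):
    """query: (d,) vector; corpus: (n, d) matrix; returns the indices of the
    k most similar corpus rows and the full similarity vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                        # (n,) cosine similarities
    return np.argsort(-sims)[:k], sims

# Toy usage with random embeddings standing in for real formula embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
top_idx, sims = top_k_by_cosine(query, corpus, k=25)
print(top_idx[:5], sims[top_idx[:5]])
```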

Editor's Notes

  • Hello, my name is Jack Wang, and today I am going to present my project on mathematical language processing.
  • The question we focus on here is: how do we efficiently find relevant mathematical content?
  • In this talk, I will primarily focus on the problem of formula retrieval as a representative problem. Namely, given an equation, we would like to find the most relevant ones. You can think of this as a search engine such as Google, but one devoted to mathematical formulae.

    The ability to search for formulae is useful for a number of education-related applications. For example, a student might want to search for relevant assessment questions given a query question, or for relevant content in a textbook given a query formula.
  • Here is a concrete hypothetical example. Say you have a machine learning textbook and you are searching for relevant formulae given a query formula.

    Current search engines lack the ability to effectively search for formulae.
  • If you look at the retrieval results, you will find that they contain specific components that match the query, but not the entire formula.

    This observation suggests that we need a method that better captures the semantics of a math formula, so that a search engine can return the most relevant ones.
  • For example, this retrieval result is a good match to the query.
  • In this project, we present a solution from a representation learning perspective. The starting point is that we want to learn a good representation of math formulae, such that we can use this representation for the formula retrieval task.

    Our solution is a novel framework that processes math formulae in the form of trees. This is because every formula can inherently be represented as a tree structure, and by explicitly learning their tree representations, our framework retains the inherent properties of formulae and therefore improves retrieval performance.

    More specifically, the framework contains three key components. The first component is a tree encoder, which encodes the formula, in its tree format, into a vector representation, or embedding. The second component is a generator, which reconstructs the input formula tree. The third, a tree beam search procedure, I will describe shortly. The entire pipeline is optimized end-to-end by minimizing the reconstruction error between the input formula tree and the reconstructed formula tree.
  • As I mentioned earlier, this step allows us to explicitly capture the semantic and syntactic information in an equation.
  • This embedding is what we will use for the formula retrieval task.
  • To complete the pipeline, after the encoding step we use a decoder that reconstructs the input formula in its tree format. To improve reconstruction quality, we also develop a beam search algorithm specifically for tree-structured data. I'll skip the technical details, but you can find them in the paper.
  • We validate our framework on a formula retrieval task. In this task, we have 18 query formulae.
  • Here are some examples of queries. You can see that they are diverse in appearance and subject domain.
  • First, we observe that our method outperforms the other data-driven baseline on both metrics.
  • We therefore develop a new method that combines the strengths of both our method and Approach0. This combined method achieves state-of-the-art performance on the formula retrieval task.
  • We can see that our method retrieves equations that are semantically and structurally more similar to the query, whereas the tangentCFT baseline fails to do so in some cases.
  • I also want to visualize the learnt formula representations. Here, we choose a small number of formulae from different math topics and plot their two-dimensional t-SNE embeddings. (A toy t-SNE sketch follows these notes.)

    We can see that these embeddings form nice clusters, which indicates that our model learns meaningful representations of these formulae.
  • And finally, we can apply our method to analyze students' step-wise answers to open-ended math questions. We have a paper that will appear at the Educational Data Mining conference later this month; the arXiv version is already out. If you are interested, you are welcome to check out the paper and attend our talk at EDM to learn more. Thanks.
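
As a toy illustration of the t-SNE visualization discussed above, the sketch below projects a set of embeddings to 2-D with scikit-learn and colors points by topic. The embeddings and topic labels here are random stand-ins, so no clusters should be expected; with the real learnt formula embeddings, points from the same math topic would group together.

```python
# Sketch of the 2-D t-SNE plot from slide 16 (toy data, not the paper's
# embeddings): project embeddings with scikit-learn's TSNE and color by topic.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))  # placeholder formula embeddings
topics = rng.integers(0, 5, size=200)    # placeholder topic labels

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=topics, cmap="tab10", s=10)
plt.title("t-SNE of formula embeddings (toy data)")
plt.show()
```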
