LoA (Librarian of Alexandria): An AI-Powered Linux-Python Tool for Comprehensive Extraction of Chemical Data

Apr. 11, 2023
Amidst the ever-expanding cosmos of scientific literature, the quest for tools that can deftly extract and analyze data from publications has grown increasingly vital. Enter LoA (Librarian of Alexandria), a Linux-Python tool that harnesses the power of artificial intelligence to scour chemistry-related papers, meticulously extracting invaluable textual measurements and data into structured Excel files.

This approach yields an expansive, high-quality dataset that can be used to train predictive models for assessing the properties of chemicals, proteins, and other compounds, with a predominant emphasis on the field of chemistry. LoA's AI algorithms decipher and extract essential information from intricate scientific documents. Although the tool currently focuses on extracting text, plans for future iterations include deciphering images and delving into deeper analysis. LoA's potential transcends the realm of chemistry, with prospective applications in diverse scientific disciplines such as materials science, biology, and pharmacology, by tailoring the extraction process to the domain at hand.

Like the ancient Library of Alexandria, LoA aspires to become a beacon of knowledge, a comprehensive instrument for navigating the vast ocean of human intellect. As it is presented at an undergraduate research symposium, this innovative tool embodies the potential for revolutionizing scientific research, opening doors to interdisciplinary applications, and illuminating the path to discoveries yet unimagined. Embracing the beauty of imperfection, LoA continues to evolve, a testament to the resilience and adaptability that characterizes the boundless realm of scientific inquiry.

  1. ‘LoA’ (Librarian of Alexandria): A tool for mass-searching pre-print servers, downloading PDFs of research papers, and then using AI to pull information from them. By: Morgan Grougan
  2. Background info: • There are millions of research papers in existence (an exact figure is hard to estimate). • Scraping pre-print servers is an existing technology. • Megatron-BERT AI models are trained to understand specific fields of interest and their relevant vocabulary. • Combining these capabilities, we can create tables of information that would otherwise be incredibly hard to collect. • Using these tables, we can train new models to predict the information we would otherwise be extracting (such as quantum yield and the frequency of the color fluoresced).
  3. Main screen, and scraping papers: The program asks the user if they would like to update the metadata. This must be done on the first run.
  4. Searching through downloaded metadata for relevant articles: The program now asks the user for search terms and checks the metadata of the servers for matching articles.
  5. Searching through Google Scholar for relevant articles: Here we are able to search Google Scholar by selecting ‘y’ and then typing in a search query.
  6. Downloading all normally available PDFs of search results: The program now asks the user if they would like to download all PDFs that are normally available for free. Note: this footage is from a different run of the program than the previous slide, which is why I did not search Google Scholar again (I had already done so).
  7. Downloading all PDFs available through Sci-Hub: The program now asks the user if they would also like to search Sci-Hub for PDFs of papers that would otherwise be unavailable for download. Note: It is illegal to release any papers obtained from Sci-Hub to the public.
  8. Fixing the formatting of the downloaded PDF files so plain text can be extracted: Here we go through each PDF that has been downloaded and fix its formatting so that it can be converted to plain text by the package PyPDF.
  9. Getting the information from huggingface.co to download our preferred model: Here I am just showing which part of the huggingface.co link the program needs in order to download the model of choice.
  10. Downloading our preferred model: Here you can see that, when prompted, we chose ‘d’ to download a model and pasted in the part of the huggingface.co link we grabbed.
  11. Running question answering using our model of choice to extract information: Here I show that, once a model is downloaded, you can simply select it from the list of available ones. After I make my choice, finally, it’s time to extract! Note: you can use [answer] in a question to insert the previous answer into the current question.
  12. Automation! This all takes a very long time, especially the question-answering step, so by calling ‘python main.py –auto’ the program will read a script file of my own design and run in a batchable, non-interactive way. The file structure I’ve made is straightforward and quite resistant to syntax errors. It simply looks for ‘task =’ in each line and runs the specified task using the variables specified below it, until it reaches a line containing ‘#end’. This can be batched an unlimited number of times in the same automation.txt file.
  13. Automation.txt:
  14. Planned features: • Expected answer type • For loop as a result of numerical answer • Functionalize parts of scraping + QA so that they can be accessed independently (such as downloading models) • Add support for more types of models (GPT, IR, T5, etc.)
  15. Planned features cont.: • Add data management screen for cleanup • More pre-print servers • A way for the user to input credentials for specific pre-print servers that only allow scraping with permission • OCR for text in images
  16. Current issues/limitations: • Scholar search kicks the user off because a captcha appears • Some downloaded PDFs must be re-formatted to remove HTML • Downloading PDFs and converting them to plain text is a lengthy step • Requires ~500 GB of RAM to run • Only works on Linux
  17. Thanks! • To Dr. Alice Walker for suggesting this as an area of research, and supporting my pursuit of it • To Dr. Mark Hix for helping me along as I learn Python • To anyone and everyone interested in using my program, and/or providing feedback and suggestions
  18. Download: github.com/MorganRO8/LoA Email: gi1632@wayne.edu
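
The [answer] placeholder mentioned on the question-answering slide can be illustrated with a minimal Python sketch. This is not LoA's actual code: the names fill_placeholder and run_chain are hypothetical, and the ask callable is an assumed stand-in for whichever question-answering model has been selected. The sketch only shows the idea of feeding each answer into the next question.

```python
from typing import Callable, List

# Token described in the slides; everything else here is illustrative.
PLACEHOLDER = "[answer]"


def fill_placeholder(question: str, previous_answer: str) -> str:
    """Replace every [answer] token with the previous answer."""
    return question.replace(PLACEHOLDER, previous_answer)


def run_chain(
    questions: List[str],
    context: str,
    ask: Callable[[str, str], str],
) -> List[str]:
    """Ask the questions in order against one paper's text.

    `ask(question, context)` is a hypothetical hook for the chosen
    question-answering model; it must return the answer as a string.
    """
    answers: List[str] = []
    previous = ""  # the first question has no previous answer
    for question in questions:
        filled = fill_placeholder(question, previous)
        previous = ask(filled, context)
        answers.append(previous)
    return answers
```

With this shape, a question such as "What is the quantum yield of [answer]?" is only filled in after the preceding question (for example, one asking which compound the paper studies) has been answered.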

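The automation file format described on the automation slide (a line containing ‘task =’, variables below it, and a closing line containing ‘#end’) could be read with something like the following sketch. It is an illustration under stated assumptions, not LoA's parser: the slides do not show how variables are written, so the "name = value" syntax, the function name parse_automation, and the task and variable names in EXAMPLE are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Task = Tuple[str, Dict[str, str]]

# Hypothetical file contents; task and variable names are made up.
EXAMPLE = """\
task = scrape
search_terms = quantum yield
#end
this stray line is ignored
task = extract
question = What is the quantum yield?
#end
"""


def parse_automation(text: str) -> List[Task]:
    """Collect (task name, variables) pairs from automation-style text."""
    tasks: List[Task] = []
    name: Optional[str] = None
    variables: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "#end" in line:
            # Close the current block, if one is open.
            if name is not None:
                tasks.append((name, variables))
            name, variables = None, {}
        elif "task =" in line:
            # Start a new block; an unfinished earlier block is dropped.
            name = line.split("=", 1)[1].strip()
            variables = {}
        elif name is not None and "=" in line:
            # Assumed variable syntax: "name = value".
            key, value = line.split("=", 1)
            variables[key.strip()] = value.strip()
        # Any other line is skipped, which keeps the format forgiving.
    return tasks


if __name__ == "__main__":
    for task_name, task_vars in parse_automation(EXAMPLE):
        print(task_name, task_vars)
```

Skipping unrecognized lines rather than raising an error is one way to get the tolerance for syntax mistakes that the slide describes, and several task blocks can follow one another in the same file.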
