2023 Undergraduate Research Symposium: Morgan Grougan
Amidst the ever-expanding cosmos of scientific literature, the quest for tools that can deftly extract and analyze data from publications has grown increasingly vital. Enter LoA (Librarian of Alexandria), a Linux-Python tool that harnesses the power of artificial intelligence to scour chemistry-related papers, meticulously extracting invaluable textual measurements and data into structured Excel files.
This approach engenders the formation of an expansive, high-quality dataset, poised to train predictive models in assessing the properties of chemicals, proteins, and other compounds, with a predominant emphasis on the field of chemistry. LoA's sophisticated AI algorithms artfully decipher and extract essential information from intricate scientific documents. Although the tool currently focuses on extracting text, plans for future iterations include deciphering images and delving into deeper analysis. LoA's potential transcends the realm of chemistry, with prospective applications in diverse scientific disciplines such as materials science, biology, and pharmacology, by tailoring the extraction process to the domain at hand.
Like the ancient Library of Alexandria, LoA aspires to become a beacon of knowledge, a comprehensive instrument for navigating the vast ocean of human intellect. As it is presented at an undergraduate research symposium, this innovative tool embodies the potential for revolutionizing scientific research, opening doors to interdisciplinary applications, and illuminating the path to discoveries yet unimagined. Embracing the beauty of imperfection, LoA continues to evolve, a testament to the resilience and adaptability that characterizes the boundless realm of scientific inquiry.
LoA (Librarian of Alexandria): An AI-Powered Linux-Python Tool for Comprehensive Extraction of Chemical Data
1. ‘LoA’
(Librarian of Alexandria)
A tool for mass-searching pre-print servers, downloading PDFs of
research papers, and then using AI to pull information from them.
By: Morgan Grougan
2. Background info:
• There are millions of research papers in existence (hard to estimate
an exact figure).
• Scraping pre-print servers is an existing technology.
• Megatron-BERT AI models are trained to understand specific fields of
interest, and the relevant vocab.
• Using these functions, we can create tables of information that would
otherwise be incredibly hard to collect.
• Using these tables, we can train new models to predict the
information we would otherwise be extracting (such as quantum
yield and the frequency of the fluoresced light).
3. Main screen, and scraping papers
The program asks the user whether they would like to update the metadata.
This must be done on the first run.
4. Searching through downloaded metadata for relevant articles
The program will now ask the user for search terms,
and check the metadata of the servers for matching
articles.
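A minimal sketch of this kind of metadata search, assuming each record is a dictionary with `title` and `abstract` fields. The field names and the `search_metadata` helper are hypothetical; LoA's actual metadata format is not shown in these slides.

```python
def search_metadata(records, terms):
    """Return records whose title or abstract contains every search term.

    Matching is case-insensitive. The 'title' and 'abstract' keys are
    assumed field names, not LoA's actual schema. With no usable terms,
    every record matches.
    """
    wanted = [t.strip().lower() for t in terms if t.strip()]
    matches = []
    for record in records:
        text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
        if all(term in text for term in wanted):
            matches.append(record)
    return matches


# Made-up records, for illustration only.
records = [
    {"title": "Quantum yield of a new fluorophore", "abstract": "Measured in solution."},
    {"title": "Protein folding kinetics", "abstract": "A simulation study."},
]
hits = search_metadata(records, ["quantum yield"])
print([r["title"] for r in hits])
```

Requiring every term to match keeps the result list short; swapping `all` for `any` would give a broader search.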
5. Searching through Google Scholar for relevant articles
Here we are able to search through Google Scholar
by selecting ‘y’ and then typing in a search query
6. Downloading all freely available PDFs of search results
The program now asks the user whether they would like to
download all PDFs that are freely available.
Note: this footage is from a different run of the program than
the previous slide, which is why I did not search Google
Scholar again (I had already done so).
7. Downloading all PDFs available through Sci-Hub
Now the program asks the user whether they would
also like to search Sci-Hub for PDFs of papers
that would otherwise be unavailable for download
Note: It is illegal to release any papers obtained from
Sci-Hub to the public
8. Fixing the formatting of the PDF files that were downloaded so plain text can be extracted
Here we are going through each PDF that has
been downloaded, and fixing the formatting so
that it can be converted to plaintext by the
package PyPDF.
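The slides do not say what the reformatting step does internally. Since some downloads arrive with HTML mixed in, one plausible piece is trimming everything outside the PDF's own `%PDF-` header and `%%EOF` marker. The sketch below shows that idea only; `trim_to_pdf` is a hypothetical helper, not LoA's actual code.

```python
def trim_to_pdf(data: bytes) -> bytes:
    """Drop any bytes before the '%PDF-' header and after the last '%%EOF'.

    Raises ValueError if no PDF header is present.
    """
    start = data.find(b"%PDF-")
    if start == -1:
        raise ValueError("no PDF header found")
    end = data.rfind(b"%%EOF")
    if end < start:
        # No end-of-file marker after the header: keep everything after it.
        return data[start:]
    return data[start:end + len(b"%%EOF")] + b"\n"


# Made-up bytes standing in for a download wrapped in HTML.
raw = b"<html><body>redirect</body></html>%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n</html>"
clean = trim_to_pdf(raw)
print(clean[:8])
```

The cleaned bytes could then be written back to disk and handed to the PDF-to-text step.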
9. Getting the information from huggingface.co to download our preferred model
Here I am just showing which part of the
huggingface.co link is necessary for the
program to download the model of choice
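As a rough illustration, and assuming the needed part is the `organization/model-name` path that follows the host in a model page link, the extraction could be sketched as below. The `model_id_from_link` helper and the example link are hypothetical.

```python
from urllib.parse import urlparse


def model_id_from_link(link: str) -> str:
    """Return the first two path segments of a model link, e.g. 'org/name'.

    Assumes the link has the form huggingface.co/<org>/<name>[/...];
    anything after those two segments (such as '/tree/main') is dropped.
    """
    parsed = urlparse(link if "://" in link else "https://" + link)
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        raise ValueError(f"no model path in {link!r}")
    return "/".join(segments[:2])


# Hypothetical link, for illustration only.
print(model_id_from_link("https://huggingface.co/example-org/example-model/tree/main"))
```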
10. Downloading our preferred model
Here you can see that, when prompted, we chose ‘d’
to download a model, and pasted in the part
of the huggingface.co link we grabbed
11. Running question-answering using our model of choice to extract information
Here I show that, once a model is downloaded, you can
simply select it from the list of available
ones.
After I make my choice, it’s finally time to extract!
Note: you can use [answer] in a question to insert the
previous answer into the current question
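A minimal sketch of how the `[answer]` placeholder could chain questions together. The `run_question_chain` helper is hypothetical, and `toy_answer_fn` is a stand-in for the real question-answering model that returns made-up canned answers.

```python
def run_question_chain(questions, context, answer_fn):
    """Ask questions in order, replacing '[answer]' with the previous answer.

    answer_fn(question, context) stands in for the question-answering model.
    Returns a list of (filled_question, answer) pairs.
    """
    previous = ""
    results = []
    for question in questions:
        filled = question.replace("[answer]", previous)
        previous = answer_fn(filled, context)
        results.append((filled, previous))
    return results


def toy_answer_fn(question, context):
    # Placeholder for a real model: made-up answers keyed on the question text.
    canned = {
        "What compound was studied?": "compound X",
        "What is the quantum yield of compound X?": "0.50",
    }
    return canned.get(question, "")


chain = run_question_chain(
    ["What compound was studied?", "What is the quantum yield of [answer]?"],
    context="(plain text of a paper would go here)",
    answer_fn=toy_answer_fn,
)
for question, answer in chain:
    print(question, "->", answer)
```

Chaining this way lets a general first question narrow down the later, more specific ones.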
12. Automation!
This all takes a very long time, especially the question-answering step.
Calling ‘python main.py --auto’ makes the program read a script file
of my own design and run in a batchable, non-interactive way.
The file structure I’ve made is straightforward, and quite resistant to
errors in syntax. It simply looks for ‘task =’ in each line, and runs the
specified task using the variables specified below it, until it reaches a
line containing ‘#end’. Any number of these task blocks can be placed
in the same automation.txt file.
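The parsing rule described above can be sketched as follows. This is an illustration under assumptions, not the program's actual parser: the `key = value` form of the variable lines, the task names, and the variable names in the sample are all hypothetical.

```python
def parse_automation(text):
    """Parse 'task = name' blocks into a list of (task, variables) pairs.

    Lines between a 'task =' line and the next line containing '#end'
    are read as 'key = value' variables; other lines are ignored.
    A block that never reaches '#end' is dropped.
    """
    jobs = []
    current = None  # (task_name, variables) while inside a block
    for line in text.splitlines():
        if "#end" in line:
            if current is not None:
                jobs.append(current)
            current = None
        elif "task =" in line:
            name = line.split("=", 1)[1].strip()
            current = (name, {})
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return jobs


# Hypothetical automation.txt contents, for illustration only.
sample = """\
task = search
terms = quantum yield
#end

task = extract
model = example-org/example-model
question = What compound was studied?
#end
"""
jobs = parse_automation(sample)
print(jobs)
```

Because unrecognized lines are skipped rather than rejected, stray blank lines or notes in the file do not stop the run.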
14. Planned features:
• Expected answer type
• For loop as a result of numerical answer
• Functionalize parts of scraping+QA, so that they can be accessed
independently (such as downloading models)
• Add support for more types of models (GPT, IR, T5, etc.)
15. Planned features cont.:
• Add data management screen for cleanup
• More pre-print servers
• Way to allow user to input credentials for specific pre-print servers
that only allow scraping with permission
• OCR for text in images
16. Current issues/limitations:
• Scholar search kicks the user off when a CAPTCHA appears
• Some downloaded PDFs must be re-formatted to remove HTML
• Lengthy step downloading PDFs and converting them to plaintext
• Requires ~500 GB of RAM to run
• Only works on Linux
17. Thanks!
• To Dr. Alice Walker for suggesting this as an area of research, and
supporting my pursuit of it
• To Dr. Mark Hix for helping me along as I learn Python
• To anyone and everyone interested in using my program, and/or
providing feedback and suggestions