LoA (Librarian of Alexandria): An AI-Powered Linux-Python Tool for Comprehensive Extraction of Chemical Data

‘LoA’
(Librarian of Alexandria)
A tool for mass-searching pre-print servers, downloading PDFs of
research papers, and then using AI to extract information from them.
By: Morgan Grougan

Background info:
• There are millions of research papers in existence (hard to estimate
an exact figure).
• Scraping pre-print servers is an existing technology.
• Megatron-BERT AI models are trained to understand specific fields of
interest and the relevant vocabulary.
• Combining scraping with these models, we can create tables of
information that would otherwise be incredibly hard to collect.
• Using these tables, we can train new models to predict the
information we would otherwise be extracting (such as quantum
yield and the frequency of the fluoresced color).

Main screen, and scraping papers
The program asks the user whether they would like to update the
metadata of the pre-print servers.
This must be done on the first run.

Searching through downloaded metadata for relevant articles
The program will now ask the user for search terms,
and check the metadata of the servers for matching
articles.
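
The matching step can be pictured with a minimal sketch. It assumes each metadata record is a dictionary with "title" and "abstract" keys and that an article matches when every search term appears in one of those fields; the record layout, the matching rule, and the name search_metadata are illustrative assumptions, not LoA's actual code.

```python
from typing import Dict, Iterable, List

def search_metadata(records: Iterable[Dict[str, str]], terms: List[str]) -> List[Dict[str, str]]:
    """Return the records whose title or abstract contains every search term."""
    # Case-insensitive matching; blank terms are ignored.
    lowered = [term.lower() for term in terms if term.strip()]
    matches = []
    for record in records:
        haystack = (record.get("title", "") + " " + record.get("abstract", "")).lower()
        if all(term in haystack for term in lowered):
            matches.append(record)
    return matches

# Made-up records, for illustration only.
records = [
    {"title": "Quantum yield of a fluorescent dye", "abstract": "We measure emission."},
    {"title": "Traffic forecasting with graphs", "abstract": "A routing study."},
]
hits = search_metadata(records, ["quantum yield", "fluorescent"])
print([record["title"] for record in hits])
```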

Searching through Google Scholar for relevant articles
Here we are able to search through Google Scholar
by selecting ‘y’ and then typing in a search query.

Downloading all freely available PDFs of search results
The program now asks the user whether they would like to
download all PDFs that are freely available.
Note: this footage is from a different run of the program than
the previous slide, which is why I did not search Google
Scholar again.

Downloading all PDFs available through Sci-Hub
Now the program asks the user whether they would also
like to search Sci-Hub for PDFs of papers
that would otherwise be unavailable for download.
Note: It is illegal to release any papers obtained from
Sci-Hub to the public

Fixing the formatting of the downloaded PDF files so plain text can be extracted
Here we are going through each PDF that has
been downloaded and fixing the formatting so
that it can be converted to plain text by the
package PyPDF.
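
The slides do not say what the formatting fix involves. One check that could reasonably run before it is sketched below, on the assumption that some downloads are HTML pages saved with a .pdf extension (the limitations slide mentions PDFs that contain HTML). The sketch only tests whether a file begins with the PDF signature "%PDF-"; the function names are hypothetical and this is not LoA's actual repair step.

```python
from pathlib import Path
from typing import List, Tuple

PDF_SIGNATURE = b"%PDF-"  # every well-formed PDF file starts with these bytes

def looks_like_pdf(path: Path) -> bool:
    """Return True if the file begins with the PDF signature."""
    with open(path, "rb") as handle:
        return handle.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE

def split_downloads(folder: Path) -> Tuple[List[Path], List[Path]]:
    """Split *.pdf files in a folder into (real PDFs, files needing attention)."""
    real, suspect = [], []
    for path in sorted(folder.glob("*.pdf")):
        (real if looks_like_pdf(path) else suspect).append(path)
    return real, suspect
```

Files in the second list would then be candidates for re-downloading or for whatever cleanup the program applies.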

Getting the information from huggingface.co to download our preferred model
Here I am just showing what part of the
huggingface.co link is necessary for the
program to download the model of choice
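
As a rough illustration, the sketch below assumes the necessary part of the link is the repository identifier that follows the host name (for example "owner/model-name"). The slide does not spell this out, so treat that as an assumption; the helper name model_id_from_url and the example URL are illustrative only.

```python
from urllib.parse import urlparse

# Path segments that follow a model id on file-browser pages.
PAGE_SEGMENTS = {"tree", "blob", "resolve"}

def model_id_from_url(url: str) -> str:
    """Extract 'owner/name' (or a bare 'name') from a huggingface.co model link."""
    # Accept links pasted without a scheme, e.g. "huggingface.co/owner/name".
    parsed = urlparse(url if "://" in url else "https://" + url)
    if parsed.netloc not in ("huggingface.co", "www.huggingface.co"):
        raise ValueError("not a huggingface.co link: " + url)
    parts = [part for part in parsed.path.split("/") if part]
    if not parts:
        raise ValueError("no model id found in: " + url)
    # A model without an owner prefix has a single-segment id.
    if len(parts) == 1 or parts[1] in PAGE_SEGMENTS:
        return parts[0]
    return parts[0] + "/" + parts[1]

print(model_id_from_url("https://huggingface.co/deepset/roberta-base-squad2"))
```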

Downloading our preferred model
Here you can see that, when prompted, we chose ‘d’
to download a model and pasted in the part
of the huggingface.co link we grabbed.

Running question-answering using our model of choice to extract information
Here I show that, once a model has been downloaded, you can
simply select it from the list of available models.
After I make my choice, it’s finally time to extract!
Note: you can use [answer] in a question to insert the
previous answer into the current question
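
The [answer] placeholder can be sketched as a simple string substitution applied before each question is asked. In the sketch below, the function run_question_chain and the stand-in toy_ask are hypothetical; a real run would call the selected question-answering model instead, and the context sentence is made up for illustration.

```python
from typing import Callable, List

def run_question_chain(questions: List[str], context: str,
                       ask: Callable[[str, str], str]) -> List[str]:
    """Ask questions in order, replacing [answer] with the previous answer."""
    answers = []
    previous = ""  # the first question has no previous answer
    for question in questions:
        filled = question.replace("[answer]", previous)
        previous = ask(filled, context)
        answers.append(previous)
    return answers

def toy_ask(question: str, context: str) -> str:
    """Stand-in for a real QA model: fixed replies, for demonstration only."""
    if question.startswith("Which compound"):
        return "Compound A"
    if "Compound A" in question:
        return "0.5"
    return ""

context = "Compound A has a quantum yield of 0.5."  # made-up text
questions = ["Which compound is studied?", "What is the quantum yield of [answer]?"]
answers = run_question_chain(questions, context, toy_ask)
print(answers)
```

Here the second question is sent to the model as "What is the quantum yield of Compound A?".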

Automation!
This all takes a very long time, especially the question-answering step.
Calling ‘python main.py –auto’ makes the program read a script file
of my own design and run in a batchable, non-interactive way.
The file structure is straightforward and quite resistant to
syntax errors. The program simply looks for ‘task =’ in each line and
runs the specified task using the variables listed below it, until it
reaches a line containing ‘#end’. Any number of tasks can be batched
this way in the same automation.txt file.
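
A minimal sketch of how such a file could be read is shown below. Only the ‘task =’ and ‘#end’ markers come from the slide; the assumption that variables are written as "name = value" lines, the task names in the sample, and the function parse_automation are illustrative, not the program's actual format or code.

```python
from typing import Any, Dict, List

def parse_automation(text: str) -> List[Dict[str, Any]]:
    """Collect each 'task =' block, with its variables, up to the '#end' line."""
    tasks = []
    current = None  # the task block being read, if any
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if "#end" in line:
            if current is not None:
                tasks.append(current)
                current = None
        elif "task =" in line:
            current = {"task": line.split("=", 1)[1].strip(), "vars": {}}
        elif current is not None and "=" in line:
            # Assumed variable syntax: name = value
            name, value = line.split("=", 1)
            current["vars"][name.strip()] = value.strip()
        # Anything else (blank lines, stray text) is skipped.
    return tasks

# Hypothetical file contents; the task and variable names are made up.
sample = """
task = scrape
terms = quantum yield
#end
stray line that is ignored
task = extract
model = my-model
#end
"""
tasks = parse_automation(sample)
print(tasks)
```

Skipping unrecognized lines instead of raising an error is one way to get the tolerance to syntax mistakes described above.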

Planned features:
• Expected answer type
• For loop as a result of numerical answer
• Functionalize parts of scraping+QA, so that they can be accessed
independently (such as downloading models)
• Add support for more types of models (GPT, IR, T5, etc.)

Planned features cont.:
• Add data management screen for cleanup
• More pre-print servers
• Way to allow user to input credentials for specific pre-print servers
that only allow scraping with permission
• OCR for text in images

Current issues/limitations:
• Scholar search kicks the user off when a CAPTCHA appears
• Some downloaded PDFs must be re-formatted to remove HTML
• Downloading PDFs and converting them to plain text is a lengthy step
• Requires ~500 GB of RAM to run
• Only works on Linux

Thanks!
• To Dr. Alice Walker for suggesting this as an area of research and
supporting my pursuit of it
• To Dr. Mark Hix for helping me along as I learn Python
• To anyone and everyone interested in using my program and/or
providing feedback and suggestions

Download:
github.com/MorganRO8/LoA
Email:
gi1632@wayne.edu
