‘LoA’
(Librarian of Alexandria)
A tool for mass-searching pre-print servers, downloading PDFs of
research papers, and then using AI to extract information from them.
By: Morgan Grougan
Background info:
• There are millions of research papers in existence (hard to estimate
an exact figure).
• Scraping pre-print servers is an existing technology.
• Megatron-BERT AI models are trained to understand specific fields of
interest and the relevant vocabulary.
• Combining these capabilities, we can create tables of information that
would otherwise be incredibly hard to collect.
• Using these tables, we can train new models to predict the
information we would otherwise be extracting (such as quantum
yield and the frequency of the fluoresced color).
Main screen, and scraping papers
The program asks the user whether to update the metadata.
This must be done on the first run.
Searching through downloaded metadata for relevant articles
The program will now ask the user for search terms,
and check the metadata of the servers for matching
articles.
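A minimal sketch of this kind of metadata search, assuming each metadata record is a dictionary with 'title', 'abstract' and 'doi' fields and that an article matches when every search term appears in its title or abstract. The field names, the matching rule and the sample records are illustrative assumptions, not LoA's actual format.

```python
from typing import Iterable


def matches(record: dict, terms: Iterable[str]) -> bool:
    """Return True if every search term appears in the title or abstract."""
    haystack = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    return all(term.lower() in haystack for term in terms)


def search_metadata(records: Iterable[dict], terms: Iterable[str]) -> list[dict]:
    """Keep only the records that match all of the search terms."""
    terms = list(terms)  # allow the terms to be reused for every record
    return [record for record in records if matches(record, terms)]


# Hypothetical records standing in for downloaded server metadata.
records = [
    {"title": "Quantum yield of a new fluorophore",
     "abstract": "Emission was measured in ethanol.",
     "doi": "10.0000/example.1"},
    {"title": "Graph neural networks",
     "abstract": "A survey.",
     "doi": "10.0000/example.2"},
]

hits = search_metadata(records, ["quantum yield", "fluorophore"])
print([record["doi"] for record in hits])
```

Requiring every term to match keeps the result list small; switching `all` to `any` would give a broader search.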
Searching through Google Scholar for relevant articles
Here we are able to search through Google Scholar
by selecting ‘y’ and then typing in a search query.
Downloading all freely available PDFs of search results
The program now asks the user whether to download all
PDFs that are freely available.
Note: this footage is from a different run of the program than
the previous slide, which is why I did not search Google
Scholar again.
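A download step of this kind can be sketched as follows, assuming each search result provides a direct PDF URL. The function names are hypothetical, and checking for the '%PDF-' header is an added safeguard for illustration (servers sometimes return an HTML page instead of a PDF), not something the slides describe.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw bytes behind a URL using only the standard library."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def looks_like_pdf(data: bytes) -> bool:
    """PDF files begin with the marker '%PDF-'; HTML error pages do not."""
    return data.lstrip()[:5] == b"%PDF-"


def save_if_pdf(url: str, dest: Path,
                fetch: Callable[[str], bytes] = fetch_bytes) -> bool:
    """Fetch a URL and write it to dest only if the payload is a PDF."""
    data = fetch(url)
    if not looks_like_pdf(data):
        return False  # skip HTML pages, captcha pages, and similar responses
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return True
```

Passing the fetch function as a parameter makes it possible to try the logic without network access, for example `save_if_pdf(url, Path("pdfs/a.pdf"), fetch=lambda u: b"%PDF-1.7")`.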
Downloading all PDFs available through Sci-Hub
The program now asks the user whether to also
search Sci-Hub for PDFs of papers that would
otherwise be unavailable for download.
Note: It is illegal to release any papers obtained from
Sci-Hub to the public.
Fixing the formatting of the downloaded PDF files so plain text can be extracted
Here the program goes through each PDF that has
been downloaded and fixes the formatting so
that it can be converted to plain text by the
package PyPDF.
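The slides do not show how the formatting repair is done, so the sketch below covers only the plain-text conversion that follows it. It assumes the 'pypdf' package (the slide says only "PyPDF", so the exact package is a guess) and adds a hypothetical tidy-up pass that re-joins words hyphenated across line breaks.

```python
import re


def extract_pages(pdf_path: str) -> list[str]:
    """Return the text of each page of a PDF, one string per page."""
    # Third-party dependency, imported lazily; assumed to be 'pypdf'.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def tidy(text: str) -> str:
    """Light cleanup of extracted text (illustrative, not LoA's own rules)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # re-join hyphenated words
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)        # collapse runs of blank lines
    return text.strip()


def pdf_to_text(pdf_path: str) -> str:
    """Convert a whole PDF into one cleaned-up plain-text string."""
    return tidy("\n".join(extract_pages(pdf_path)))
```

For example, `tidy("quan-\ntum   yield")` gives `"quantum yield"`.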
Getting the information from huggingface.co to download our preferred model
Here I am just showing what part of the
huggingface.co link is necessary for the
program to download the model of choice
Downloading our preferred model
Here you can see that, when prompted, we chose ‘d’
(download a model) and pasted in the part
of the huggingface.co link we grabbed.
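As an illustration of which part of the link matters, the sketch below pulls an 'owner/name' model identifier out of a huggingface.co URL. That this identifier is exactly what the program expects is an assumption, and both the helper and the example model are chosen only for demonstration.

```python
from urllib.parse import urlparse

# Path segments that follow the model identifier on model sub-pages.
SUBPAGE_SEGMENTS = {"tree", "blob", "resolve", "discussions", "commits"}


def model_id_from_url(url_or_id: str) -> str:
    """Return the model identifier contained in a huggingface.co link."""
    parsed = urlparse(url_or_id)
    path = parsed.path if parsed.netloc else url_or_id
    parts = [segment for segment in path.split("/") if segment]
    if parts and parts[0] == "huggingface.co":
        parts = parts[1:]  # link pasted without the https:// prefix
    kept = []
    for segment in parts:
        if segment in SUBPAGE_SEGMENTS:
            break  # everything from here on is page navigation, not the id
        kept.append(segment)
    return "/".join(kept[:2])


print(model_id_from_url("https://huggingface.co/deepset/roberta-base-squad2"))
```

An identifier that is already in 'owner/name' form passes through unchanged, so the helper can be applied to whatever the user pastes.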
Running question-answering using our model of choice to extract information
Here I show that, once a model has been downloaded, you can
simply select it from the list of available
models.
After I make my choice, finally, it’s time to extract!
Note: you can use [answer] in a question to insert the
previous answer into the current question
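The [answer] substitution can be sketched as a simple loop. The model is represented by any function that takes a question and a context and returns an answer string; 'run_chain' and 'toy_ask' are hypothetical names, and the toy function only stands in for a real question-answering model.

```python
from typing import Callable, Iterable


def run_chain(questions: Iterable[str], context: str,
              ask: Callable[[str, str], str]) -> list[str]:
    """Ask questions in order, inserting the previous answer for [answer]."""
    answers = []
    previous = ""  # the first question has no earlier answer to insert
    for question in questions:
        filled = question.replace("[answer]", previous)
        previous = ask(filled, context)
        answers.append(previous)
    return answers


def toy_ask(question: str, context: str) -> str:
    """Stand-in for a model: echoes the question it actually received."""
    return "compound X" if "[" not in question and "of" not in question else question


chain = ["Which compound is studied?", "What is the quantum yield of [answer]?"]
print(run_chain(chain, "paper text goes here", toy_ask))
```

With the Hugging Face transformers library, `ask` could wrap a question-answering pipeline, for example `lambda q, c: qa(question=q, context=c)["answer"]` where `qa = pipeline("question-answering", model=model_id)`.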
Automation!
This all takes a very long time, especially the question-answering step.
Calling ‘python main.py –auto’ makes the program read a script file
of my own design and run in a batchable, non-interactive way.
The file structure I’ve made is straightforward and quite resistant to
errors in syntax. It simply looks for ‘task =’ in each line and runs the
specified task using the variables specified below it, until it reaches a
line containing ‘#end’. Any number of tasks can be batched in the
same automation.txt file.
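A parser for a file of this shape can be sketched as below. Only the 'task =' and '#end' markers come from the description above; the 'name = value' form of the variable lines, the task names and the variable names in the sample are hypothetical, as is the choice to accept a task that is missing its '#end'.

```python
def parse_automation(text: str) -> list[dict]:
    """Split an automation script into tasks with their variables."""
    tasks = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if "#end" in line:
            if current is not None:
                tasks.append(current)
                current = None
            continue
        if "=" not in line:
            continue  # tolerate blank lines and stray text
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "task":
            if current is not None:
                tasks.append(current)  # lenient: previous task lacked '#end'
            current = {"task": value, "vars": {}}
        elif current is not None:
            current["vars"][key] = value
    if current is not None:
        tasks.append(current)  # lenient: last task lacked '#end'
    return tasks


# Hypothetical script; not the contents of the real automation.txt.
sample = """
task = scrape
query = quantum yield
#end

task = extract
question = What is the quantum yield?
#end
"""

for task in parse_automation(sample):
    print(task["task"], task["vars"])
```

Because unrecognized lines are skipped rather than rejected, small syntax slips do not stop the batch.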
Automation.txt:
Planned features:
• Expected answer type
• For loop as a result of numerical answer
• Functionalize parts of scraping+QA, so that they can be accessed
independently (such as downloading models)
• Add support for more types of models (GPT, IR, T5, etc.)
Planned features cont.:
• Add data management screen for cleanup
• More pre-print servers
• Way to allow user to input credentials for specific pre-print servers
that only allow scraping with permission
• OCR for text in images
Current issues/limitations:
• Scholar search kicks user off because a captcha appears
• Some downloaded PDFs must be re-formatted to remove HTML
• Downloading PDFs and converting them to plain text is a lengthy step
• Requires ~500 GB of RAM to run
• Only works on Linux
Thanks!
• To Dr. Alice Walker for suggesting this as an area of research, and
supporting my pursuit of it
• To Dr. Mark Hix for helping me along as I learn Python
• To anyone and everyone interested in using my program, and/or
providing feedback and suggestions
Download:
github.com/MorganRO8/LoA
Email:
gi1632@wayne.edu

LoA (Librarian of Alexandria): An AI-Powered Linux-Python Tool for Comprehensive Extraction of Chemical Data

  • 1. ‘LoA’ (Librarian of Alexandria) A tool for mass searching pre-print servers, downloading pdfs of research papers, and then using AI to pull information from them. By: Morgan Grougan
  • 2. Background info:
      • There are millions of research papers in existence (an exact figure is hard to estimate).
      • Scraping pre-print servers is an existing technology.
      • Megatron-BERT AI models are trained to understand specific fields of interest and the relevant vocabulary.
      • Using these functions, we can create tables of information that would otherwise be incredibly hard to collect.
      • Using these tables, we can train new models to predict the information we would otherwise be extracting (such as quantum yield and the frequency of the color fluoresced).
  • 3. Main screen and scraping papers: The program asks the user whether they would like to update the metadata. This must be done on the first run.
  • 4. Searching through downloaded metadata for relevant articles: The program now asks the user for search terms and checks the metadata of the servers for matching articles.
  • 5. Searching through Google Scholar for relevant articles: Here we can search Google Scholar by selecting ‘y’ and then typing in a search query.
  • 6. Downloading all freely available PDFs of search results: The program now asks the user whether they would like to download all PDFs that are normally available for free. Note: this footage is from a different run of the program than the previous slide, which is why I did not search Google Scholar again (I had already done so).
  • 7. Downloading all PDFs available through Sci-Hub: The program now asks the user whether they would also like to search Sci-Hub for PDFs of papers that would otherwise be unavailable for download. Note: It is illegal to release any papers obtained from Sci-Hub to the public.
  • 8. Fixing the formatting of the downloaded PDF files so plain text can be extracted: Here we go through each PDF that has been downloaded and fix its formatting so that it can be converted to plain text by the package PyPDF.
  • 9. Getting the information from huggingface.co to download our preferred model: Here I show which part of the huggingface.co link the program needs in order to download the model of choice.
  • 10. Downloading our preferred model: Here you can see that, when prompted, we chose ‘d’ to download a model and pasted in the part of the huggingface.co link we grabbed.
  • 11. Running question-answering using our model of choice to extract information: Here I show that, once a model is downloaded, you can simply select it from the list of available ones. After I make my choice, it is finally time to extract! Note: you can use [answer] in a question to insert the previous answer into the current question.
  • 12. Automation! This all takes a very long time, especially the question-answering step, so by calling ‘python main.py –auto’ the program will read a script file of my own design and run in a batchable, non-interactive way. The file structure I’ve made is straightforward and quite resistant to syntax errors. It simply looks for ‘task =’ in each line and runs the specified task using the variables specified below it, until it reaches a line containing ‘#end’. Any number of tasks can be batched this way in the same automation.txt file.
  • 14. Planned features:
      • Expected answer type
      • For loop as a result of numerical answer
      • Functionalize parts of scraping + QA so that they can be accessed independently (such as downloading models)
      • Add support for more types of models (GPT, IR, T5, etc.)
  • 15. Planned features cont.:
      • Add a data management screen for cleanup
      • More pre-print servers
      • A way for the user to input credentials for pre-print servers that only allow scraping with permission
      • OCR for text in images
  • 16. Current issues/limitations:
      • Scholar search kicks the user off because a captcha appears
      • Some downloaded PDFs must be re-formatted to remove HTML
      • Downloading PDFs and converting them to plain text is a lengthy step
      • Requires ~500 GB of RAM to run
      • Only works on Linux
  • 17. Thanks!
      • To Dr. Alice Walker for suggesting this as an area of research and supporting my pursuit of it
      • To Dr. Mark Hix for helping me along as I learn Python
      • To anyone and everyone interested in using my program and/or providing feedback and suggestions
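
The automation file format described in slide 12 could be read roughly as follows. This is only a minimal sketch of the idea, not LoA's actual code: the function name `parse_automation`, the `key = value` layout of the variable lines, and the example task and variable names are all assumptions made for illustration.

```python
def parse_automation(text):
    """Split an automation script into a list of task blocks.

    Assumed layout (hypothetical, based on the slide's description):
    a line containing 'task =' opens a block, the 'key = value' lines
    below it are that task's variables, and a line containing '#end'
    closes the block. Anything else is ignored.
    """
    tasks = []
    current = None  # the block being filled, or None when outside a block
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if "#end" in line:
            # '#end' closes the current block, if one is open.
            if current is not None:
                tasks.append(current)
                current = None
        elif "task =" in line:
            # A new task starts; keep any block that was never closed.
            if current is not None:
                tasks.append(current)
            name = line.split("=", 1)[1].strip()
            current = {"task": name, "variables": {}}
        elif current is not None and "=" in line:
            # Inside a block, 'key = value' lines become variables.
            key, value = line.split("=", 1)
            current["variables"][key.strip()] = value.strip()
    if current is not None:
        # Assumption: a final block without '#end' is still kept.
        tasks.append(current)
    return tasks


# Hypothetical automation.txt content; task and variable names are made up.
example = """
task = search
terms = quantum yield
#end
a stray line that is not part of any task
task = download
source = preprints
#end
"""

for block in parse_automation(example):
    print(block["task"], block["variables"])
```

Because lines that do not match any rule are skipped rather than rejected, a reader built this way tolerates stray text between tasks, which matches the slide's claim that the format is resistant to syntax errors.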
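
The [answer] placeholder mentioned in slide 11 can be illustrated with a short sketch. It assumes that each question is asked in order and that the literal text `[answer]` is replaced by the answer to the previous question. The function `chain_questions` and the stand-in `answer_fn` are hypothetical; in LoA the answers would come from the selected question-answering model, and the canned answers below are placeholders, not real data.

```python
def chain_questions(questions, answer_fn):
    """Ask questions in order, inserting the previous answer for '[answer]'.

    answer_fn is any callable that maps a question string to an answer
    string; here it stands in for a question-answering model.
    """
    results = []
    previous = ""  # there is no previous answer before the first question
    for question in questions:
        filled = question.replace("[answer]", previous)
        previous = answer_fn(filled)
        results.append((filled, previous))
    return results


# Placeholder answers standing in for model output (not real data).
canned = {
    "What molecule is studied?": "compound X",
    "What is the quantum yield of compound X?": "value Y",
}

results = chain_questions(
    ["What molecule is studied?", "What is the quantum yield of [answer]?"],
    lambda question: canned.get(question, ""),
)
for filled, answer in results:
    print(filled, "->", answer)
```

Chaining questions this way lets a general first question narrow down the subject before a more specific second question is asked about it.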