‘LoA’
(Librarian of Alexandria)
A tool for mass searching pre-print servers, downloading PDFs of
research papers, and then using AI to pull information from them.
By: Morgan Grougan
Background info:
• There are millions of research papers in existence (hard to estimate
an exact figure).
• Scraping pre-print servers is an existing technology.
• Megatron-BERT AI models are trained to understand specific fields of
interest and the relevant vocabulary.
• By combining scraping with these models, we can create tables of
information that would otherwise be incredibly hard to collect.
• Using these tables, we can train new models to predict the
information we would otherwise be extracting (such as quantum
yield and the frequency of the fluoresced color).
Main screen, and scraping papers
The program asks the user whether they would like to update the metadata.
This must be done on the first run.
Searching through downloaded metadata for relevant articles
The program will now ask the user for search terms,
and check the metadata of the servers for matching
articles.
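A minimal sketch of how such a metadata search could work, not the program's actual code. It assumes each metadata record is a dictionary with `title` and `abstract` fields and that a record matches only when every search term appears; both the field names and the matching rule are assumptions made for illustration.

```python
def matches(record, terms):
    """Return True if every search term appears in the title or abstract."""
    # Field names "title" and "abstract" are assumed, not taken from LoA.
    text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    return all(term.lower() in text for term in terms)


def search_metadata(records, terms):
    """Keep only the records that match all of the search terms."""
    return [record for record in records if matches(record, terms)]


# Hypothetical records standing in for downloaded pre-print metadata.
records = [
    {"title": "Quantum yield of a new fluorophore", "abstract": "We report ..."},
    {"title": "Graph neural networks", "abstract": "A survey."},
]
hits = search_metadata(records, ["quantum yield", "fluorophore"])
```

Matching is case-insensitive substring search, which is simple but will miss synonyms and spelling variants.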
Searching through Google Scholar for relevant articles
Here we can search Google Scholar
by selecting ‘y’ and then typing in a search query.
Downloading all freely available PDFs of search results
The program now asks the user whether they would like to
download all PDFs that are normally available for free.
Note: this footage is from a different run of the program than
the previous slide, which is why I did not search Google
Scholar again.
Downloading all PDFs available through Sci-Hub
The program now asks the user whether they would
also like to search Sci-Hub for PDFs of papers
that would otherwise be unavailable for download.
Note: It is illegal to release any papers obtained from
Sci-Hub to the public.
Fixing the formatting of the downloaded PDF files so plain text can be extracted
Here we go through each PDF that has
been downloaded and fix its formatting so
that it can be converted to plain text by the
package PyPDF.
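The slide does not say what the formatting fix involves. As one possible preliminary check, the sketch below classifies a downloaded file by its first bytes, since a download can turn out to be an HTML page rather than a real PDF. The function name and the category labels are hypothetical, and this is not taken from the program itself.

```python
def classify_download(data: bytes) -> str:
    """Guess what a downloaded file really is from its first bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    # A PDF header that is present but not at byte 0 suggests leading junk.
    if b"%PDF-" in data[:1024]:
        return "pdf_with_leading_junk"
    return "unknown"


# Hypothetical byte strings standing in for downloaded files.
kinds = [
    classify_download(b"%PDF-1.7\n..."),
    classify_download(b"<!DOCTYPE html><html><body>Not found</body></html>"),
    classify_download(b"\n\n%PDF-1.4\n..."),
]
```

Files classified as anything other than `pdf` would be candidates for repair or re-download before text extraction.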
Getting the information from huggingface.co to download our preferred model
Here I am just showing what part of the
huggingface.co link is necessary for the
program to download the model of choice
Downloading our preferred model
Here you can see that, when prompted, we chose ‘d’
to download a model and pasted in the part
of the huggingface.co link we grabbed.
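The part of the link is shown only in the slide image, so the sketch below assumes it is the model's repository ID, that is, the path after the domain. The function name and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Path segments that follow a repository ID on model pages (assumed list).
PAGE_SUFFIXES = {"tree", "blob", "resolve"}


def model_id_from_url(url: str) -> str:
    """Extract 'owner/name' (or just 'name') from a full huggingface.co URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) >= 2 and parts[1] in PAGE_SUFFIXES:
        # Top-level model with no owner, e.g. /some-model/tree/main
        return parts[0]
    return "/".join(parts[:2])


model_id = model_id_from_url("https://huggingface.co/example-org/example-model")
```

The function expects a full URL including the `https://` scheme; a bare `huggingface.co/...` string would need the scheme added first.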
Running question-answering using our model of choice to extract information
Here I show that, once a model is downloaded, you can simply
select it from the list of available
ones.
After I make my choice, it’s finally time to extract!
Note: you can use [answer] in a question to insert the
previous answer into the current question
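A minimal sketch of how the [answer] placeholder could chain questions together. The question-answering model is replaced here by a stand-in function with canned replies, and all names, questions, and answers are hypothetical placeholders rather than real output.

```python
def run_chain(questions, answer_fn):
    """Ask questions in order, substituting the previous answer for [answer]."""
    results = []
    previous = ""
    for question in questions:
        filled = question.replace("[answer]", previous)
        previous = answer_fn(filled)
        results.append((filled, previous))
    return results


def fake_answer(question):
    """Stand-in for a real question-answering model (canned replies only)."""
    canned = {
        "What compound is studied?": "compound X",
        "What is the quantum yield of compound X?": "placeholder value",
    }
    return canned.get(question, "")


results = run_chain(
    ["What compound is studied?", "What is the quantum yield of [answer]?"],
    fake_answer,
)
```

In real use, `answer_fn` would call the selected model on the text of one paper.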
Automation!
This all takes a very long time, especially the question-answering step.
Calling ‘python main.py –auto’ makes the program read a script file
of my own design and run in a batchable, non-interactive way.
The file structure is straightforward and quite resistant to
syntax errors. The program simply looks for ‘task =’ in each line and runs the
specified task using the variables listed below it, until it reaches a
line containing ‘#end’. Any number of tasks can be batched
in the same automation.txt file.
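A minimal sketch of a parser for this kind of file, not the program's own. It assumes variables are written as `name = value` lines between the ‘task =’ line and ‘#end’; the task names and variable names in the sample are hypothetical.

```python
def parse_automation(text):
    """Parse 'task = ...' blocks, each ended by a line containing '#end'."""
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if "#end" in line:
            if current is not None:
                tasks.append(current)
                current = None
            continue
        if "=" not in line:
            continue  # lines that are not assignments are ignored
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "task":
            if current is not None:
                tasks.append(current)  # previous block had no '#end'
            current = {"task": value, "vars": {}}
        elif current is not None:
            current["vars"][key] = value
    return tasks


# Hypothetical file contents; the real task and variable names may differ.
sample = """
task = scrape
search_terms = quantum yield
#end

this line is ignored
task = extract
model = example-org/example-model
#end
"""
tasks = parse_automation(sample)
```

Ignoring unrecognized lines, rather than raising an error, is one way to get the tolerance to syntax errors described above.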
Automation.txt:
Planned features:
• Expected answer type
• For loop as a result of numerical answer
• Functionalize parts of scraping+QA, so that they can be accessed
independently (such as downloading models)
• Add support for more types of models (GPT, IR, T5, etc.)
Planned features cont.:
• Add data management screen for cleanup
• More pre-print servers
• Way to allow user to input credentials for specific pre-print servers
that only allow scraping with permission
• OCR for text in images
Current issues/limitations:
• Scholar search kicks user off because a captcha appears
• Some downloaded PDFs must be re-formatted to remove HTML
• Lengthy step downloading PDFs and converting them to plaintext
• Requires ~500 GB of RAM to run
• Only works on Linux
Thanks!
• To Dr. Alice Walker for suggesting this as an area of research, and
supporting my pursuit of it
• To Dr. Mark Hix for helping me along as I learn Python
• To anyone and everyone interested in using my program, and/or
providing feedback and suggestions
Download:
github.com/MorganRO8/LoA
Email:
gi1632@wayne.edu

LoA (Librarian of Alexandria): An AI-Powered Linux-Python Tool for Comprehensive Extraction of Chemical Data
