In this work, we present the ChemNLP library, which can be used for 1) curating open-access datasets from the materials and chemistry literature; developing and comparing traditional machine learning, transformer, and graph neural network models for 2) classifying and clustering texts; 3) named entity recognition for large-scale text mining; 4) abstractive summarization for generating article titles from abstracts; 5) text generation for suggesting abstracts from titles; 6) integration with density functional theory datasets for identifying potential candidate materials such as superconductors; and 7) web-interface development for text and reference queries. We primarily use the publicly available arXiv and PubChem datasets, but the tools can be applied to other datasets as well. Moreover, as new models are developed, they can easily be integrated into the library.
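The classification and clustering tasks above typically start from a vector representation of each document. As a minimal, stdlib-only illustration (not ChemNLP's actual implementation; the documents and tokenization are simplified examples), TF-IDF vectors plus cosine similarity are enough to see that two superconductor abstracts group together:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: term frequency weighted by log inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up toy "abstracts": two about superconductors, one about molecules
docs = ["superconductor critical temperature study",
        "superconductor transition temperature measurement",
        "graph neural network for molecules"]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

In practice one would use a library vectorizer and a proper tokenizer, but the grouping logic is the same: similar abstracts score higher than dissimilar ones.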
ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data
1. ChemNLP
A Natural Language Processing based Library for Materials Chemistry Text Data
Kamal Choudhary
https://jarvis.nist.gov/
NIST, Gaithersburg, MD, USA
Polymer group
7/13/2023
4. JARVIS: Databases, Tools, Events, Outreach
Joint Automated Repository for Various Integrated Simulations
https://jarvis.nist.gov
Established: January 2017
Published: >40 articles
Users: >20,000 worldwide
Materials: >80,000, with millions of properties
Events:
• Quantum Matters in Materials Science (QMMS)
• Artificial Intelligence for Materials Science (AIMS)
• JARVIS-School
User comments:
• “There are many different theoretical levels on which you can approach the field. JARVIS is unusual in that it spans more levels than other databases.”
• “A pure gold-mine for the data-quality effort…”
• “Thanks for your generous sharing. Your works inspire me a lot.”
• “You guys are doing something really beneficial…”
• “I find JARVIS-DFT very useful for my research…”
Requires login credentials, free registration
11. Transformers & “Attention Is All You Need”
Handles long-range dependencies much better than RNNs, LSTMs, etc.
Attention gives the model effectively unlimited long-term memory over the input sequence
https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0
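The core operation behind the transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)·V, from "Attention Is All You Need". A stdlib-only sketch with lists of vectors (a teaching illustration, not an optimized implementation):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Tiny example: one query attending over two key/value pairs;
# the query matches the first key, so the output leans toward V[0]
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Because every query attends over all keys at once, the model can relate any two positions in a sequence directly, which is the source of the "extremely long-term memory" noted above.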
16. ChemNLP Webpage for Composition Search
https://jarvis.nist.gov/jarvischemnlp/
17. ChemNLP
Abstractive summarization (Abstract to Title)
Google’s T5 transformer model (225 million parameters)
ROUGE-1 score: 46.5 %
Text generation (Title to Abstract)
GPT-2-medium LLM
ROUGE-1 score: 32 % (26 % without fine-tuning)
ROUGE:
Recall-Oriented Understudy for Gisting Evaluation
T5: Text-to-Text Transfer Transformer
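ROUGE-1 counts overlapping unigrams between a generated text and a reference. A minimal stdlib sketch with naive whitespace tokenization (the scores reported above come from a standard ROUGE package, not this simplification):

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1: clipped unigram overlap between candidate and reference.
    Returns (recall, precision, F1). Naive whitespace tokenization."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each unigram counted at most min(count) times
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / sum(ref.values())       # fraction of reference unigrams recovered
    precision = overlap / sum(cand.values())   # fraction of candidate unigrams that match
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Made-up generated title vs. reference title
r, p, f = rouge1("deep learning for materials text",
                 "machine learning for materials chemistry text")
print(r, p, f)
```

ROUGE is recall-oriented by design (hence the name), which suits summarization: it rewards recovering the reference's content words.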
19. ChemNLP for Superconductors
• Confusion matrix for text classification (137,927 articles): arXiv cond-mat.supr-con and JARVIS-SuperconDB
• Venn diagram for chemical formulas
• ChatGPT response
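A confusion matrix like the one on this slide cross-tabulates true vs. predicted labels. A stdlib sketch with hypothetical labels (the real evaluation uses the 137,927-article arXiv/JARVIS-SuperconDB data, not this toy set):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a len(labels) x len(labels) count matrix; rows = true, cols = predicted."""
    idx = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

# Hypothetical superconductor vs. other abstract labels
labels = ["supercon", "other"]
y_true = ["supercon", "supercon", "other", "other", "other"]
y_pred = ["supercon", "other", "other", "other", "supercon"]
print(confusion_matrix(y_true, y_pred, labels))  # rows: true, cols: predicted
```

The diagonal holds correct predictions; off-diagonal cells show which class gets confused with which, a more informative view than accuracy alone.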
20. JARVIS-Leaderboard: Large Scale Benchmark
Challenges in materials science community:
• Reproducibility
• Transparency
• Validation
• Fidelity
• Data vs. metadata
• What is the ground truth/reference to compare our models to, and how does this change depending on the model?
• Synergy of computational and experimental databases
Community effort to tackle these challenges:
https://pages.nist.gov/jarvis_leaderboard/
22. JARVIS-Leaderboard: Methods and Data
Types of Data:
• Atomic structure (Molecule, Crystal)
• Material Property (Bandgap, bulk modulus)
• Images (Microscopy: SEM, TEM, STM)
• Spectra (Diffraction: X-ray, Neutron, PL)
• Text (Research articles, notebooks, blogs)
• Eigensolver (Quantum Computation algorithms)
Types of Methods:
1) Electronic Structure
2) Artificial Intelligence
3) Force Field
4) Quantum Computation
5) Experiment
23. JARVIS-Leaderboard: Benchmarks
Contributions
1) Electronic Structure
2) Artificial Intelligence
3) Force Field
4) Quantum Computation
5) Experiment
Benchmarks (reference point)
1) Experiment/s
2) Test dataset
3) Electronic Structure
4) Analytical results
5) Other Experiments
Error metrics
*Benchmarks must be well-defined with an associated DOI
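The error metrics used to rank leaderboard contributions against a benchmark reference are standard regression measures such as MAE and RMSE. A stdlib sketch with made-up bandgap values (illustrative only; the leaderboard defines its own metrics per benchmark):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between predictions and reference values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error; penalizes large deviations more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical bandgap predictions (eV) vs. a reference dataset
ref = [1.1, 0.0, 3.4, 2.2]
pred = [1.0, 0.2, 3.0, 2.5]
print(mae(ref, pred), rmse(ref, pred))
```

Reporting the same metric against the same DOI-tagged reference data is what makes contributions from different methods directly comparable.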