In this work, we present the ChemNLP library, which can be used for 1) curating open-access datasets from the materials and chemistry literature; developing and comparing traditional machine learning, transformer, and graph neural network models for 2) classifying and clustering texts; 3) named entity recognition for large-scale text mining; 4) abstractive summarization for generating article titles from abstracts; 5) text generation for suggesting abstracts from titles; 6) integration with density functional theory datasets for identifying potential candidate materials such as superconductors; and 7) web-interface development for text and reference queries. We primarily use the publicly available arXiv and PubChem datasets, but the tools can be applied to other datasets as well. Moreover, as new models are developed, they can easily be integrated into the library.
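The classification and clustering tasks above typically start from a vector representation of each document. As a minimal, stdlib-only illustration (not ChemNLP's actual implementation; the documents and tokenization are simplified examples), TF-IDF vectors plus cosine similarity are enough to see that two superconductor abstracts group together:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: term frequency weighted by log inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up toy "abstracts": two about superconductors, one about molecules
docs = ["superconductor critical temperature study",
        "superconductor transition temperature measurement",
        "graph neural network for molecules"]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

In practice one would use a library vectorizer and a proper tokenizer, but the grouping logic is the same: similar abstracts score higher than dissimilar ones.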
ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data
1. ChemNLP
A Natural Language Processing based Library for Materials Chemistry Text Data
Kamal Choudhary
https://jarvis.nist.gov/
NIST, Gaithersburg, MD, USA
Polymer group
7/13/2023
4. JARVIS: Databases, Tools, Events, Outreach
Joint Automated Repository for Various Integrated Simulations
https://jarvis.nist.gov
Established: January 2017
Published: >40 articles
Users: >20,000 worldwide
Materials: >80,000, with millions of properties
Events:
• Quantum Matters in Materials Science (QMMS)
• Artificial Intelligence for Materials Science (AIMS)
• JARVIS-School
User comments:
• “There are many different theoretical levels on which you can approach the field. JARVIS is unusual in that it spans more levels than other databases.”
• “A pure gold-mine for the data-quality effort…”
• “Thanks for your generous sharing. Your works inspire me a lot.”
• “You guys are doing something really beneficial…”
• “I find JARVIS-DFT very useful for my research…”
Requires login credentials, free registration
11. Transformers & “Attention Is All You Need”
Handles long-range dependencies much better than RNNs, LSTMs, etc.
Attention gives the model effectively unlimited long-term memory over the input sequence
https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0
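The core operation behind the transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)·V, from "Attention Is All You Need". A stdlib-only sketch with lists of vectors (a teaching illustration, not an optimized implementation):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Tiny example: one query attending over two key/value pairs;
# the query matches the first key, so the output leans toward V[0]
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Because every query attends over all keys at once, the model can relate any two positions in a sequence directly, which is the source of the "extremely long-term memory" noted above.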
16. ChemNLP Webpage for Composition Search
https://jarvis.nist.gov/jarvischemnlp/
17. ChemNLP
Abstractive summarization (Abstract to Title)
Google’s T5 transformer model (225 million parameters)
ROUGE-1 score: 46.5 %
Text generation (Title to Abstract)
GPT-2-medium LLM
ROUGE-1 score: 32 % (26 % without fine-tuning)
ROUGE:
Recall-Oriented Understudy for Gisting Evaluation
T5: Text-to-Text Transfer Transformer
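ROUGE-1 counts overlapping unigrams between a generated text and a reference. A minimal stdlib sketch with naive whitespace tokenization (the scores reported above come from a standard ROUGE package, not this simplification):

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1: clipped unigram overlap between candidate and reference.
    Returns (recall, precision, F1). Naive whitespace tokenization."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each unigram counted at most min(count) times
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / sum(ref.values())       # fraction of reference unigrams recovered
    precision = overlap / sum(cand.values())   # fraction of candidate unigrams that match
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Made-up generated title vs. reference title
r, p, f = rouge1("deep learning for materials text",
                 "machine learning for materials chemistry text")
print(r, p, f)
```

ROUGE is recall-oriented by design (hence the name), which suits summarization: it rewards recovering the reference's content words.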
19. ChemNLP for Superconductors
• Confusion matrix for text classification (137,927 articles): arXiv cond-mat.supr-con and JARVIS-SuperconDB
• Venn diagram for chemical formulas
• ChatGPT response
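A confusion matrix like the one on this slide cross-tabulates true vs. predicted labels. A stdlib sketch with hypothetical labels (the real evaluation uses the 137,927-article arXiv/JARVIS-SuperconDB data, not this toy set):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a len(labels) x len(labels) count matrix; rows = true, cols = predicted."""
    idx = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

# Hypothetical superconductor vs. other abstract labels
labels = ["supercon", "other"]
y_true = ["supercon", "supercon", "other", "other", "other"]
y_pred = ["supercon", "other", "other", "other", "supercon"]
print(confusion_matrix(y_true, y_pred, labels))  # rows: true, cols: predicted
```

The diagonal holds correct predictions; off-diagonal cells show which class gets confused with which, a more informative view than accuracy alone.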
20. JARVIS-Leaderboard: Large Scale Benchmark
Challenges in materials science community:
• Reproducibility
• Transparency
• Validation
• Fidelity
• Data vs. metadata
• What is the ground truth/reference to compare our models to, and how does this change depending on the model?
• Synergy of computational and experimental databases
Community effort to tackle these challenges:
https://pages.nist.gov/jarvis_leaderboard/
22. JARVIS-Leaderboard: Methods and Data
Types of Data:
• Atomic structure (Molecule, Crystal)
• Material Property (Bandgap, bulk modulus)
• Images (Microscopy: SEM, TEM, STM)
• Spectra (Diffraction: X-ray, Neutron, PL)
• Text (Research articles, notebooks, blogs)
• Eigensolver (Quantum Computation algorithms)
Types of Methods:
1) Electronic Structure
2) Artificial Intelligence
3) Force Field
4) Quantum Computation
5) Experiment
23. JARVIS-Leaderboard: Benchmarks
Contributions
1) Electronic Structure
2) Artificial Intelligence
3) Force Field
4) Quantum Computation
5) Experiment
Benchmarks (reference point)
1) Experiment/s
2) Test dataset
3) Electronic Structure
4) Analytical results
5) Other Experiments
Error metrics
*Benchmarks must be well-defined with an associated DOI
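The error metrics used to rank leaderboard contributions against a benchmark reference are standard regression measures such as MAE and RMSE. A stdlib sketch with made-up bandgap values (illustrative only; the leaderboard defines its own metrics per benchmark):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between predictions and reference values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error; penalizes large deviations more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical bandgap predictions (eV) vs. a reference dataset
ref = [1.1, 0.0, 3.4, 2.2]
pred = [1.0, 0.2, 3.0, 2.5]
print(mae(ref, pred), rmse(ref, pred))
```

Reporting the same metric against the same DOI-tagged reference data is what makes contributions from different methods directly comparable.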