Natural language processing for extracting
synthesis recipes and applications to
autonomous laboratories
Anubhav Jain
Lawrence Berkeley National Laboratory
COMBI workshop, Sept 2022
Slides (already) posted to hackingmaterials.lbl.gov
Autonomous labs can benefit from access to external data sets
Automated Labs A, B, and C each run the same loop around a local db:
Plan, Synthesize, Characterize, Analyze
External data sources:
Literature data
+ broad coverage
– difficult to parse
– lack negative examples
Other A-lab data
+ structured data formats
+ negative examples
– not much out there …
Theory data
+ readily available
– difficult to establish relevance to synthesis
The NLP Solution to Literature Data
• A lot of prior experimental data already exists in the literature and would be very costly and labor-intensive to replicate
• The advantage of this data set is broad coverage of materials and techniques
• Disadvantages include:
    • getting access to the data
    • lack of negative examples in the data
    • missing / unreliable information
    • difficulty of obtaining structured data from unstructured text
• Natural language processing can help with the last part, although considerable difficulties are still involved
    • Named entity recognition
        • Identify precursors, amounts, characteristics, etc.
    • Relationship modeling
        • Relate the extracted entities to one another
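As a toy illustration of the two steps above (not the models used in this work), a vocabulary and regex lookup can stand in for named entity recognition, and a "closest preceding amount" rule can stand in for relationship modeling. The sentence, the precursor list, and the function names are all hypothetical.

```python
import re

# Hypothetical precursor vocabulary; a real system would learn these labels.
PRECURSORS = ["Li2CO3", "Fe2O3", "NH4H2PO4"]
# Matches a number followed by a mass or mole unit, e.g. "1.5 g".
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mmol|mol)\b")


def find_entities(text):
    """Step 1 (NER stand-in): return (label, value, start) tuples sorted by position."""
    entities = []
    for name in PRECURSORS:
        for m in re.finditer(re.escape(name), text):
            entities.append(("PRECURSOR", name, m.start()))
    for m in AMOUNT_RE.finditer(text):
        entities.append(("AMOUNT", m.group(0), m.start()))
    return sorted(entities, key=lambda e: e[2])


def link_amounts(entities):
    """Step 2 (relation stand-in): attach each precursor to the closest unused amount before it."""
    relations = {}
    last_amount = None
    for label, value, _ in entities:
        if label == "AMOUNT":
            last_amount = value
        elif label == "PRECURSOR":
            relations[value] = last_amount
            last_amount = None  # an amount is used at most once
    return relations


sentence = "We mixed 1.5 g of Li2CO3 and 3.2 g of Fe2O3, then added NH4H2PO4."
entities = find_entities(sentence)
relations = link_amounts(entities)
print(relations)
```

Even in this toy form, the second step is the fragile one: the entities are easy to spot, but the rule that ties an amount to a precursor breaks as soon as the sentence order changes.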
Previous approach for extracting data from
text
Weston, L. et al. Named Entity Recognition
and Normalization Applied to Large-Scale
Information Extraction from the Materials
Science Literature. J. Chem. Inf. Model.
(2019)
Recently, we also tried BERT variants
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
Models were good for labeling entities, but
didn’t understand relationships
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
A Sequence-to-Sequence Approach
• Language model takes a sequence of tokens as input and
outputs a sequence of tokens
• Maximizes the likelihood of the output conditioned on the input
• Additionally includes task conditioning
• Capacity for “understanding” language as well as “world
knowledge”
• Task conditioning with arbitrary Seq2Seq provides extremely
flexible framework
• Large seq2seq models can generate text that naturally
completes a paragraph
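Written out, the training objective described in the bullets above can be sketched as follows. The symbols are notation introduced here for illustration (x is the input token sequence, y the output token sequence of length T, c the task conditioning, θ the model parameters); the autoregressive factorization is the standard one for models of this kind rather than something stated on the slide.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\,y)} \log p_{\theta}(y \mid x, c),
\qquad
p_{\theta}(y \mid x, c) = \prod_{t=1}^{T} p_{\theta}\left(y_t \mid y_{<t},\, x,\, c\right)
```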
How a sequence-to-sequence approach works
[Figure: text in (“prompt”) -> Seq2Seq model (GPT-3) -> text out (“completion”)]
Another example
[Figure: text in (“prompt”) -> Seq2Seq model (GPT-3) -> text out (“completion”)]
Structured data
[Figure: text in (“prompt”) -> Seq2Seq model (GPT-3) -> text out (“completion”)]
But it’s not perfect for technical data
[Figure: text in (“prompt”) -> Seq2Seq model (GPT-3) -> text out (“completion”)]
A workflow for fine-tuning GPT-3
1. Initial training set of templates
filled via zero-shot Q/A
2. Fine-tune model to fill
templates
3. Predict new set of templates
4. Correct the new templates
5. Add the corrected templates to
the training set
6. Repeat steps 2-5 as necessary
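The six steps above can be sketched as a simple loop. This is only an outline under stated assumptions: `fine_tune`, `predict`, and `correct` are hypothetical caller-supplied callables standing in for a model-training call, a model-inference call, and a human review step, and the toy stand-ins below exist only so the sketch runs end to end.

```python
def annotation_loop(seed_examples, unlabeled_batches, fine_tune, predict, correct):
    """Human-in-the-loop annotation: fine-tune, predict, correct, grow the training set."""
    training_set = list(seed_examples)            # step 1: templates filled via zero-shot Q/A
    model = None
    for batch in unlabeled_batches:               # step 6: repeat steps 2-5 as necessary
        model = fine_tune(training_set)           # step 2: fine-tune on current templates
        drafts = [(text, predict(model, text)) for text in batch]         # step 3
        fixed = [(text, correct(text, draft)) for text, draft in drafts]  # step 4
        training_set.extend(fixed)                # step 5: add corrected templates
    return model, training_set


# Toy stand-ins (hypothetical) for the three callables.
def toy_fine_tune(examples):
    return {"n_examples": len(examples)}


def toy_predict(model, text):
    return {"target": text.split()[0]}


def toy_correct(text, draft):
    return dict(draft, reviewed=True)


seed = [("ZnO was doped with Mn", {"target": "ZnO", "reviewed": True})]
batches = [["TiO2 was annealed"], ["LiFePO4 was milled", "MgO was calcined"]]
model, training_set = annotation_loop(seed, batches, toy_fine_tune, toy_predict, toy_correct)
print(len(training_set), model)
```

The point of the loop is that correcting a model's draft template is usually faster than annotating from scratch, so each round of corrections makes the next round cheaper.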
Templated extraction of synthesis recipes
• Annotate paragraphs to output
structured recipe templates
• JSON-format
• Designed using domain knowledge
from experimentalists
• Template is a relation graph to be
filled in by the model
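A template of this kind might look like the following. The field names and nesting are hypothetical, chosen only to illustrate the idea of an empty relation graph that the model fills in; the actual schema is not given in the slides, and the completion text is made up.

```python
import json

# Hypothetical empty recipe template; field names are illustrative only.
EMPTY_TEMPLATE = {
    "target": {"formula": "", "dopants": []},
    "precursors": [],   # each item: {"formula": "", "amount": ""}
    "steps": [],        # each item: {"operation": "", "temperature": "", "time": ""}
}


def validate(filled, template=EMPTY_TEMPLATE):
    """Parse a completion as JSON and check it has exactly the template's top-level keys."""
    record = json.loads(filled)
    if set(record) != set(template):
        raise ValueError(f"unexpected keys: {sorted(set(record) ^ set(template))}")
    return record


# Made-up completion text standing in for model output.
completion = json.dumps({
    "target": {"formula": "LiFePO4", "dopants": []},
    "precursors": [{"formula": "Li2CO3", "amount": ""}],
    "steps": [{"operation": "heat", "temperature": "700 C", "time": "10 h"}],
})
record = validate(completion)
print(record["target"]["formula"])
```

Because the relations are encoded by nesting (an amount sits inside its precursor, a temperature inside its step), a filled template carries both the entities and how they relate to one another.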
Example Prediction
Performance (work in progress, initial tests)
• Precision: 90%
• Recall: 90%
• F1 Score: 90%
• Transcription: 97%
• Overall: 86%
• F1 accuracy for placing information in the right fields
• Transcription accuracy for putting the right information in said fields
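To make the two kinds of score concrete, here is a minimal sketch that assumes each template is flattened to a field-to-value mapping. Precision, recall, and F1 score whether information was placed in a field that should be filled, while transcription accuracy scores whether the value in a correctly placed field matches the reference. The exact definitions behind the numbers above are not given in the slides, so this is one plausible reading, and the example records are made up.

```python
def score(predicted, reference):
    """Compare two flat {field: value} dicts; an empty string means the field is blank."""
    pred_fields = {k for k, v in predicted.items() if v}
    ref_fields = {k for k, v in reference.items() if v}
    placed = pred_fields & ref_fields   # filled fields that should indeed be filled
    precision = len(placed) / len(pred_fields) if pred_fields else 0.0
    recall = len(placed) / len(ref_fields) if ref_fields else 0.0
    f1 = 2 * precision * recall / (precision + recall) if placed else 0.0
    exact = sum(predicted[k] == reference[k] for k in placed)
    transcription = exact / len(placed) if placed else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "transcription": transcription}


# Made-up example: one value mistyped, one field missed, one field wrongly filled.
reference = {"target": "ZnO", "dopant": "Mn", "temperature": "500 C", "atmosphere": ""}
predicted = {"target": "ZnO", "dopant": "Mn2", "temperature": "", "atmosphere": "air"}
result = score(predicted, reference)
print(result)
```

Here two of the three predicted fields are correctly placed (precision and recall of 2/3 each), and one of those two has the right value (transcription accuracy of 1/2).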
Applied to solid state synthesis / doping
We have performed the first-principles calculations onto the structural,
electronic and magnetic properties of seven 3d transition-metal (TM=V, Cr,
Mn, Fe, Co, Ni and Cu) atom substituting cation Zn in both zigzag (10,0) and
armchair (6,6) zinc oxide nanotubes (ZnONTs). The results show that there
exists a structural distortion around 3d TM impurities with respect to the
pristine ZnONTs. The magnetic moment increases for V-, Cr-doped ZnONTs
and reaches maximum for Mn-doped ZnONTs, and then decreases for Fe-, Co-
, Ni- and Cu-doped ZnONTs successively, which is consistent with the
predicted trend of Hund’s rule for maximizing the magnetic moments of the
doped TM ions. However, the values of the magnetic moments are smaller than
the predicted values of Hund’s rule due to strong hybridization between p
orbitals of the nearest neighbor O atoms of ZnONTs and d orbitals of the TM
atoms. Furthermore, the Mn-, Fe-, Co-, Cu-doped (10,0) and (6,6) ZnONTs
with half-metal and thus 100% spin polarization characters seem to be good
candidates for spintronic applications.
Use in initial hypothesis generation
classifying AuNP
morphologies based
on precursors used
predicting AuNR
aspect ratios based
on amount of AgNO3
in growth solution
predicting doping – if
a material can be
doped with A, can it
be doped with B?
Developing an automated lab (“A-lab”) that makes use
of literature data is in progress
The A-lab facility is designed to handle inorganic
powders
In operation: XRD, robot, box furnaces
Setting up: tube furnace x 4
LBNL bldg. 30
Dosing and mixing
Facility will handle powder-based synthesis of inorganic materials, with automated characterization and experimental planning
Collaboration w/ G. Ceder & H. Kim
Early stages of the facility
Stages: Hardware development, Platform Integration, Automated Synthesis, AI-guided Synthesis, Closed-Loop Materials Discovery
April 2022: Box furnace, XRD, & robots ready
July 2022: Tube furnaces and SEM ready
November 2022: Powder dosing system; first automated syntheses
Summer 2023: AI-guided synthesis
Summer 2024: Closed-loop materials discovery
The continuing challenge – putting it all together!
Currently we are still working on various components
Historical-data
Initial hypotheses
data-api
Acknowledgements
NLP
• Nick Walker
• John Dagdelen
• Alex Dunn
• Sanghoon Lee
• Amalie Trewartha
A-lab
• Rishi Kumar
• Yuxing Fei
• Haegyum Kim
• Gerbrand Ceder
Funding provided by:
• U.S. Department of Energy, Basic Energy Sciences, “D2S2” program
• Toyota Research Institute, Accelerated Materials Design program
• Lawrence Berkeley National Laboratory “LDRD” program

J. Bovas Joel BFSc
 
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Dr NEETHU ASOKAN
 
MCQ in Electrostatics. for class XII pptx
MCQ in Electrostatics. for class XII  pptxMCQ in Electrostatics. for class XII  pptx
MCQ in Electrostatics. for class XII pptx
ArunachalamM22
 
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
PANDURANGLAWATE1
 
MARIGREEN PROJECT - overview, Oana Cristina Pârvulescu
MARIGREEN PROJECT - overview, Oana Cristina PârvulescuMARIGREEN PROJECT - overview, Oana Cristina Pârvulescu
MARIGREEN PROJECT - overview, Oana Cristina Pârvulescu
Faculty of Applied Chemistry and Materials Science
 

Recently uploaded (20)

Pancreas_functional anatomy_enzymes.pptx
Pancreas_functional anatomy_enzymes.pptxPancreas_functional anatomy_enzymes.pptx
Pancreas_functional anatomy_enzymes.pptx
 
Complementary interstellar detections from the heliotail
Complementary interstellar detections from the heliotailComplementary interstellar detections from the heliotail
Complementary interstellar detections from the heliotail
 
Traditional, current and future use of fish and seaweed for fertilisation - ...
Traditional, current and future use of fish and seaweed for fertilisation -  ...Traditional, current and future use of fish and seaweed for fertilisation -  ...
Traditional, current and future use of fish and seaweed for fertilisation - ...
 
SOFIA/HAWC+ FAR-INFRARED POLARIMETRIC LARGE-AREA CMZ EXPLORATION (FIREPLACE) ...
SOFIA/HAWC+ FAR-INFRARED POLARIMETRIC LARGE-AREA CMZ EXPLORATION (FIREPLACE) ...SOFIA/HAWC+ FAR-INFRARED POLARIMETRIC LARGE-AREA CMZ EXPLORATION (FIREPLACE) ...
SOFIA/HAWC+ FAR-INFRARED POLARIMETRIC LARGE-AREA CMZ EXPLORATION (FIREPLACE) ...
 
The Dynamical Origins of the Dark Comets and a Proposed Evolutionary Track
The Dynamical Origins of the Dark Comets and a Proposed Evolutionary TrackThe Dynamical Origins of the Dark Comets and a Proposed Evolutionary Track
The Dynamical Origins of the Dark Comets and a Proposed Evolutionary Track
 
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
Detection of the elusive dangling OH ice features at ~2.7 μm in Chamaeleon I ...
 
Potential of Marine Renewable and Non renewable energy.pptx
Potential of Marine Renewable and Non renewable energy.pptxPotential of Marine Renewable and Non renewable energy.pptx
Potential of Marine Renewable and Non renewable energy.pptx
 
Testing the Son of God Hypothesis (Jesus Christ)
Testing the Son of God Hypothesis (Jesus Christ)Testing the Son of God Hypothesis (Jesus Christ)
Testing the Son of God Hypothesis (Jesus Christ)
 
Biochar impregnation as slow release fertilizer - Violeta Alexandra Ion
Biochar impregnation as slow release fertilizer - Violeta Alexandra IonBiochar impregnation as slow release fertilizer - Violeta Alexandra Ion
Biochar impregnation as slow release fertilizer - Violeta Alexandra Ion
 
Types of Hypersensitivity Reactions.pptx
Types of Hypersensitivity Reactions.pptxTypes of Hypersensitivity Reactions.pptx
Types of Hypersensitivity Reactions.pptx
 
Synopsis: Analysis of a Metallic Specimen
Synopsis: Analysis of a Metallic SpecimenSynopsis: Analysis of a Metallic Specimen
Synopsis: Analysis of a Metallic Specimen
 
Rapid pulse drying of marine biomasses - Sigurd Sannan
Rapid pulse drying of marine biomasses - Sigurd SannanRapid pulse drying of marine biomasses - Sigurd Sannan
Rapid pulse drying of marine biomasses - Sigurd Sannan
 
Celebrity Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl S...
Celebrity Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl S...Celebrity Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl S...
Celebrity Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl S...
 
AlgaeBrew project - Unlocking the potential of microalgae for the valorisatio...
AlgaeBrew project - Unlocking the potential of microalgae for the valorisatio...AlgaeBrew project - Unlocking the potential of microalgae for the valorisatio...
AlgaeBrew project - Unlocking the potential of microalgae for the valorisatio...
 
Simulations of pulsed overpressure jets: formation of bellows and ripples in ...
Simulations of pulsed overpressure jets: formation of bellows and ripples in ...Simulations of pulsed overpressure jets: formation of bellows and ripples in ...
Simulations of pulsed overpressure jets: formation of bellows and ripples in ...
 
Potential of Marine renewable and Non renewable energy.pptx
Potential of Marine renewable and Non renewable energy.pptxPotential of Marine renewable and Non renewable energy.pptx
Potential of Marine renewable and Non renewable energy.pptx
 
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
Bioconversion of sago waste and oil cakes into biobutanol using Environmental...
 
MCQ in Electrostatics. for class XII pptx
MCQ in Electrostatics. for class XII  pptxMCQ in Electrostatics. for class XII  pptx
MCQ in Electrostatics. for class XII pptx
 
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
[1] Data Mining - Concepts and Techniques (3rd Ed).pdf
 
MARIGREEN PROJECT - overview, Oana Cristina Pârvulescu
MARIGREEN PROJECT - overview, Oana Cristina PârvulescuMARIGREEN PROJECT - overview, Oana Cristina Pârvulescu
MARIGREEN PROJECT - overview, Oana Cristina Pârvulescu
 

Natural language processing for extracting synthesis recipes and applications to autonomous laboratories

  • 1. Natural language processing for extracting synthesis recipes and applications to autonomous laboratories Anubhav Jain Lawrence Berkeley National Laboratory COMBI workshop, Sept 2022 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Autonomous labs can benefit from access to external data sets [Diagram: Automated Labs A, B, and C, each with Plan, Synthesize, Characterize, and Analyze steps and a local db, drawing on three external data sources] • Literature data: + broad coverage; – difficult to parse; – lack negative examples • Other A-lab data: + structured data formats; + negative examples; – not much out there … • Theory data: + readily available; – difficult to establish relevance to synthesis
  • 4. The NLP Solution to Literature Data • A lot of prior experimental data already exists in the literature that would take untold costs and labor to replicate • Advantages of this data set are broad coverage of materials and techniques • Disadvantages include: • getting access to the data • lack of negative examples in the data • missing / unreliable information • difficulty of obtaining structured data from unstructured text • Natural language processing can help with the last part, although considerable difficulties are still involved • Named entity recognition • Identify precursors, amounts, characteristics, etc. • Relationship modeling • Relate the extracted entities to one another
  • 5. Previous approach for extracting data from text • Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019) • Recently, we also tried BERT variants: Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  • 6. Models were good for labeling entities, but didn’t understand relationships • Named Entity Recognition • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. • Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  • 7. A Sequence-to-Sequence Approach • Language model takes a sequence of tokens as input and outputs a sequence of tokens • Maximizes the likelihood of the output conditioned on the input • Additionally includes task conditioning • Capacity for “understanding” language as well as “world knowledge” • Task conditioning with arbitrary Seq2Seq provides an extremely flexible framework • Large seq2seq models can generate text that naturally completes a paragraph
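The likelihood bullet on this slide can be written out as a training objective. The form below is a standard way to express it and is not taken from the slides; the symbols are notation introduced here: x is the input token sequence, y the output sequence, c the task conditioning, and θ the model parameters.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\, y,\, c)} \sum_{t=1}^{|y|}
  \log p_{\theta}\!\left(y_t \mid y_{<t},\, x,\, c\right)
```

Each output token is scored given the input, the task conditioning, and the output tokens generated before it, which is what allows the same model to complete a paragraph or fill a template depending on the prompt.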
  • 8. How a sequence-to-sequence approach works [Figure: Seq2Seq model (GPT-3); text in (“prompt”), text out (“completion”)]
  • 9. Another example [Figure: Seq2Seq model (GPT-3); text in (“prompt”), text out (“completion”)]
  • 10. Structured data [Figure: Seq2Seq model (GPT-3); text in (“prompt”), text out (“completion”)]
  • 11. But it’s not perfect for technical data [Figure: Seq2Seq model (GPT-3); text in (“prompt”), text out (“completion”)]
  • 12. A workflow for fine-tuning GPT-3 1. Initial training set of templates filled via zero-shot Q/A 2. Fine-tune model to fill templates 3. Predict new set of templates 4. Correct the new templates 5. Add the corrected templates to the training set 6. Repeat steps 2-5 as necessary
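The six steps above amount to a human-in-the-loop annotation cycle. The sketch below shows one way that control flow could look; it is not the authors' code. The names `annotation_loop`, `fine_tune`, and `correct` are hypothetical, and the toy stand-ins at the bottom replace the real GPT-3 fine-tuning call and the human correction step so the loop can be run as written.

```python
from typing import Callable, Dict, List, Tuple

Template = Dict[str, object]
Example = Tuple[str, Template]        # (paragraph text, filled template)
Model = Callable[[str], Template]     # maps a paragraph to a filled template


def annotation_loop(
    seed_examples: List[Example],
    unlabeled_batches: List[List[str]],
    fine_tune: Callable[[List[Example]], Model],
    correct: Callable[[str, Template], Template],
) -> Tuple[List[Example], Model]:
    """Run the fine-tune / predict / correct / extend cycle once per batch."""
    training_set = list(seed_examples)                 # step 1: initial templates
    model = fine_tune(training_set)                    # step 2: fine-tune
    for batch in unlabeled_batches:
        predicted = [(text, model(text)) for text in batch]               # step 3
        corrected = [(text, correct(text, tpl)) for text, tpl in predicted]  # step 4
        training_set.extend(corrected)                 # step 5: grow training set
        model = fine_tune(training_set)                # step 6: repeat from step 2
    return training_set, model


# Toy stand-ins (assumptions for illustration only).
def toy_fine_tune(examples: List[Example]) -> Model:
    memory = dict(examples)           # "model" that only memorizes its training set
    return lambda text: memory.get(text, {"target": None})


def toy_correct(text: str, template: Template) -> Template:
    fixed = dict(template)
    fixed["target"] = text.split()[0]  # pretend an annotator fills in the target
    return fixed


seed = [("ZnO was calcined at 500 C", {"target": "ZnO"})]
batches = [["TiO2 was annealed in air"], ["LiCoO2 was sintered twice"]]
training_set, model = annotation_loop(seed, batches, toy_fine_tune, toy_correct)
```

Passing the fine-tuning and correction steps in as callables keeps the loop independent of any particular model API, so the same skeleton works whether correction is done by hand or by a review tool.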
  • 13. Templated extraction of synthesis recipes • Annotate paragraphs to output structured recipe templates • JSON-format • Designed using domain knowledge from experimentalists • Template is relation graph to be filled in by model
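To make the idea of a JSON recipe template concrete, here is a minimal sketch of parsing a model completion and checking it against a template. The field names (`target`, `precursors`, `operations`) and the example values are assumptions chosen for illustration; the slides do not give the actual schema.

```python
import json

# Hypothetical template: field name -> expected JSON type.
TEMPLATE_FIELDS = {"target": str, "precursors": list, "operations": list}


def parse_completion(completion: str) -> dict:
    """Parse a model completion and check it against the template fields."""
    record = json.loads(completion)
    if not isinstance(record, dict):
        raise ValueError("completion must be a JSON object")
    for field, expected_type in TEMPLATE_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return record


# Made-up completion for demonstration only.
example_completion = """
{
  "target": "LiCoO2",
  "precursors": [{"name": "Li2CO3"}, {"name": "Co3O4"}],
  "operations": [{"type": "heat", "temperature": "900 C"}]
}
"""
recipe = parse_completion(example_completion)
```

Nesting precursors and operations under the target is one way to represent the relation graph mentioned on the slide: the structure itself records which entities belong together.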
  • 15. Performance (work in progress, initial tests) • Precision: 90% • Recall: 90% • F1 Score: 90% • Transcription: 97% • Overall: 86% • F1 accuracy for placing information in the right fields • Transcription accuracy for putting the right information in said fields
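The two kinds of score on this slide can be illustrated with a small sketch. This is one possible reading of the definitions, not the authors' evaluation code: precision, recall, and F1 are computed over which fields were filled, and transcription accuracy over the values in correctly placed fields. The function name `field_scores` and the example records are hypothetical.

```python
from typing import Dict


def field_scores(predicted: Dict[str, str], gold: Dict[str, str]) -> Dict[str, float]:
    """Score field placement (precision/recall/F1) and value transcription."""
    pred_fields, gold_fields = set(predicted), set(gold)
    shared = pred_fields & gold_fields      # fields placed correctly
    precision = len(shared) / len(pred_fields) if pred_fields else 0.0
    recall = len(shared) / len(gold_fields) if gold_fields else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # Transcription: among correctly placed fields, how many values match exactly.
    exact = sum(predicted[f] == gold[f] for f in shared)
    transcription = exact / len(shared) if shared else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "transcription": transcription}


# Made-up records: one field misplaced, one value mistranscribed (0 instead of O).
gold = {"target": "LiCoO2", "precursor": "Li2CO3", "temperature": "900 C"}
predicted = {"target": "LiCoO2", "precursor": "Li2C03", "atmosphere": "air"}
scores = field_scores(predicted, gold)
```

Because F1 is the harmonic mean 2PR/(P+R), equal precision and recall give an F1 of the same value, which matches the 90% / 90% / 90% figures reported above.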
  • 16. Applied to solid state synthesis / doping We have performed the first-principles calculations onto the structural, electronic and magnetic properties of seven 3d transition-metal (TM=V, Cr, Mn, Fe, Co, Ni and Cu) atom substituting cation Zn in both zigzag (10,0) and armchair (6,6) zinc oxide nanotubes (ZnONTs). The results show that there exists a structural distortion around 3d TM impurities with respect to the pristine ZnONTs. The magnetic moment increases for V-, Cr-doped ZnONTs and reaches maximum for Mn-doped ZnONTs, and then decreases for Fe-, Co- , Ni- and Cu-doped ZnONTs successively, which is consistent with the predicted trend of Hund’s rule for maximizing the magnetic moments of the doped TM ions. However, the values of the magnetic moments are smaller than the predicted values of Hund’s rule due to strong hybridization between p orbitals of the nearest neighbor O atoms of ZnONTs and d orbitals of the TM atoms. Furthermore, the Mn-, Fe-, Co-, Cu-doped (10,0) and (6,6) ZnONTs with half-metal and thus 100% spin polarization characters seem to be good candidates for spintronic applications.
  • 17. Use in initial hypothesis generation • classifying AuNP morphologies based on precursors used • predicting AuNR aspect ratios based on amount of AgNO3 in growth solution • predicting doping: if a material can be doped with A, can it be doped with B?
  • 18. Developing an automated lab (“A-lab”) that makes use of literature data is in progress [Diagram: Automated Labs A, B, and C, each with Plan, Synthesize, Characterize, and Analyze steps and a local db, drawing on three external data sources] • Literature data: + broad coverage; – difficult to parse; – lack negative examples • Other A-lab data: + structured data formats; + negative examples; – not much out there … • Theory data: + readily available; – difficult to establish relevance to synthesis
  • 19. The A-lab facility is designed to handle inorganic powders • In operation: XRD, robot, box furnaces • Setting up: tube furnace x 4 • LBNL bldg. 30 • Dosing and mixing • Facility will handle powder-based synthesis of inorganic materials, with automated characterization and experimental planning • Collaboration w/ G. Ceder & H. Kim • Roadmap stages: Hardware development, Platform Integration, Automated Synthesis, AI-guided Synthesis, Closed-Loop Materials Discovery • April 2022: box furnace, XRD, & robots ready • July 2022: tube furnaces and SEM ready • November 2022: powder dosing system, first automated syntheses • Summer 2023: AI-guided synthesis • Summer 2024: closed-loop materials discovery
  • 20. Early stages of the facility
  • 21. The continuing challenge – putting it all together! Currently we are still working on various components Historical-data Initial hypotheses data-api
  • 22. Acknowledgements • NLP: Nick Walker, John Dagdelen, Alex Dunn, Sanghoon Lee, Amalie Trewartha • A-lab: Rishi Kumar, Yuxing Fei, Haegyum Kim, Gerbrand Ceder • Funding provided by: • U.S. Department of Energy, Basic Energy Sciences, “D2S2” program • Toyota Research Institute, Accelerated Materials Design program • Lawrence Berkeley National Laboratory “LDRD” program