Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature

Anubhav Jain
Lawrence Berkeley National Laboratory
MRS Fall meeting, Nov 2022
Slides (already) posted to hackingmaterials.lbl.gov

Literature data can be a key source of materials learning
[Diagram: Automated Lab A, Conventional Lab B, and Automated Lab C each run a Plan → Synthesize → Characterize → Analyze loop; the two automated labs also maintain a local db + ML.]
Literature data
+ broad coverage
– difficult to parse
– lack negative examples
– reproducibility
Other A-lab data
+ structured data formats
+ negative examples
– not much out there …
Theory data
+ readily available
– difficult to establish relevance to synthesis
– computation time

Several research groups are now attempting to
collect data sets from the research literature
Weston, L. et al. Named Entity Recognition
and Normalization Applied to Large-Scale
Information Extraction from the Materials
Science Literature. J. Chem. Inf. Model.
(2019)
Recently, we also tried BERT variants
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.

Models were good for labeling entities, but
didn’t understand relationships
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
Relationships have usually been extracted
via either manual or semi-automated
regular expression construction along
with grammar tree analysis, e.g.
ChemDataExtractor – can be tedious!

Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data

A Sequence-to-Sequence Approach
• Language model takes a sequence of tokens
as input and outputs a sequence of tokens
• Maximizes the likelihood of the output
conditioned on the input
• Additionally includes task conditioning, which can
learn the desired format for outputs
• We’ve now done many explorations with
OpenAI’s GPT-3, which has 175 billion
parameters
• We interact with the model through their (paid) API,
although costs are relatively modest
• Capacity for “understanding” language as well
as “world knowledge”
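
In symbols, the objective described in the bullets above can be sketched as follows. The notation is introduced here for illustration and is not taken from the slides: x is the input token sequence (including any task conditioning), y is the output token sequence of length T, and θ denotes the model parameters.

```latex
\hat{\theta} = \arg\max_{\theta} \sum_{(x,\, y)} \log p_\theta(y \mid x),
\qquad
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, x\right)
```

The product form assumes an autoregressive model that emits one token at a time, which is how GPT-style models generate a completion.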

How a sequence-to-sequence approach works
[Diagram: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]

Another example
[Figure: Seq2Seq model (GPT-3)]

Structured data
[Figure: Seq2Seq model (GPT-3)]

But it’s not perfect for technical data
[Figure: Seq2Seq model (GPT-3)]

A workflow for fine-tuning GPT-3
1. Initial training set of templates
filled mostly manually, as zero-
shot GPT is often poor for
technical tasks
2. Fine-tune model to fill
templates, use the model to
assist in annotation
3. Repeat as necessary until
desired inference accuracy is
achieved
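
Step 1 of this workflow produces paragraph/template pairs. A minimal sketch of how such pairs might be packed into training records is shown below, assuming the prompt/completion JSONL layout used for GPT-3 fine-tuning. The end-of-prompt and end-of-completion markers, the helper names, and the example paragraph and template keys are all illustrative assumptions, not the authors' actual choices.

```python
import json

# Assumed markers (illustrative choices, not taken from the slides).
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND\n\n"


def to_finetune_record(paragraph, filled_template):
    """Pair one paragraph with its filled template as a prompt/completion record."""
    return {
        "prompt": paragraph.strip() + PROMPT_END,
        "completion": " " + json.dumps(filled_template, sort_keys=True) + COMPLETION_END,
    }


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


# Hypothetical annotated example; the template keys are placeholders.
paragraph = "BiFeO3 powders were prepared by a sol-gel route."
template = {"material": "BiFeO3", "method": "sol-gel"}

record = to_finetune_record(paragraph, template)
jsonl_text = to_jsonl([record])
```

In steps 2 and 3, the fine-tuned model would pre-fill templates for new paragraphs, an annotator would correct them, and the corrected pairs would be appended to the same file before the next round of fine-tuning.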

This procedure can extract complex,
hierarchical relationships between entities

Outline
• Analyzing synthesis of phase-pure BiFeO3 using literature data

Templated extraction of synthesis recipes
• Annotate paragraphs to output
structured recipe templates
• JSON-format
• Designed using domain knowledge
from experimentalists
• Template is relation graph to be
filled in by model
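
To make the idea of a JSON recipe template concrete, here is a hypothetical empty template and a helper that lists the slots a model would be asked to fill. Every field name below is an assumption made for illustration; the actual templates were designed with experimentalists and are not reproduced on the slides.

```python
import json

# Hypothetical template; every field name here is an illustrative assumption.
EMPTY_TEMPLATE = {
    "target": {"material": "", "morphology": ""},
    "precursors": [{"name": "", "amount": "", "unit": ""}],
    "steps": [{"action": "", "temperature": "", "duration": ""}],
}


def template_paths(node, prefix=""):
    """List every leaf slot in the template as a dotted path."""
    if isinstance(node, dict):
        paths = []
        for key, value in node.items():
            paths.extend(template_paths(value, f"{prefix}{key}."))
        return paths
    if isinstance(node, list):
        paths = []
        for i, item in enumerate(node):
            paths.extend(template_paths(item, f"{prefix}{i}."))
        return paths
    return [prefix.rstrip(".")]


slots = template_paths(EMPTY_TEMPLATE)
print(json.dumps(EMPTY_TEMPLATE, indent=2))
```

The nesting is what encodes the relationships: an amount belongs to a particular precursor, and a temperature belongs to a particular step, rather than floating free as an isolated entity.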

Example Extraction for Au nanorod synthesis
Note: we are still formally evaluating performance; there are various
issues in getting an accurate evaluation, e.g., predictions
that are functionally correct but written differently
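
One way to reduce the "correct but written differently" problem is to normalize values before comparing them. The sketch below is only an assumption about how such a comparison could look, not the evaluation procedure used in this work; the field names and values are hypothetical.

```python
import re


def normalize(value):
    """Lowercase, trim, and drop whitespace and trailing periods."""
    text = str(value).strip().lower()
    text = re.sub(r"\s+", "", text)
    return text.rstrip(".")


def field_accuracy(predicted, reference):
    """Fraction of reference fields whose normalized values match the prediction."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for key, ref in reference.items()
        if key in predicted and normalize(predicted[key]) == normalize(ref)
    )
    return hits / len(reference)


# Hypothetical field names and values, for illustration only.
reference = {"volume": "10 mL", "surfactant": "CTAB"}
predicted = {"volume": "10ml", "surfactant": "ctab"}
score = field_accuracy(predicted, reference)
```

An exact string match would score these two records as entirely different. Note that lowercasing is a blunt tool for chemistry (it would conflate Co and CO, for instance), so a real evaluation would likely need domain-aware normalization.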

Analyzing AuNR synthesis data set
Note that this data set was collected manually via hand-tuned
regular expressions, not NLP or GPT-3, as it was done in parallel
to that work.
We are currently looking at the pros/cons of the manual approach vs.
the GPT-3 approach.
Representing recipes as precursor vectors for machine learning
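
A minimal sketch of one possible precursor-vector encoding follows: each recipe becomes a fixed-length list with one position per precursor. The precursor names are the ones that appear in the growth-solution labels elsewhere in this deck; the ordering, the function name, and the example amounts are illustrative assumptions.

```python
# Precursor names appear in this deck's growth-solution labels;
# the ordering and the example amounts are illustrative assumptions.
PRECURSORS = ["HAuCl4", "CTAB", "AA", "AgNO3", "HCl"]


def recipe_to_vector(recipe, vocabulary=PRECURSORS):
    """Map {precursor: amount} to a fixed-length list; absent precursors become 0.0."""
    unknown = set(recipe) - set(vocabulary)
    if unknown:
        raise KeyError(f"precursors not in vocabulary: {sorted(unknown)}")
    return [float(recipe.get(name, 0.0)) for name in vocabulary]


# Made-up amounts (arbitrary units) purely to show the shape of the output.
example_recipe = {"HAuCl4": 0.5, "CTAB": 100.0, "AA": 0.8}
vector = recipe_to_vector(example_recipe)
```

Fixing the vocabulary order up front means every recipe maps to the same feature positions, which is what tabular models such as decision trees expect.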

Training a decision tree to predict AuNR
shape shows similar conclusions as literature
[Figure: decision tree for AuNR shape with leaves labeled Rod, Cube, Bipyramid, Star, and None]
• Decision tree shows seed capping
agent type as first decision
boundary for shape determination
• “Citrate-capped gold seeds form
penta-twinned structure, while
CTAB-capped seeds are single
crystalline, hence former leads to
bipyramids and latter leads to
rods”1,2
1 Liu and Guyot-Sionnest, J. Phys. Chem. B, 2005, 109 (47), 22192-22200
2 Grzelczak et al., Chem. Soc. Rev., 2008, 37, 1783-1791
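
To illustrate how a decision tree picks its first decision boundary, the sketch below scores each categorical feature by the weighted Gini impurity of the groups it creates and selects the lowest. It is a simplified multiway split in plain Python, not the tree implementation used for the slides, and the four toy rows are invented for illustration only; they are not drawn from the literature data set.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after partitioning on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())


def best_first_split(rows, labels):
    """Return the feature whose split gives the lowest weighted impurity."""
    return min(rows[0], key=lambda f: split_impurity(rows, labels, f))


# Toy rows invented for illustration only; they are not the literature data set.
rows = [
    {"seed_capping": "CTAB", "additive": "AgNO3"},
    {"seed_capping": "CTAB", "additive": "none"},
    {"seed_capping": "citrate", "additive": "AgNO3"},
    {"seed_capping": "citrate", "additive": "none"},
]
labels = ["rod", "rod", "bipyramid", "bipyramid"]

root_feature = best_first_split(rows, labels)
```

In this toy set the seed capping agent separates the shapes perfectly, so it is chosen as the root, mirroring the qualitative pattern described in the bullets above.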

We also see some effect of AgNO3
concentration on AuNR size, but data is noisy
N. D. Burrows et al., Langmuir 2017 33 (8), 1891-1907
growth: HAuCl4, CTAB, AA, AgNO3
growth: HAuCl4, CTAB, AA, AgNO3 w/ HAuCl4/CTAB<0.01 filter
growth: HAuCl4, CTAB, AA, AgNO3 + HCl
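
The HAuCl4/CTAB<0.01 filter in the second label can be sketched as a simple predicate over recipes. This assumes both amounts are recorded in the same concentration units, and it assumes that recipes without CTAB are excluded; neither detail is stated on the slide, and the example amounts are made up.

```python
def passes_ratio_filter(recipe, threshold=0.01):
    """True when the HAuCl4/CTAB ratio is below the threshold."""
    ctab = recipe.get("CTAB", 0.0)
    if ctab <= 0:
        return False  # ratio undefined without CTAB; excluded by assumption
    return recipe.get("HAuCl4", 0.0) / ctab < threshold


# Made-up amounts in a shared (arbitrary) concentration unit.
recipes = [
    {"HAuCl4": 0.5, "CTAB": 100.0},  # ratio 0.005, kept
    {"HAuCl4": 0.5, "CTAB": 10.0},   # ratio 0.05, dropped
    {"HAuCl4": 0.5},                 # no CTAB, dropped
]
kept = [r for r in recipes if passes_ratio_filter(r)]
```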

Overall thoughts on AuNR data set
• The seq2seq method is showing good capabilities in terms of
extracting complex nanorod synthesis data
• We are going to start integrating this into our own pipeline to replace
manual regex for relationship extraction
• Using machine learning to generate hypotheses about
AuNR shape and size is messy
• Data sets are messy, and not particularly large
• Nevertheless, it is encouraging that conclusions from the
literature can be automatically found by machine learning

Outline
• Analyzing synthesis of phase-pure BiFeO3 using literature
data

Seq2Seq approach for solid state synthesis
Initial tests of the seq2seq method on solid state synthesis have shown encouraging results, but the method needs further testing

For now, we use manual data extraction to
tackle the problem of BiFeO3 synthesis
340 total synthesis recipes (from 178 articles); 57 features per recipe

Machine learning (decision tree) predictions
are in line with common knowledge

Missing synthesis information – can it be
recovered / reproduced easily?
Could not reproduce
Partially reproducible
Reproducible

Exploring unexplored portions of synthesis
space
These decision trees are interpretable, but are they physical?

Conclusions
• As large language models grow larger and more capable, they are able to parse
increasingly complex scientific text into structured formats
• Applying NLP + ML on synthesis data sets shows that scientific heuristics can be
automatically uncovered, which is promising
• Nevertheless, issues remain in applying NLP to predictive synthesis
• Reproducibility / missing information / conflicting information
• General lack of negative examples
• Unknown data quality
• Thus, results from such techniques will likely need to be treated as initial
hypotheses to be complemented by further experiments

Acknowledgements
NLP (seq2seq)
• Alex Dunn
• John Dagdelen
• Nick Walker
• Sanghoon Lee
• Amalie Trewartha
Funding provided by:
• U.S. Department of Energy, Basic Energy Sciences, “D2S2” program
• Toyota Research Institute, Accelerated Materials Design program
AuNR analysis
• Sanghoon Lee
• Sam Gleason
• Kevin Cruse
BiFeO3 analysis
• Kevin Cruse
• Viktoriia Baibakova
• Maged Abdelsamie
• Kootak Hong
• Carolin Sutter-Fella
• Gerbrand Ceder

Sol-gel synthesis of BiFeO3
