This document discusses using natural language processing (NLP) techniques to extract useful information from unstructured text sources in materials science literature. It describes how NLP models can be trained on large datasets of materials science publications to perform tasks like chemistry-aware search, summarizing material properties, and suggesting synthesis methods. The models are developed using techniques like word embeddings, LSTM networks, and named entity recognition. The goal is to organize materials science knowledge from text into a database called Matscholar to enable new applications of the information.
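As a hedged illustration of the word-embedding step (a minimal sketch using gensim, not the actual Matscholar training code; the tiny corpus here is a stand-in):

```python
from gensim.models import Word2Vec

# Stand-in corpus: in practice this would be millions of tokenized
# materials-science abstracts, not two toy sentences.
abstracts = [
    ["LiFePO4", "is", "a", "promising", "cathode", "material"],
    ["Bi2Te3", "exhibits", "excellent", "thermoelectric", "performance"],
]

model = Word2Vec(sentences=abstracts, vector_size=100, window=8, min_count=1, seed=0)

# Terms whose vectors lie nearest to an application keyword act as
# candidate materials/concepts for that application.
print(model.wv.most_similar("thermoelectric", topn=3))
```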
Assessing Factors Underpinning PV Degradation through Data Analysis - Anubhav Jain
The document discusses using PVPRO methods and large-scale data analysis to distinguish system and module degradation in PV systems. It involves 3 main tasks: 1) Developing an algorithm to detect off-maximum power point operation and compare it to existing tools. 2) Applying PVPRO to additional datasets to refine methods and perform degradation analysis on 25 large PV systems. 3) Connecting bill-of-materials data to degradation results from accelerated stress tests through data-driven analysis and publishing findings while anonymizing data.
Extracting and Making Use of Materials Data from Millions of Journal Articles... - Anubhav Jain
- The document discusses using natural language processing techniques to extract materials data from millions of journal articles.
- It aims to organize the world's information on materials science by using NLP models to extract useful data from unstructured text sources like research literature in an automated manner.
- The process involves collecting raw text data, developing machine learning models to extract entities and relationships, and building search interfaces to make the extracted data accessible.
Progress Towards Leveraging Natural Language Processing for Collecting Experi... - Anubhav Jain
1. The document discusses using natural language processing (NLP) algorithms to extract useful information from unstructured text sources in materials science literature to help organize the world's materials science information and enable new search and analysis capabilities.
2. It describes a project called Matscholar that applies NLP techniques like named entity recognition and relation extraction to millions of article abstracts to build a searchable database with summarized materials property and application data.
3. The approach involves collecting text sources, developing machine learning models trained on annotated examples to extract entities and relations, and integrating the extracted structured data with materials property databases to enable new search and analysis functions.
Evaluating Machine Learning Algorithms for Materials Science using the Matben... - Anubhav Jain
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
Open Source Tools for Materials Informatics - Anubhav Jain
This document discusses open source tools for materials informatics, including Matminer and Matscholar. Matminer is a library of descriptors for materials science data that can generate features for machine learning models. It includes over 60 featurizer classes and supports scikit-learn. Matscholar applies natural language processing to over 2 million materials science abstracts to extract keywords and enable improved literature searching. The document argues that open datasets like Matbench and automated tools like Automatminer could help lower barriers for developing machine learning models in materials science by making it easier to obtain training data and evaluate model performance.
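For context, matminer's featurizer API looks roughly like the following sketch (the Magpie preset and the Fe2O3 composition are just illustrative choices):

```python
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

# The "magpie" preset computes statistics of elemental properties
# (mean, range, etc.) over a composition.
featurizer = ElementProperty.from_preset("magpie")

comp = Composition("Fe2O3")
features = featurizer.featurize(comp)   # plain list of floats, scikit-learn ready
labels = featurizer.feature_labels()
print(len(features), labels[:3])
```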
The document provides an overview of materials informatics and the Materials Genome Initiative. It discusses how materials informatics uses data-driven approaches and techniques from fields like signal processing, machine learning and statistics to generate structure-property-processing linkages from materials science data and improve understanding of materials behavior. This includes extracting features from materials microstructure, using statistical analysis and data mining to discover relationships and create predictive models, and evaluating how knowledge has improved.
Natural Language Processing for Materials Design - What Can We Extract From t... - Anubhav Jain
This document discusses using natural language processing (NLP) techniques to extract and organize information from the materials science literature. It describes how NLP models can recognize named entities like materials, properties, and methods in text. Word embedding algorithms represent words as vectors to encode semantic relationships. Long short-term memory networks then classify words by context. The resulting models can automatically label millions of papers, enabling new search and predictive applications. Predictions based on word co-occurrence have inspired further experimental study of promising materials. The Matscholar team is developing comprehensive NLP tools to advance materials science research.
The Status of ML Algorithms for Structure-property Relationships Using Matb... - Anubhav Jain
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
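A minimal sketch of the Matbench submission loop, using a trivial scikit-learn baseline in place of a real model (the chosen task subset is illustrative; see the matbench docs for the authoritative API):

```python
from matbench.bench import MatbenchBenchmark
from sklearn.dummy import DummyRegressor

mb = MatbenchBenchmark(autoload=False, subset=["matbench_expt_gap"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:  # the nested cross-validation folds
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        # Trivial baseline: always predict the training mean.
        model = DummyRegressor().fit([[0]] * len(train_outputs), train_outputs)
        test_inputs = task.get_test_data(fold, include_target=False)
        task.record(fold, model.predict([[0]] * len(test_inputs)))

# Serialized results of this form are what get submitted to the leaderboard.
mb.to_file("results.json.gz")
```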
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu... - Anubhav Jain
- The document describes a computational materials design pipeline that uses theory, optimization, and natural language processing (NLP) to accelerate materials discovery.
- Key components of the pipeline include optimization algorithms like Rocketsled to find best materials solutions with fewer calculations, and NLP tools to extract and analyze knowledge from literature to predict promising new materials and benchmarks.
- The pipeline has shown speedups of 15-30x over random searches and has successfully predicted new thermoelectric materials discoveries 1-2 years before their reporting in literature.
Materials design using knowledge from millions of journal articles via natura... - Anubhav Jain
This document discusses natural language processing (NLP) techniques for materials design using information from millions of journal articles. It begins with an overview of how materials are typically discovered and optimized over decades before discussing how NLP could help address this challenge. The document then provides a high-level view of how NLP is used to extract and analyze information from millions of materials science abstracts, including data collection, tokenization, training machine learning models on labeled text, and using the models to automatically extract entities. Examples are given of how word embeddings can encode scientific concepts and relationships in ways that allow predicting promising new materials for applications like thermoelectrics. The talk concludes by discussing future directions for the NLP work.
Discovering advanced materials for energy applications by mining the scientif... - Anubhav Jain
This document discusses natural language processing (NLP) techniques for extracting materials-related information from scientific literature. It describes how Matscholar uses NLP to analyze over 4 million paper abstracts, identifying entities like materials, properties, and methods. Key steps include tokenizing text, training word embeddings, and using an LSTM neural network to recognize entities in context. Applications include searching materials by property and predicting promising new materials for applications based on word vector relationships. Future work aims to improve predictions for new compositions and automatically generate databases of materials properties from literature.
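To make the LSTM entity-recognition step concrete, here is a minimal bidirectional LSTM tagger sketched in PyTorch (an assumption-laden toy, not the Matscholar model; vocabulary size and tag count are made up):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence tagger: token ids -> per-token entity logits."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_tags=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # could be seeded from word2vec
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)  # e.g., BIO tags for MAT/PRO/...

    def forward(self, token_ids):                 # (batch, seq_len)
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        context, _ = self.lstm(embedded)          # (batch, seq_len, 2*hidden_dim)
        return self.classifier(context)           # (batch, seq_len, num_tags)

# Smoke test with random token ids.
model = BiLSTMTagger(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 7])
```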
TMS workshop on machine learning in materials science: Intro to deep learning... - BrianDeCost
This presentation is intended as a high-level introduction to deep learning and its applications in materials science. The intended audience is materials scientists and engineers.
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
1. Materials Informatics uses Python tools like RDKit for analyzing molecular structures and properties.
2. ORGAN and MolGAN are two generative models that use GANs to generate novel molecular structures based on SMILES strings, with ORGAN incorporating reinforcement learning to optimize for desired properties.
3. Tools like RDKit enable analyzing molecular fingerprints and descriptors that can be used for machine learning applications in materials informatics.
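A short sketch of the RDKit descriptor/fingerprint workflow mentioned above (ethanol is an arbitrary example molecule):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Scalar descriptors usable directly as ML features.
print(Descriptors.MolWt(mol), Descriptors.MolLogP(mol))

# Morgan (circular) fingerprint as a fixed-length bit vector.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits())
```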
Accelerating materials design through natural language processing - Anubhav Jain
This document discusses using natural language processing (NLP) to accelerate materials design. It describes how NLP techniques are being used to analyze over 4 million materials science papers to extract entities like materials, characterization methods, and properties. Word embedding algorithms represent words as vectors to capture relationships between words. NLP models are then trained on labeled text to recognize these entities. This allows automated searching of literature and predicting promising new materials for applications like thermoelectrics based on co-occurrence patterns in text. Future work includes developing structured materials databases from literature and learning embeddings to describe arbitrary materials.
Open-source tools for generating and analyzing large materials data sets - Anubhav Jain
This document discusses open-source software tools for generating and analyzing large materials data sets developed by Anubhav Jain and collaborators. It summarizes several software packages including pymatgen for materials analysis, FireWorks for scientific workflows, custodian for error recovery in calculations, and matminer for data mining. Applications of the tools include generating the Materials Project database containing properties of over 65,000 materials compounds calculated using high-performance computing resources. The document emphasizes the importance of open-source collaborative software development and automation to accelerate materials discovery.
Materials discovery through theory, computation, and machine learning - Anubhav Jain
The document discusses using theory, computation, and machine learning to discover new materials. It summarizes that density functional theory (DFT) can model material properties from first principles, and how DFT calculations have been automated and run on supercomputers to enable high-throughput screening of materials. Examples are given of computations predicting new materials that were later experimentally confirmed, like sidorenkite cathodes for sodium ion batteries. Related projects are outlined like the open-source Materials Project database of DFT data on over 85,000 materials and software libraries to support high-throughput computation and materials science. Text mining of scientific literature is also discussed to help predict new materials in advance.
This document summarizes work on developing clear sky detection methods and photovoltaic data analytics tools. It describes collaborating with NREL and kWh Analytics to build a robust clear sky detection method for the RdTools software. The goal is to automatically learn the best parameters for the PVLib clear sky model by comparing its labels to known clear sky labels from satellite data. It also discusses developing open-source software to analyze string-level I-V curves collected by Sandia National Labs to detect mismatching and extract IV parameters. The work aims to help researchers by providing data management, analytics and predictive modeling through a DuraMat Data Hub.
Smart Metrics for High Performance Material Design - aimsnist
This document discusses smart metrics for high-performance material design using density functional theory (DFT), classical force fields (FF), and machine learning (ML). It provides an overview of the JARVIS database and tools containing over 35,000 materials and classical properties calculated using DFT, FF, and ML methods. Metrics discussed include formation energy, exfoliation energy, elastic constants, surface energy, vacancy energy, grain boundary energy, bandgaps, and other electronic and optical properties important for applications like solar cells. ML models are developed to predict these properties with mean absolute errors within chemical accuracy compared to DFT benchmarks.
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional... - aimsnist
This document discusses how high-throughput experimentation (HTE) and machine learning (ML) can accelerate materials discovery for functional metallic glasses (MGs). It describes a round robin experiment between NIST and NREL to synthesize and characterize composition spread samples to test data sharing standards. General trends predicted by ML models often correlate within a given synthesis method but systematic differences can occur between methods. While ML is not a replacement for physics, the combination of HTE and ML can identify promising new materials faster than traditional experimentation alone. Autonomous research platforms may enable an even greater acceleration of the materials discovery process.
The document discusses the Materials Genome Initiative (MGI) and the High-Throughput Experimental Materials Collaboratory (HTE-MC). It describes NIST's role in supporting MGI through developing a materials innovation infrastructure. It outlines the vision for HTE-MC, which would integrate high-throughput synthesis and characterization tools across multiple institutions through a shared network and data management platform. This would provide broader access to experimental facilities and materials data to support accelerated materials discovery. A workshop was held in 2018 to discuss establishing the HTE-MC concept and defining its technical, operational and business models.
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... - PyData
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit... - Punit Sharnagat
OSMnx is a Python package to retrieve, model, analyze, and visualize street networks from OpenStreetMap.
OpenStreetMap (OSM) is a collaborative mapping project that provides a free and publicly editable map of the world.
OpenStreetMap provides a valuable crowd-sourced database of raw geospatial data for constructing models of urban street networks for scientific analysis.
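A minimal sketch of this workflow with OSMnx (the place name is an arbitrary example, and API details vary slightly across osmnx versions):

```python
import networkx as nx
import osmnx as ox

# Download and model the drivable street network for a named place.
G = ox.graph_from_place("Nagpur, India", network_type="drive")

# Basic graph-centric statistics: node/edge counts, average circuity, etc.
stats = ox.basic_stats(G)
print(stats["n"], stats["m"], stats["circuity_avg"])

# Approximate betweenness centrality via sampled shortest paths
# (exact centrality on a full city network can be slow).
centrality = nx.betweenness_centrality(nx.DiGraph(G), k=100, seed=0)
```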
A Machine Learning Framework for Materials Knowledge Systems - aimsnist
- The document describes a machine learning framework for developing artificial intelligence-based materials knowledge systems (MKS) to support accelerated materials discovery and development.
- The MKS would have main functions of diagnosing materials problems, predicting materials behaviors, and recommending materials selections or process adjustments.
- It would utilize a Bayesian statistical approach to curate process-structure-property linkages for all materials classes and length scales, accounting for uncertainty in the knowledge, and allow continuous updates from new information sources.
A Framework and Infrastructure for Uncertainty Quantification and Management ... - aimsnist
QuesTek Innovations presented a framework to incorporate materials genome initiatives (MGI) and artificial intelligence (AI) into their integrated computational materials engineering (ICME) practice. They discussed three key aspects: (1) MaGICMaT, a materials genome and ICME toolkit to manage data and property-structure-performance linkages, (2) an uncertainty quantification framework for CALPHAD modeling, and (3) a cloud-based platform to enable rapid development and deployment of ICME models with an HPC backend. The presentation provided details on their approaches for each aspect and highlighted opportunities to further enhance ICME with MGI and AI.
Going Smart and Deep on Materials at ALCF - Ian Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
2D/3D Materials screening and genetic algorithm with ML model - aimsnist
JARVIS-ML provides concise summaries of materials properties using machine learning models trained on the extensive data in the JARVIS repositories. It has developed regression and classification models that can predict formation energies, bandgaps, and other material properties in seconds, much faster than traditional DFT calculations. The models use gradient boosting decision trees and feature importance analysis to provide explanations. JARVIS-ML is available as a public web app and API for rapid screening and discovery of new materials.
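As a rough analogue of such property models (scikit-learn's gradient boosting on random stand-in data, not the actual JARVIS-ML code or descriptors):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X (e.g., composition/structure descriptors)
# and target y (e.g., DFT formation energies); random stand-ins here.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
# Feature importances offer a rough explanation of the model.
print("Most important feature index:", int(np.argmax(model.feature_importances_)))
```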
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications - Anubhav Jain
This document summarizes several projects from Anubhav Jain at Lawrence Berkeley National Laboratory related to using artificial intelligence and data mining for materials science. It discusses (1) developing interpretable descriptors of crystal structure based on local environments, (2) the matminer toolkit for connecting materials data to machine learning algorithms, and (3) the atomate/Rocketsled software for running high-throughput density functional theory calculations on supercomputers. It also briefly outlines a project to develop a text mining database for materials science literature.
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications - aimsnist
This document summarizes four projects from Lawrence Berkeley National Laboratory related to using artificial intelligence and data mining for materials science:
1) Interpretable descriptors of crystal structure that describe local environments as fingerprints to distinguish structures.
2) The matminer toolkit which connects materials data to machine learning algorithms and data visualization.
3) The atomate and Rocketsled software for running high-throughput density functional theory calculations and building a computational optimizer.
4) A text mining approach to label the content of materials science abstracts to build a revised materials search engine and identify related materials.
Classification of News and Research Articles Using Text Pattern Mining - IOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
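For flavor, the frequent-pattern step can be sketched with mlxtend's apriori over a one-hot document-term matrix (toy terms and support threshold; the paper's closed-pattern filtering and taxonomy deployment are omitted):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Hypothetical one-hot document-term matrix: rows are documents,
# columns indicate whether a (stemmed) term occurs.
df = pd.DataFrame({
    "polit": [1, 1, 0, 1], "elect": [1, 1, 0, 0],
    "genom": [0, 0, 1, 0], "market": [0, 1, 1, 1],
}).astype(bool)

# Termsets appearing in at least half of the documents.
frequent = apriori(df, min_support=0.5, use_colnames=True)
print(frequent)
```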
This document proposes using Word2Vec and decision trees to extract keywords from textual documents and classify the documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach involves preprocessing text, representing words as vectors with Word2Vec, calculating frequently occurring keywords for each category, and using decision trees to classify documents based on keyword similarity. Experiments using different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
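A compact sketch of the proposed pipeline under toy assumptions (four hand-made documents; mean word vectors as document features in place of the paper's exact keyword-similarity scheme):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical corpus: tokenized documents with class labels.
docs = [["battery", "cathode", "capacity"],
        ["thermoelectric", "seebeck", "figure", "merit"],
        ["anode", "lithium", "battery"],
        ["seebeck", "thermal", "conductivity"]]
labels = [0, 1, 0, 1]

w2v = Word2Vec(sentences=docs, vector_size=32, min_count=1, seed=0)

def doc_vector(tokens):
    # Represent a document as the mean of its word vectors.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X = np.stack([doc_vector(d) for d in docs])
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict(X))
```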
Using a keyword extraction pipeline to understand concepts in future work sec... - Kai Li
This document describes a study that uses natural language processing and text mining techniques to identify future work statements in scientific papers and extract keywords from those statements. The researchers developed a multi-step pipeline to first identify the future work section, then select future work sentences within that section. They used rules and algorithms to identify sentences discussing future work. Keywords were then extracted from the selected sentences using the RAKE algorithm. An analysis found that 31.4% of papers contained future work statements, with medical science papers having the highest overlap between future work and title-abstract keywords. The researchers hope this work is a first step toward predicting future research topics.
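The RAKE keyword-extraction step might look like the following with the rake_nltk package (the sample sentence is invented):

```python
import nltk
from rake_nltk import Rake

# rake_nltk relies on NLTK's stopword list and punkt tokenizer.
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

sentence = ("In future work, we plan to extend the pipeline to full-text articles "
            "and evaluate keyword overlap across disciplines.")

r = Rake()
r.extract_keywords_from_text(sentence)
print(r.get_ranked_phrases()[:5])  # highest-scoring candidate keywords first
```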
Discovering new functional materials for clean energy and beyond using high-t... - Anubhav Jain
- The research group develops computational methods and machine learning models to design new functional materials using high-throughput computing. This includes developing databases of materials properties, benchmarking machine learning algorithms, and applying natural language processing to materials design. Recent work also involves automating materials synthesis and characterization. The group maintains several open-source software packages that power their research.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Automatically Generating Wikipedia Articles: A Structure-Aware Approach - George Ang
The document describes an approach for automatically generating Wikipedia-style articles by using the structure of existing human-authored articles as templates. It involves inducing templates by analyzing section headings across documents, retrieving relevant excerpts from the internet for each template topic, and jointly training extractors to select excerpts that optimize both local relevance and global coherence across the entire article. The results confirm the benefits of incorporating structural information into the content selection process.
Software tools for high-throughput materials data generation and data mining - Anubhav Jain
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
Post 1: What is text analytics? How does it differ from text mini.docx - stilliegeorgiana
Post 1:
What is text analytics? How does it differ from text mining?
Text analytics is the application of statistical and machine learning techniques to predict, prescribe, or infer information from text-mined data; text mining is a tool that helps clean the data up.
Differences between Text Mining and Text Analytics:
• Text Mining and Text Analytics solve the same problems, but use different techniques and are complementary ways to automatically extract meaning from text.
• Text Analytics was developed within the field of computational linguistics. It encodes human understanding into a series of linguistic rules. Rules generated by humans are high in precision, but they do not automatically adapt and are usually fragile when tried in new situations.
• Text mining is a newer discipline arising out of the fields of statistics, data mining, and machine learning. Its strength is the ability to inductively create models from collections of historical data. Because statistical models are learned from training data, they are adaptive and can identify “unknown unknowns”, leading to better recall. Still, they can be prone to missing something that would seem obvious to a human.
• Text analytics and text mining approaches have essentially equivalent performance. Text analytics requires an expert linguist to produce complex rule sets, whereas text mining requires the analyst to hand-label cases with outcomes or classes to create training data.
• Due to their different perspectives and strengths, combining text analytics with text mining often leads to better performance than either approach alone.
2. What technologies were used in building Watson (both hardware and software)?
Watson is an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. It is an artificially intelligent system developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci, and it was named after IBM's first CEO, industrialist Thomas J. Watson. The system was specifically developed to answer questions on the quiz show Jeopardy! In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings.
Watson received the first prize of $1 million. The goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. IBM undertook a challenge to build a computer system that could compete at the human-champion level in real time on the American TV quiz show Jeopardy! The extent of the challenge in ...
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
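The part-of-speech tagging step can be sketched with NLTK; treating nouns and adjectives as concept candidates is a simplification of the paper's approach:

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The proposed model extracts semantic concepts from text using part-of-speech tagging."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Keep noun and adjective tokens as rough "concept" candidates.
concepts = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]
print(concepts)
```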
This document discusses developing an ontology-based semantic web application for the biological domain. It introduces the need for semantic technologies to help machines better understand and combine biological information from different sources. The document outlines the methodology, which involves defining concepts, properties, and relations in the biological domain to create an ontology. It also discusses implementing a semantic web application using the Jena framework to retrieve and manipulate biological data modeled with ontologies and RDF. The goal is to build a semantic search framework to improve information retrieval for biologists.
The document discusses two NSF-funded research projects on intelligence and security informatics:
1. A project to filter and monitor message streams to detect "new events" and changes in topics or activity levels. It describes the technical challenges and components of automatic message processing.
2. A project called HITIQA to develop high-quality interactive question answering. It describes the team members and key research issues like question semantics, human-computer dialogue, and information quality metrics.
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F... - ijaia
Regression models and their statistical analyses are among the most important tools used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. The true regression is unknown, and specific methods are created and used strictly pertaining to the problem. For pioneering work on procedures for fitting functions, we refer to the methods of least absolute deviations, least squares deviations, and minimax absolute deviations. Today's widely celebrated procedure of the method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, least-squares-based models in practice may fail to provide optimal results in non-Gaussian situations, especially when the errors follow distributions with fat tails. This paper explores an unorthodox method of estimating linear regression coefficients by minimising the GMSE (geometric mean of squared errors). Though GMSE is commonly used to compare models, it is rarely used to obtain the coefficients; such a method is tedious to handle due to the large number of roots obtained by minimisation of the loss function, and this paper offers a way to tackle that problem. Application is illustrated with the 'Advertising' dataset from ISLR, and the obtained results are compared with the results of the method of least squares for a single-index linear regression model.
This document discusses using knowledge graphs to promote cognitive computing. It begins with an introduction to the presenter and their background and research interests. It then outlines the advantages of using knowledge graphs for question answering, machine learning, natural language processing, and information retrieval. Key applications of knowledge graphs at Google, IBM Watson, and on smart phones are also mentioned. The document dives deeper into two of the presenter's research projects on rewriting natural language queries on knowledge graphs and extracting triples from news headlines on Twitter.
Topic detecton by clustering and text miningIRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
The ability to construct domain specific knowledge graphs (KG) and perform question-answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge that is beyond the reach for most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge graph driven analytics for arbitrary application domains. The uniqueness of our system lies A) in its combination of curated KGs along with knowledge extracted from unstructured text, B) support for advanced trending and explanatory questions on a dynamic KG, and C) the ability to answer queries where the answer is embedded across multiple data sources.
Interlinking educational data to Web of Data (Thesis presentation)Enayat Rajabi
This is a thesis presentation about interlinking educational data to Web of Data. I explain how I used the Linked Data approach to expose and interlink educational data to the Linked Open Data cloud
Similar to Applications of Natural Language Processing to Materials Design (20)
Discovering advanced materials for energy applications: theory, high-throughp...Anubhav Jain
Anubhav Jain presented on using density functional theory and high-throughput calculations to design advanced materials for energy applications. Key points included:
1) Density functional theory can be used to model materials physics and properties by approximating many-body quantum mechanics.
2) Thermoelectric materials were discussed as an example application, where the goal is to optimize the figure of merit which depends on conductivity, Seebeck coefficient, and thermal conductivity.
3) High-throughput calculations were performed on over 50,000 materials to efficiently screen for promising thermoelectric candidates like TmAgTe2, though experimental validation is still needed due to approximations.
Applications of Large Language Models in Materials Discovery and DesignAnubhav Jain
The document discusses applications of large language models (LLMs) in materials discovery and design. It describes how LLMs have improved natural language processing tasks related to materials science literature by requiring less custom model training and fine-tuning. As an example, the document discusses how LLMs were used to extract doping information from scientific papers and create a database of over 200,000 doped material compositions. The document suggests LLMs will continue enhancing materials databases and interfaces by integrating search and question-answering capabilities.
An AI-driven closed-loop facility for materials synthesisAnubhav Jain
The document summarizes an AI-driven closed-loop facility for materials synthesis using robotics, machine learning, and optimization algorithms. The facility aims to close the loop on rapid synthesis of new materials by using automated systems to synthesize compounds predicted by algorithms, characterize the results, and feed the data back to improve predictions. In less than 3 weeks, the facility synthesized 41 new chemical compositions out of 58 computationally predicted stable compounds. The facility is now collaborating with other groups to synthesize more complex materials, with the goal of accelerating the discovery of new materials through fully automated closed-loop synthesis and characterization.
Best practices for DuraMat software disseminationAnubhav Jain
The document provides best practices for disseminating software produced by DuraMat-funded projects. It discusses establishing standards and guidance for software produced by DuraMat to save time and effort in development and dissemination, and to provide consistency. The document outlines three levels of dissemination depending on the software's purpose and maturity. Level 1 is for one-off scripts, level 2 is for software used over a project's lifetime, and level 3 is for ongoing, community-maintained projects. Recommendations include documentation, licensing, and use of services like GitHub, Zenodo, and continuous integration tools.
Best practices for DuraMat software disseminationAnubhav Jain
The document provides best practices for disseminating software produced by DuraMat-funded projects. It discusses three levels of dissemination depending on the software's purpose and maturity. For all levels, it recommends documenting code, adding licenses, and hosting on GitHub. For more mature software, it suggests continuous integration, documentation, releases on Zenodo, and submitting to journals. The goal is to effectively share software, establish consistency, and give proper credit for products.
Available methods for predicting materials synthesizability using computation...Anubhav Jain
This document summarizes a talk about computational and machine learning approaches for predicting materials synthesizability. It discusses how machine learning algorithms are generating millions of potential stable compound predictions, far more than can be experimentally tested. It also examines ways to better prioritize candidate materials for synthesis, such as by assessing their likelihood of dynamical stability and calculating their finite-temperature Gibbs free energies more efficiently using machine-learned interatomic force constants. Finally, it describes efforts to integrate literature knowledge using natural language processing to further guide experimental exploration and reduce the number of experiments needed to synthesize predicted materials.
Efficient methods for accurately calculating thermoelectric properties – elec...Anubhav Jain
1) AMSET is a new method for efficiently calculating electronic transport properties from first principles that provides accurate results comparable to more computationally expensive methods.
2) HiPhive uses a data fitting approach to extract interatomic force constants from a small number of non-systematic displacement calculations, avoiding the need for many systematic calculations required by traditional methods to obtain phonon and thermal properties.
3) These new efficient methods enable high-throughput screening of thermoelectric materials by providing accurate transport properties while being computationally feasible for large numbers of materials.
Natural Language Processing for Data Extraction and Synthesizability Predicti...Anubhav Jain
This document discusses using natural language processing and machine learning techniques to extract and analyze synthesis recipes from materials science literature. It presents work using sequence-to-sequence models to extract entities and relationships for the synthesis of gold nanorods and bismuth ferrite from research papers. Decision trees trained on the extracted data are able to reproduce conclusions about the effects of synthesis parameters from literature. However, applying these techniques to predictive synthesis still faces challenges regarding reproducibility, missing information, and lack of negative examples in literature datasets.
This document summarizes a presentation on developing an electrochemical system for selenium removal from water. The project aims to apply machine learning and automated synthesis techniques to accelerate materials development timelines. Initial calculations have reproduced experimental trends for nitrate reduction and screened candidate materials from databases. Procedures have also been established for electrode preparation, testing, and using robots to synthesize predicted candidates. While still early, progress has been made on computational screening, mitigating competing reactions, and testing baseline cathode materials for selenium removal performance and energy efficiency. The remainder of the first project year will focus on refining methods before demonstrating a commercially viable selenium removal system in years two and three.
Accelerating New Materials Design with Supercomputing and Machine LearningAnubhav Jain
Anubhav Jain gave a presentation summarizing his career in materials science research from high school internships through his current role leading the Materials Project. During his PhD and Alvarez fellowship, he developed high-throughput workflows and open source software like FireWorks to automate materials calculations. This allowed him to launch the Materials Project database and scale it up over time with a growing team. The Materials Project has now screened over 180,000 materials and led to successful experimental validations of computational predictions.
DuraMat CO1 Central Data Resource: How it started, how it’s going …Anubhav Jain
The document summarizes several projects developed as part of the DuraMat CO1 Central Data Resource initiative to analyze photovoltaic performance and degradation data. A secure data portal was developed that currently hosts data from 239 users and 271 datasets. Software tools were also created, such as pvAnalytics for data cleaning and filtering, pvOps for operational and maintenance data analysis, and pv-vision for electroluminescence image analysis. These open source tools are publicly available and have helped advance the analysis of PV degradation through access to larger datasets. Overall, the projects have established a foundation for ongoing collaborative research on PV performance and lifetime under DuraMat 2.0.
The Materials Project is a multidisciplinary project with over 250,000 registered users that accelerates materials design. A small team generates data on specific materials using advanced computations and provides organization and dissemination of the data. Over 260,000 registered users can access the data for research and contribute their own experimental or theoretical data. The project continues to deliver new calculated data and works on improving accuracy, modeling magnetic orderings, vibrational properties, and non-ordered compounds. The Materials Project allows users to contribute their own data sets and integrate them with the core data through a new MPContribs capability.
Evaluating Chemical Composition and Crystal Structure Representations using t...Anubhav Jain
This document discusses the Matbench testing protocol for evaluating machine learning models for materials property prediction. Matbench contains 13 standardized tasks to compare different models. Several existing models have been tested, including those using composition features and graph neural networks using structural representations. While some tasks have seen significant improvement, others have seen little progress. The document suggests ways to improve Matbench, such as adding new materials classes, properties, and evaluation metrics to further benchmark progress and encourage development of better models.
Perspectives on chemical composition and crystal structure representations fr...Anubhav Jain
The document discusses the Matbench testing protocol for evaluating machine learning models for materials property prediction. It summarizes the 13 different machine learning tasks in Matbench and the various models that have been tested, including Magpie, Automatminer, MODNet, CGCNN, ALIGNN, and CRABNet. The document outlines ways Matbench could be further improved, such as including a greater diversity of tasks, changing the data splitting methodology, and incorporating active learning into the scoring. The overall goal of Matbench is to provide a standard way to evaluate new machine learning algorithms for materials property prediction and measure progress in the field.
The Materials Project: Applications to energy storage and functional materia...Anubhav Jain
The Materials Project is a free online database containing calculated properties of over 150,000 materials designed to help researchers discover new functional materials. It has been used extensively in academia and industry to identify novel battery electrode materials and solid electrolytes through high-throughput computational screening. Researchers are now using the Materials Project dataset to train machine learning models to predict battery properties and screen for new materials. Related efforts aim to bridge the gap between computational design and physical synthesis by developing an automated synthesis lab to experimentally validate candidate materials identified from the database.
The Materials Project: A Community Data Resource for Accelerating New Materia...Anubhav Jain
The Materials Project is a free online database containing calculated properties of over 150,000 materials designed to accelerate materials design. It contains electronic, thermal, mechanical, magnetic, and other properties powered by hundreds of millions of CPU hours. Users can access core data, tools for analysis, and open-source simulation code. The Materials Project has been used to computationally design new materials that were then experimentally confirmed, such as transparent conductors and thermoelectrics. The project seeks to engage the community through contributions of experimental data, benchmarking of machine learning methods, and disseminating discoveries.
Machine Learning Platform for Catalyst DesignAnubhav Jain
This project aims to develop new electrocatalyst materials for nitrate removal from water using machine learning and computational screening. The team performed calculations on over 1,000 potential compositions to identify promising catalysts with low costs. Experimental synthesis of candidates such as ZnNi and Zn3Co was attempted but it is unclear if the desired alloys were produced. The screening approach is now being applied to identify materials for selenium removal. If successful, low-cost catalysts could be developed to reduce the costs of electrocatalytic water treatment.
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxRASHMI M G
Abnormal or anomalous secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes
Applications of Natural Language Processing to Materials Design
1. Applications of Natural Language Processing to
Materials Design
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
UCB MSE Seminar, March 31 2022
Slides (already) posted to hackingmaterials.lbl.gov
2. 2
Can ML help us work through our backlog of information we
need to assimilate from text sources?
Flood of information
Important things get missed
Useful data, but unstructured
NLP algorithms
3. • Small things – search is not chemistry-aware
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7
(X=S, Se, Te)”.
• Medium things – it is difficult to ask questions or compile
summaries, e.g.:
– What is the band gap of “Si”?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
• Big things – one can’t make predictive use of information in text
– Based on all that is known, what materials should be studied as
thermoelectrics?
– Given a synthesis target of a novel compound (composition + structure),
what kind of synthesis protocol should be followed to realize the compound?
3
Some ways in which existing tools for
searching the literature fall short
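As an illustration of how chemistry-aware matching can work, formulas can be normalized to a canonical composition before indexing. A minimal sketch using pymatgen (an illustration of the idea, not Matscholar's actual search code):

from pymatgen.core import Composition

# Different orderings and fractional formulas reduce to the same composition
print(Composition("TiNiSn") == Composition("NiTiSn"))   # True
print(Composition("Zn0.5O0.5").reduced_formula)         # "ZnO"

# Indexing documents by reduced formula makes "TiNiSn" and "NiTiSn"
# hit the same entries
index_key = Composition("NiTiSn").reduced_formula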
4. The types of features we want to enable
4
(Schematic of desired features:)
• Chemistry-aware search (same input, same results): "Zinc oxide", "ZnO", "OZn", and "Zn0.5O0.5" all map to the same material
• Summary data per material: physical properties, synthesis information, known applications
• Category queries with links to computational databases: "ferroelectrics" returns all known compositions (PbTiO3, BaTiO3, etc.)
• Synthesis planning for a query like "new thermoelectrics": for a known Composition A, a summary of all previous syntheses; for an unknown Composition B, a suggested synthesis protocol
5. What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science
• It is an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
6. Today, this is usually done manually or
(recently) semi-automatically with custom rules
6
(Figure: examples of data extracted manually vs. semi-automatically; the semi-automatic extraction is largely rule-based, not example-based (ML))
7. With Matscholar, we are engaged in two primary efforts
1. Collect raw information from the research
literature to serve as a source for text mining
2. Develop machine learning models that can be
applied to text sources (like the research
literature) to extract useful information
7
8. One of our main machine learning projects concerns
named entity recognition, or automatically labeling text
8
This allows for search
and is crucial to
downstream tasks
9. 9
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
10. 10
Data collection is a multi-step process
Currently, ~4 million
entries (article abstracts)
have been parsed.
Separately, a full-text
database of comparable
size is being compiled via
publisher negotiation
(Berkeley – Ceder group)
11. 11
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
12. • First split the text into sentences
– Seems simple, but remember edge cases: "et al." or
"etc." do not necessarily signify the end of a sentence
despite the period
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Historically done with ChemDataExtractor* with
some custom improvements
– We are moving towards a fully custom tokenizer
12
Step 2 - tokenization
*http://chemdataextractor.org
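To make the edge cases concrete, here is a toy sentence splitter that refuses to break after common abbreviations. The actual tokenizer (and ChemDataExtractor) handles many more cases; treat this only as an illustration:

import re

ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.", "Fig.", "vs.")

def split_sentences(text):
    # Split on a period followed by whitespace, unless the period
    # belongs to a known abbreviation
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        end = match.start() + 1
        if any(text[:end].endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Weston et al. studied ZnO. The band gap is 3.3 eV."))
# ['Weston et al. studied ZnO.', 'The band gap is 3.3 eV.']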
13. 13
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
14. • Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
14
Step 3 – hand label abstracts
15. 15
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
16. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word,
learned by trying to
predict context words
around the target
16
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
17. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word,
learned by trying to
predict context words
around the target
17
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
18. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
18
Word embeddings trained on "normal" text learn
relationships between words
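The analogy can be reproduced with off-the-shelf tools. A sketch using gensim, where pretrained general-English GloVe vectors stand in for a skip-gram word2vec model trained on materials abstracts:

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # general-English word vectors

# "king" - "man" + "woman" -> nearest word is "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Training a domain-specific model follows the same API, e.g. (hypothetical
# variable `tokenized_abstracts`, a list of token lists):
# from gensim.models import Word2Vec
# model = Word2Vec(tokenized_abstracts, vector_size=200, sg=1, window=8)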
19. 19
For scientific text, it learns scientific concepts as well
(Figure: crystal structures of the elements)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
When we train
word2vec on inorganic
materials science
abstracts, we get
representations in-line
with chemical
knowledge
20. 20
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
21. 21
Word embeddings also have the periodic table encoded in them
with no prior knowledge
(Figure: the "word embedding" periodic table)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
22. 22
Side note: the learned element embeddings from text
mining are now used in various state-of-the-art ML models
(Figure: model comparison contrasting "uses mat2vec embeddings" with "uses 1-hot encoded embeddings")
Currently, the two best-performing ML
models for predicting various materials
properties from a chemical
composition make use of mat2vec
embeddings!
"CrabNet"
https://www.nature.com/articles/s41524-021-00545-1
"Roost"
https://www.nature.com/articles/s41467-020-19964-7
23. 23
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
24. • If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
24
Step 4b: How do we train a model to recognize context?
25. 25
Step 4b. An LSTM neural net classifies words by reading
word sequences
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
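A minimal sketch of such a sequence labeler in Keras, assuming padded integer token ids with one tag id per token. The published model differs in its details (e.g. it builds on the word2vec features described above), and the sizes here are illustrative; this only shows the core BiLSTM-tagger idea:

from tensorflow.keras import layers, models

VOCAB, EMB_DIM, NUM_TAGS = 50_000, 200, 15   # illustrative sizes

model = models.Sequential([
    layers.Embedding(VOCAB, EMB_DIM),                              # word vectors (could be word2vec-initialized)
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # read context in both directions
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),  # one label per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# X: (n_sentences, seq_len) padded token ids; y: (n_sentences, seq_len) tag ids
# model.fit(X, y, epochs=5)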
26. 26
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
27. 27
Step 5. Let the model label things for you!
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• F1 scores of ~0.9; the F1 score for inorganic
materials extraction is >0.9.
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
29. 29
We are also integrating Matscholar tools with the
Materials Project database
www.materialsproject.org is a free database of computed
materials properties with over 200K registered users
30. 30
Adding generic search capabilities to MP database
Currently, you need to type a very
strict search format into the MP search
bar – either a list of elements or
specific chemical formulas
You can't search "ferroelectric", for
example – just "BaTiO3"
31. 31
Prototype integration with Materials Project
is already underway
* Working out some kinks that lead to LiCoO2, LiFePO4, etc. not being sorted correctly
32. The types of features we want to enable
32
(Recap of the feature schematic from slide 4: chemistry-aware search, summary data, links to computational databases, and known/suggested synthesis information.)
33. • The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!
33
Limitations (it is not perfect)
34. 34
Could these techniques also be used to predict which
materials we might want to screen for an application?
(Figure: the backlog of papers to read "someday", feeding into NLP algorithms)
35. • Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
35
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
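Concretely, the ranking reduces to dot products over unit-normalized word vectors. In this sketch, `embeddings` and `candidate_compositions` are hypothetical stand-ins for a trained mat2vec-style model and a list of composition tokens:

import numpy as np

def rank_by_similarity(embeddings, candidates, query="thermoelectric", topn=10):
    # embeddings: dict word -> unit-normalized vector, so dot product = cosine similarity
    q = embeddings[query]
    scores = {c: float(np.dot(embeddings[c], q)) for c in candidates if c in embeddings}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# rank_by_similarity(embeddings, candidate_compositions) surfaces known
# thermoelectrics and, more interestingly, never-studied high-scoring compositions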
36. 36
Try ”going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
37. – For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
what materials are the
most promising
thermoelectrics using
data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years 37
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
38. 38
We also published a list of potential new thermoelectrics
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
It is one thing to
retroactively test, but
perhaps another to see
how things go after
publication
39. 39
Overall: ~33% of predictions were studied as
thermoelectrics within 3 years
(Chart legend: "investigated as thermoelectrics (independently of our study)" and "investigated by our own collaborators (as a result of our study)")
• About 1/3 of predicted compounds have been
studied within 3 years – better than we expect
• However, almost all studies were computational
explorations of thermoelectricity / first principles
calculations and not experiments
• 3 compounds had zT measured experimentally:
• Li3Sb reached a peak zT ~ 0.3
• Cu7Te5 reached a peak zT ~ 0.14
• CsGeI3 (after further doping) reached a peak
zT ~ 0.12
• Overall – the forward prediction of materials that are
likely to be studied as thermoelectrics seems to
mostly work
• However, they are not particularly good
thermoelectrics.
40. 40
How is this working?
“Context
words” link
together
information
from different
sources
41. The types of features we want to enable
41
(Recap of the feature schematic from slide 4: chemistry-aware search, summary data, links to computational databases, and known/suggested synthesis information.)
43. 43
Improving the accuracy of the model:
training a BERT-based model
The BERT model is more advanced than word2vec and takes context into account better.
Performance on all tasks is improved; we are currently investigating other models that may
allow even easier annotation and better performance.
Walker, Nicholas, et al. "The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science." Available at SSRN 3950755 (2021).
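With the transformers library, the switch mostly amounts to fine-tuning a token-classification head on the same annotated abstracts. A sketch with a generic checkpoint as a placeholder (a materials-domain BERT would be substituted, and the label count is illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "bert-base-cased"   # placeholder; a domain-specific checkpoint would go here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=15)  # 15 is illustrative

inputs = tok("We studied Sn-doped ZnO thin films.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits     # (1, seq_len, num_labels); head is untrained here
pred_ids = logits.argmax(dim=-1)        # per-token label ids (meaningful after fine-tuning)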
44. • For some tasks/domains, extracting entities is
sufficient
• For others, we need to relate them! NER does not
tell us enough.
44
Improving the capabilities of extraction:
relating entities to one another for complex information
(Figure: NER alone yields unlinked entity lists – Dopants: transition metals, Sm, Sn; Base materials: ZnO, ZnS; Dopant quantities: 5 at.% – with unresolved links marked "?", e.g. "what was doped with Sn?")
Doping of transition metals into ZnS and ZnO nanoparticles . . .
The ZnO:Sm system was formed at 5 at.% . . .
The ZnS sample was also doped with Sn . . .
45. • Our goal is to extract structured graphs of entities
rather than just the entities themselves
• Structured acyclic entity graphs give complete
information for extraction and analysis
45
Doping of transition metals into ZnS and ZnO nanoparticles . . .
The ZnO:Sm system was formed at 5 at.% . . .
The ZnS sample was also doped with Sn . . .
(Entity graph: ZnO – "was doped with" → Sm – "to the amount of" → 5 at.%; ZnS – "was doped with" → Sn; both hosts linked to "transition metals")
By relating entities, we get much more
powerful and useful information extraction
46. • Earlier, dependency extraction was done using grammar rules (e.g.
dependency trees) but it was not particularly successful
• We have been experimenting with large seq2seq transformer models
• These can take in an unstructured text sequence and output a structured
text sequence (e.g., OpenAI Codex that solves programming tasks)
• Can be trained with few (<50) examples due to few-shot capability
46
Utilizing large seq2seq models for ERM
Transition metal doping is an effective tool for controlling optical
absorption in ZnS and hence the number of photons absorbed by
photovoltaic devices. By using first principle density functional
calculations, we compute the change in number of photons absorbed
upon doping with a selected transition metal and found that Ni
offers the best chance to improve the performance. This is
attributed to the formation of defect states in the band gap of the
host ZnS which give rise to additional dipole-allowed optical
transition pathways between the conduction and valence band.
Analysis of the defect level in the band gap shows that TM dopants
do not pin Fermi levels in ZnS and hence the host can be made n- or
p- type with other suitable dopants. The measured optical spectra
from the doped solution processed ZnS nanocrystal supports our
theoretical finding that Ni doping enhances optical absorption the
most compared to Co and Mn doping.
(Pipeline: raw scientific text (input sequence) → seq2seq model, trained on intermediate representations → output sequence → deterministic decoding → entity relationships)
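The prompting pattern looks roughly like this; t5-small is only a small runnable stand-in for the much larger few-shot models the slide refers to, and will not produce reliable extractions without fine-tuning:

from transformers import pipeline

extractor = pipeline("text2text-generation", model="t5-small")  # stand-in model

# One worked example in the prompt ("few-shot"), then the new sentence
prompt = (
    "Extract doping relations as 'host | dopant'.\n"
    "Text: The ZnO:Sm system was formed at 5 at.%.\n"
    "Relations: ZnO | Sm\n"
    "Text: The ZnS sample was also doped with Sn.\n"
    "Relations:"
)
print(extractor(prompt, max_new_tokens=32)[0]["generated_text"])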
47. • Previous NER experiments can be extended with
ERM to include much more information
47
Applying ERM to Dopant/Host extraction
CaCu3Ti4-xCoxO12 is a doped result with
descriptor ceramic and phase cubic from base
material CaCu3Ti4O12 (AKA calcium copper
titanate) and dopant Co + 2 (AKA cobalt).
{
  "basemats": {
    0: {
      "aliases": ["CaCu3Ti4O12", "calcium copper titanate"],
      "descriptor": null,
      ...}},
  "dopants": {
    0: {
      "aliases": ["Co+2", "cobalt"],
      ...}},
  "results": {
    0: {
      "aliases": ["CaCu3Ti$_{bf 4-emph{x}}$Co$_{bfemph{x}}$O12"],
      "linked_basemats": [0],
      "linked_dopants": [0],
      "descriptors": ["ceramics"],
      ...}}
}
(Seq2seq model: unstructured text → structured intermediate sentence; a manual parser then converts it to JSON.)
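The "manual parser" step can be as simple as pattern-matching the intermediate sentence into JSON. A simplified sketch (the real schema above tracks aliases and cross-links; this toy version pulls out just three fields):

import json, re

sentence = ("CaCu3Ti4-xCoxO12 is a doped result with descriptor ceramic and "
            "phase cubic from base material CaCu3Ti4O12 and dopant Co + 2.")

m = re.search(r"^(?P<result>\S+) is a doped result .*?"
              r"from base material (?P<base>\S+) .*?dopant (?P<dopant>.+?)\.$",
              sentence)
record = {"result": m["result"], "basemat": m["base"], "dopant": m["dopant"]}
print(json.dumps(record, indent=2))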
48. For example, we hope to parse a literature-derived
database of dopants and dopability
48
With this capability, we plan to release structured materials
properties databases based on NLP parsing of literature
Sentence | Base material | Dopant | Doping concentration
"…the influence of yttrium doping (0-10mol%) on BSCF…" | BSCF | Yttrium | 0-10 mol%
"undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln. of Mg10Si2Sn3…" | Mg10Si2Sn3 | Sb, Bi, Ca, Zn |
"The zT of As2Cd3 with electron doping is found to be ~ with n=10^20cm-3" | As2Cd3 | electron | n=10^20cm-3
"This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type As2Cd3" | As2Cd3 | p-type | p=10^20cm-3
"The undoped and 0.25wt% La doped CdO films show 111… …however, … for doping concentrations greater than 0.50wt%." | CdO | La | 0.25wt%, >0.5wt%
Which elements are commonly doped
into the same materials (i.e., co-occur
as dopants)?
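Given such records, the dopant co-occurrence question reduces to counting pairs. A sketch over hypothetical extracted records mirroring the rows above:

from collections import Counter
from itertools import combinations

records = [                                  # one record per base material
    {"base": "BSCF", "dopants": ["Y"]},
    {"base": "Mg10Si2Sn3", "dopants": ["Sb", "Bi", "Ca", "Zn"]},
    {"base": "CdO", "dopants": ["La"]},
]

pair_counts = Counter()
for rec in records:
    for a, b in combinations(sorted(set(rec["dopants"])), 2):
        pair_counts[(a, b)] += 1             # dopant pairs that share a host

print(pair_counts.most_common(5))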
49. • Experimentalists identified relevant factors for gold
nanorod dimensions
– Experimental temperatures
– Solution ages/timing
– Precursor amounts
49
We can also tackle complex syntheses if we can do
entity relationship modeling
Seq2Seq model outputs JSON
(form of entity graph)
"seed": {
"prec": {
"HAuCl4": {
"vol": "5 mL",
"concn": "0.25 mM"
},
"CTAB": {
"vol": "HAuCl4",
"concn": "0.1 M"
},
"NaBH4": {
"vol": "0.3 mL",
"concn": "10 mM"
}
},
"seed": {
"size": "3 nm"
},
"temp": "25 degC",
"age": "5 min"
},
Types of factors important in synthesis
Values as extracted from raw text
50. 50
Tests on Au Nanorod Synthesis indicate it is working
Aggregated scores by AuNR recipe component:
                                 | Seed Solution | Growth Solution | AuNR
Entity detected (F1 score)       | 0.94          | 0.92            | 0.76
Exact match to entity (accuracy) | 0.73          | 0.77            | 0.52
Support                          | 159           | 244             | 96
(Seed Solution: age, stir rate, temperature, precursor properties, seed properties. Growth Solution: age, stir rate, temperature, precursor properties. AuNR: aspect ratios, lengths, widths, TSPRs, and LSPRs.)
Evaluated on 40 test paragraphs
Trained on 40 (manual annotation) and 200 (assisted) paragraphs
Entity detected = We correctly detected the types of synthesis information present
Exact match = The extracted synthesis information is an exact string match
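For clarity, the exact-match score can be computed along these lines (a toy version that ignores the per-component aggregation used in the table):

def exact_match_accuracy(gold, pred):
    # Fraction of gold entity strings reproduced verbatim by the extractor
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["0.25 mM", "5 mL", "3 nm"]
pred = ["0.25 mM", "5mL", "3 nm"]            # "5mL" misses a space -> not exact
print(round(exact_match_accuracy(gold, pred), 2))   # 0.67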
51. The types of features we want to enable
51
(Recap of the feature schematic from slide 4; the suggested-synthesis-protocol capability is marked "???" as still open.)
52. 52
Note –
we are creating open-source libraries to help with NLP tasks
https://github.com/lbnlp
53. • There exists a lot of data and knowledge in the
historical corpus of scientific journal articles, but
getting the knowledge has been difficult to do on
a large scale
• Machine learning presents a new frontier for
being able to make use of this information
53
Conclusion
54. 54
The Matscholar team
Funding from:
Slides (already) posted to
hackingmaterials.lbl.gov
John Dagdelen
Alex Dunn
Viktoriia Baibakova
Nick Walker
Sanghoon Lee
Kristin Persson
Anubhav Jain
Gerbrand Ceder
Alumni: Leigh Weston, Vahe Tshitoyan, Amalie Trewartha