Applications of Natural Language Processing to
Materials Design
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
UCB MSE Seminar, March 31 2022
Slides (already) posted to hackingmaterials.lbl.gov
2
Can ML help us work through our backlog of information we
need to assimilate from text sources?
Flood of information
Important things get missed
Useful data, but unstructured
NLP algorithms
• Small things – search is not chemistry-aware
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7
(X=S, Se, Te)”.
• Medium things – it is difficult to ask questions or compile
summaries, e.g.:
– What is the band gap of “Si”?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
• Big things – one can’t make predictive use of information in text
– Based on all that is known, what materials should be studied as
thermoelectrics?
– Given a synthesis target of a novel compound (composition + structure),
what kind of synthesis protocol should be followed to realize the compound?
3
Some ways in which existing tools for
searching the literature fall short
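A minimal sketch of what "chemistry-aware" matching could look like for the small things listed above. It is not Matscholar's implementation: the function names are hypothetical, and the approach (sort elements alphabetically, reduce to smallest integer ratios) is just one plausible normalization. Mapping names such as "zinc oxide" to formulas is out of scope here.

```python
import re
from fractions import Fraction
from functools import reduce
from math import gcd

# Element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def normalize_formula(formula):
    """Return a canonical key: elements sorted alphabetically, smallest integer ratios."""
    counts = {}
    for element, amount in _TOKEN.findall(formula):
        counts[element] = counts.get(element, Fraction(0)) + Fraction(amount or "1")
    if not counts:
        raise ValueError(f"no element symbols found in {formula!r}")
    # Clear fractional amounts (Zn0.5O0.5 -> Zn1O1), then divide out the common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (v.denominator for v in counts.values()), 1)
    ints = {el: int(v * scale) for el, v in counts.items()}
    divisor = reduce(gcd, ints.values())
    return "".join(
        f"{el}{n // divisor if n // divisor > 1 else ''}"
        for el, n in sorted(ints.items())
    )


def expand_placeholder(formula, placeholder, options):
    """Expand shorthand such as SnBi4X7 (X = S, Se, Te) into explicit formulas."""
    return [formula.replace(placeholder, option) for option in options]


# "TiNiSn" and "NiTiSn" now map to the same search key.
same_key = normalize_formula("TiNiSn") == normalize_formula("NiTiSn")
# "SnBi4Te7" matches one expansion of "SnBi4X7 (X=S, Se, Te)".
expanded = {normalize_formula(f) for f in expand_placeholder("SnBi4X7", "X", ["S", "Se", "Te"])}
```

Indexing documents and queries by the same normalized key is what makes "same input, same results" possible; the placeholder expansion is deliberately naive and would need care for symbols such as Xe.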
The types of features we want to enable
4
• Chemistry-aware search (same input, same results): “Zinc oxide”, “ZnO”, “OZn”, “Zn0.5O0.5”
• Summary data: physical properties, synthesis information, known applications
• Links to computational databases
• Search for “ferroelectrics”: all known compositions (PbTiO3, BaTiO3, etc.)
• Search for “new thermoelectrics”
• Composition A synthesis: known; summary of all previous syntheses
• Composition B synthesis: unknown; suggested synthesis protocol
What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science
• It is an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
Today, this is usually done manually or
(recently) semi-automatically with custom rules
6
Data extracted manually
Data extracted semi-automatically
Largely rule-based, not example-based (ML)
With Matscholar, we are engaged in two primary efforts
1. Collect raw information from the research
literature to serve as a source for text mining
2. Develop machine learning models that can be
applied to text sources (like the research
literature) to extract useful information
7
One of our main machine learning projects concerns
named entity recognition, or automatically labeling text
8
This allows for search
and is crucial to
downstream tasks
9
How does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
10
Data collection is a multi-step process
Currently, ~4 million
entries (article abstracts)
have been parsed.
Separately, a full-text
database of comparable
size is compiled via
publisher negotiation
(Berkeley, Ceder group)
11
How does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
• First split the text into sentences
– Seems simple, but remember edge cases: the period in “et al.” or
“etc.” does not necessarily signify the end of a sentence
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Historically done with ChemDataExtractor* with
some custom improvements
– We are moving towards a fully custom tokenizer
12
Step 2 - tokenization
*http://chemdataextractor.org
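A rough sketch of the two tokenization steps, assuming a small abbreviation list and a simple formula pattern. This is not ChemDataExtractor or the custom tokenizer mentioned above; all names are hypothetical, and the formula pattern does not check symbols against the periodic table.

```python
import re

# Abbreviations whose trailing period should not end a sentence (illustrative list).
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.", "Fig.")

# Two or more "element-like" chunks, e.g. BaS, BAs, LiFePO4, Zn0.5O0.5.
FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?){2,}")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def split_sentences(text):
    """Split on period + whitespace + capital letter, unless the period belongs to an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # keep scanning; this period is part of "et al." and similar
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Split into words with selective lowercasing and number homogenization."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(".,;:()")
        if not word:
            continue
        if FORMULA.fullmatch(word):
            tokens.append(word)          # keep case: "BaS" and "BAs" are different compounds
        elif NUMBER.fullmatch(word):
            tokens.append("<num>")       # all numbers share one token
        else:
            tokens.append(word.lower())  # "Battery" and "battery" become one token
    return tokens


sentences = split_sentences(
    "Samples were annealed (Smith et al. Acta Mater.) before testing. Battery tests followed."
)
tokens = tokenize("Battery electrodes of BaS and BAs were heated to 500 K.")
```

The abbreviation check trades recall for precision: a sentence that really does end in "etc." is not split, which is one reason a rule list like this needs custom improvements in practice.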
13
How does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
• Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
14
Step 3 – hand label abstracts
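The slide does not say how the 87.4% agreement was computed. As an illustration only, the sketch below assumes simple token-level percent agreement between two annotators; the label names are placeholders.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of tokens that two annotators labeled identically."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    if not labels_a:
        raise ValueError("no labels to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Placeholder labels: MAT = material, PRO = property, O = not an entity.
annotator_1 = ["MAT", "O", "PRO", "O"]
annotator_2 = ["MAT", "O", "O", "O"]
agreement = percent_agreement(annotator_1, annotator_2)
```

Raw percent agreement does not correct for agreement by chance; a chance-corrected statistic such as Cohen's kappa would be the stricter alternative.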
15
How does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
• We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word
based on trying to
predict context words
around the target
16
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
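To make "predict context words around the target" concrete, here is a small sketch of how skip-gram training pairs can be generated from a tokenized sentence. The window size is an assumption, and the actual training of the 200-dimensional vectors (done with a word2vec implementation) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "band", "gap", "of", "ZnO"], window=1)
# Each pair is one training example: given the target word, predict the context word.
```

Words that appear in similar contexts end up with similar training examples, and therefore similar vectors.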
17
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”
18
Word embeddings trained on “normal” text learn
relationships between words
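The analogy can be sketched with vector arithmetic and cosine similarity. The three-dimensional vectors below are made up for illustration (real embeddings are learned and have hundreds of dimensions), and the function names are hypothetical.

```python
import math

# Hand-made toy vectors; the dimensions loosely mean (royalty, male, female).
TOY_VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "throne": [0.9, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], query))


answer = analogy("king", "man", "woman", TOY_VECTORS)
```

Excluding the three input words from the candidates is the usual convention; without it, the nearest word is often one of the inputs.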
19
For scientific text, it learns scientific concepts as well
crystal structures of the elements
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
When we train
word2vec on inorganic
materials science
abstracts, we get
representations in line
with chemical
knowledge
20
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
21
Word embeddings also have the periodic table encoded in them
with no prior knowledge
“word embedding”
periodic table
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
22
Side note: the learned element embeddings from text
mining are now used in various state-of-the-art ML models
Uses mat2vec
embeddings
Uses 1-hot encoded
embeddings
Uses mat2vec
embeddings
Uses 1-hot encoded
embeddings
Currently, the two best-performing ML
models for predicting various materials
properties from a chemical
composition make use of mat2vec
embeddings!
“CrabNet”: https://www.nature.com/articles/s41524-021-00545-1
“Roost”: https://www.nature.com/articles/s41467-020-19964-7
23
Ok so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
• If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
24
Step 4b: How do we train a model to recognize context?
25
Step 4b. An LSTM neural net classifies words by reading
word sequences
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
26
Ok so how does this work? High-level view
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
27
Step 5. Let the model label things for you!
Named Entity Recognition
X
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• F1 scores are ~0.9; the F1 score for inorganic
materials extraction is >0.9.
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
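For readers unfamiliar with how an F1 score is obtained for entity labeling, here is a sketch that scores predicted tags against hand labels at the entity level. The BIO tagging scheme and the label names are assumptions for illustration; the slides do not state the exact scoring protocol used.

```python
def extract_entities(tags):
    """Convert a BIO tag sequence into a set of (start, end, label) spans."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing entity
        if tag.startswith("I-") and label == tag[2:]:
            continue  # the current entity continues
        if label is not None:
            entities.add((start, i, label))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    return entities


def f1_score(gold_tags, predicted_tags):
    """Entity-level F1: a prediction counts only if span and label both match."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags for a five-token sentence.
gold = ["B-MAT", "I-MAT", "O", "B-PRO", "O"]
predicted = ["B-MAT", "I-MAT", "O", "O", "O"]
score = f1_score(gold, predicted)  # precision 1.0, recall 0.5
```

F1 is the harmonic mean of precision and recall, so a model cannot score well by labeling everything or by labeling almost nothing.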
28
Now we can search!
Live on www.matscholar.com
29
We are also integrating matscholar tools with the
Materials Project database
www.materialsproject.org is a free database of computed
materials properties with more than 200K registered users
30
Adding generic search capabilities to MP database
Currently, you need to type a very
strict search format into MP search
bar – either a list of elements or
specific chemical formulas
Can’t search “ferroelectric” for
example, just “BaTiO3”
31
Prototype integration with Materials Project
is already underway
* Working out some kinks that lead to LiCoO2, LiFePO4, etc. not being sorted correctly
The types of features we want to enable
32
• The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!
33
Limitations (it is not perfect)
34
Could these techniques also be used to predict which
materials we might want to screen for an application?
papers to read “someday”
NLP algorithms
• Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
35
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
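A toy sketch of the ranking idea: score each composition word against the word "thermoelectric" and keep the high scorers that have not yet been studied. The vectors and compound names are invented placeholders, and normalizing the dot product (cosine similarity) is an assumption of this sketch.

```python
import math

# Invented 3-dimensional vectors; real embeddings are learned from the abstracts.
EMBEDDINGS = {
    "thermoelectric": [0.8, 0.6, 0.0],
    "compound_A": [0.9, 0.5, 0.1],
    "compound_B": [0.7, 0.6, 0.2],
    "compound_C": [0.1, 0.2, 0.9],
}
KNOWN_THERMOELECTRICS = {"compound_A"}


def cosine(u, v):
    """Dot product of the two vectors after length normalization."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_unstudied(embeddings, application, known):
    """Rank words not yet linked to the application by similarity to the application word."""
    target = embeddings[application]
    scores = {
        word: cosine(vector, target)
        for word, vector in embeddings.items()
        if word != application and word not in known
    }
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_unstudied(EMBEDDINGS, "thermoelectric", KNOWN_THERMOELECTRICS)
```

The top of the ranking plays the role of the "never studied as a thermoelectric, but high dot product" compositions described above.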
36
Try ”going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
– For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
what materials are the
most promising
thermoelectrics using
data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years
37
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
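The bookkeeping of the "back in time" test can be sketched as follows. It assumes the rankings for each cutoff year have already been produced from embeddings trained only on literature up to that year; all material names and years below are made up.

```python
def later_studied_fraction(predictions_by_year, first_report_year):
    """For each cutoff year, return the share of predicted materials first reported after it."""
    fractions = {}
    for cutoff, predicted in predictions_by_year.items():
        if not predicted:
            continue
        # A material never reported as a thermoelectric counts as a miss.
        hits = [m for m in predicted if first_report_year.get(m, cutoff) > cutoff]
        fractions[cutoff] = len(hits) / len(predicted)
    return fractions


# Hypothetical top predictions per cutoff year, and the year each material
# was first reported as a thermoelectric (absent if never reported).
predictions = {2001: ["A", "B", "C", "D"], 2005: ["C", "E"]}
first_report = {"A": 2004, "B": 2010}
fractions = later_studied_fraction(predictions, first_report)
```

Comparing these fractions with those of randomly drawn materials would show whether the ranking carries real signal.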
38
We also published a list of potential new thermoelectrics
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
It is one thing to
retroactively test, but
perhaps another to see
how things go after
publication
39
Overall: ~33% of predictions were studied as
thermoelectrics within 3 years
Investigated as thermoelectrics
(independently of our study)
• About 1/3 of predicted compounds have been
studied within 3 years, better than we expected
• However, almost all studies were computational
explorations of thermoelectricity / first principles
calculations and not experiments
• 3 compounds had zT measured experimentally:
• Li3Sb reached a peak zT ~ 0.3
• Cu7Te5 reached a peak zT ~ 0.14
• CsGeI3 (after further doping) reached a peak
zT ~ 0.12
• Overall – the forward prediction of materials that are
likely to be studied as thermoelectrics seems to
mostly work
• However, they are not particularly good
thermoelectrics.
Investigated by our own collaborators
(as a result of our study)
40
How is this working?
“Context
words” link
together
information
from different
sources
The types of features we want to enable
41
42
Roadmap – what’s next?
43
Improving the accuracy of the model:
training a BERT-based model
The BERT model is more advanced than word2vec and better takes into account context.
Performance on all tasks is improved; we are currently investigating other models that may
have even easier annotation and better performance.
Walker, Nicholas, et al. "The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science." Available at SSRN 3950755 (2021).
• For some tasks/domains, extracting entities is
sufficient
• For others, we need to relate them! NER does not
tell us enough.
44
Improving the capabilities of extraction:
relating entities to one another for complex information
Dopants: transition metals, Sm, Sn
Base materials: ZnO, ZnS
Dopant quantities: 5 at. %
Relationships between them: ?
What was doped with Sn??
Doping of transition metals into ZnS and ZnO nanoparticles . . .
The ZnO:Sm system was formed at 5 at.% . . .
The ZnS sample was also doped with Sn . . .
• Our goal is to extract structured graphs of entities
rather than just the entities themselves
• Structured acyclic entity graphs give complete
information for extraction and analysis
45
Doping of transition metals into ZnS and ZnO nanoparticles . . .
The ZnO:Sm system was formed at 5 at.% . . .
The ZnS sample was also doped with Sn . . .
ZnS
ZnO
Transition metals
Sm
Sn
5 at. %
“was doped with”
“to the amount of”
By relating entities, we get much more
powerful and useful information extraction
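One way to picture the entity graph is as a list of labeled edges. The sketch below encodes the three example sentences by hand (the class and function names are hypothetical, and the edges are one reading of the slide) to show how a question like "What was doped with Sn?" becomes a simple lookup once relations are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: str
    label: str
    tail: str


# Hand-written edges for the three example sentences.
GRAPH = [
    Relation("ZnS", "was doped with", "transition metals"),
    Relation("ZnO", "was doped with", "transition metals"),
    Relation("ZnO", "was doped with", "Sm"),
    Relation("Sm", "to the amount of", "5 at. %"),
    Relation("ZnS", "was doped with", "Sn"),
]


def heads(graph, label, tail):
    """All entities that point at `tail` through an edge with this label."""
    return [r.head for r in graph if r.label == label and r.tail == tail]


def tails(graph, head, label):
    """All entities that `head` points at through an edge with this label."""
    return [r.tail for r in graph if r.head == head and r.label == label]


doped_with_sn = heads(GRAPH, "was doped with", "Sn")
sm_amount = tails(GRAPH, "Sm", "to the amount of")
```

With entity recognition alone, "Sn", "ZnS", and "ZnO" would all be found, but nothing would say which host the Sn went into.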
• Earlier, dependency extraction was done using grammar rules (e.g.
dependency trees) but it was not particularly successful
• We have been experimenting with large seq2seq transformer models
• These can take in an unstructured text sequence and output a structured
text sequence (e.g., OpenAI Codex that solves programming tasks)
• Can be trained with few (<50) examples due to few-shot capability
46
Utilizing large seq2seq models for ERM
Transition metal doping is an effective tool for controlling optical
absorption in ZnS and hence the number of photons absorbed by
photovoltaic devices. By using first principle density functional
calculations, we compute the change in number of photons absorbed
upon doping with a selected transition metal and found that Ni
offers the best chance to improve the performance. This is
attributed to the formation of defect states in the band gap of the
host ZnS which give rise to additional dipole-allowed optical
transition pathways between the conduction and valence band.
Analysis of the defect level in the band gap shows that TM dopants
do not pin Fermi levels in ZnS and hence the host can be made n- or
p- type with other suitable dopants. The measured optical spectra
from the doped solution processed ZnS nanocrystal supports our
theoretical finding that Ni doping enhances optical absorption the
most compared to Co and Mn doping.
Raw scientific text (input seq) → Seq2seq model, trained on intermediate reps. → Output sequence (output seq) → Deterministic decoding → Entity relationships
• Previous NER experiments can be extended with
ERM to include much more information
47
Applying ERM to Dopant/Host extraction
CaCu3Ti4-xCoxO12 is a doped result with
descriptor ceramic and phase cubic from base
material CaCu3Ti4O12 (AKA calcium copper
titanate) and dopant Co + 2 (AKA cobalt).
{
  "basemats": {
    0: {
      "aliases": ["CaCu3Ti4O12", "calcium copper titanate"],
      "descriptor": null,
      ...}},
  "dopants": {
    0: {
      "aliases": ["Co+2", "cobalt"],
      ...}},
  "results": {
    0: {
      "aliases": "CaCu3Ti$_{\bf 4-\emph{x}}$Co$_{\bf\emph{x}}$O12",
      "linked_basemats": [0],
      "linked_dopants": [0],
      "descriptors": ["ceramics"],
      ...}}
}
Seq2seq model
unstructured to structured
Manual parser
For example, we hope to parse a literature-derived
database of dopants and dopability
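The deterministic decoding step can be sketched as a small parser for the templated sentence shown above. The regular expression covers only this one template and is an assumption about its grammar, not the parser actually used; the output keys follow the JSON on the slide, except "phase", which is added here for illustration.

```python
import re

# Covers only the single template from the example sentence.
PATTERN = re.compile(
    r"^(?P<result>\S+) is a doped result"
    r"(?: with descriptor (?P<descriptor>.+?))?"
    r"(?: and phase (?P<phase>.+?))?"
    r" from base material (?P<base>.+?)(?: \(AKA (?P<base_alias>.+?)\))?"
    r" and dopant (?P<dopant>.+?)(?: \(AKA (?P<dopant_alias>.+?)\))?\.?$"
)


def parse_doping_sentence(sentence):
    """Turn one templated sentence into a nested dict of linked entities."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        raise ValueError("sentence does not follow the expected template")
    g = match.groupdict()
    base_aliases = [g["base"]] + ([g["base_alias"]] if g["base_alias"] else [])
    dopant_name = g["dopant"].replace(" ", "")  # "Co + 2" -> "Co+2"
    dopant_aliases = [dopant_name] + ([g["dopant_alias"]] if g["dopant_alias"] else [])
    return {
        "basemats": {0: {"aliases": base_aliases}},
        "dopants": {0: {"aliases": dopant_aliases}},
        "results": {0: {
            "aliases": [g["result"]],
            "linked_basemats": [0],
            "linked_dopants": [0],
            "descriptors": [g["descriptor"]] if g["descriptor"] else [],
            "phase": g["phase"],
        }},
    }


parsed = parse_doping_sentence(
    "CaCu3Ti4-xCoxO12 is a doped result with descriptor ceramic and phase cubic "
    "from base material CaCu3Ti4O12 (AKA calcium copper titanate) "
    "and dopant Co + 2 (AKA cobalt)."
)
```

Because the model emits constrained, sentence-like text, this last step needs no learning at all: a fixed parser is enough to recover the graph.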
48
With this capability, we plan to release structured materials
properties databases based on NLP parsing of literature
Sentence | Base Material | Dopant | Doping Concentr.
…the influence of yttrium doping (0-10mol%) on BSCF… | BSCF | Yttrium | 0-10 mol%
undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln. of Mg10Si2Sn3… | Mg10Si2Sn3 | Sb, Bi, Ca, Zn | (none given)
The zT of As2Cd3 with electron doping is found to be ~ with n=10^20cm-3 | As2Cd3 | electron | n=10^20cm-3
This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type As2Cd3T | As2Cd3 | p-type | p=10^20cm-3
The undoped and 0.25wt% La doped CdO films show 111… …however, …. for doping concentrations greater than 0.50wt%. | CdO | La | 0.25wt%, >0.5%
Which elements are commonly doped
into the same materials (i.e., co-occur
as dopants)?
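Once such a table exists, the co-occurrence question reduces to counting pairs. A small sketch, using the Mg10Si2Sn3 row from the table plus one invented record (marked below) so that a pair occurs more than once:

```python
from collections import Counter, defaultdict
from itertools import combinations


def dopant_cooccurrence(records):
    """Count how many base materials each pair of dopants shares."""
    dopants_by_base = defaultdict(set)
    for record in records:
        dopants_by_base[record["base"]].update(record["dopants"])
    pair_counts = Counter()
    for dopants in dopants_by_base.values():
        # Sorting makes ("Bi", "Sb") and ("Sb", "Bi") the same key.
        pair_counts.update(combinations(sorted(dopants), 2))
    return pair_counts


records = [
    {"base": "Mg10Si2Sn3", "dopants": ["Sb", "Bi", "Ca", "Zn"]},  # row from the table
    {"base": "CdO", "dopants": ["La"]},                           # row from the table
    {"base": "host_X", "dopants": ["Sb", "Bi"]},                  # invented for illustration
]
pair_counts = dopant_cooccurrence(records)
most_common_pair = pair_counts.most_common(1)[0]
```

Grouping by base material first means that dopants reported in different papers for the same host are still counted together.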
• Experimentalists identified relevant factors for gold
nanorod dimensions
– Experimental temperatures
– Solution ages/timing
– Precursor amounts
49
We can also tackle complex syntheses if we can do
entity relationship modeling
Seq2Seq model outputs JSON
(form of entity graph)
"seed": {
"prec": {
"HAuCl4": {
"vol": "5 mL",
"concn": "0.25 mM"
},
"CTAB": {
"vol": "HAuCl4",
"concn": "0.1 M"
},
"NaBH4": {
"vol": "0.3 mL",
"concn": "10 mM"
}
},
"seed": {
"size": "3 nm"
},
"temp": "25 degC",
"age": "5 min"
},
Types of factors important in synthesis
Values as extracted from raw text
50
Tests on Au Nanorod Synthesis indicate it is working
Recipe component | Entity detected (F1 score) | Exact match to entity (accuracy) | Support
Seed solution (age, stir rate, temperature, precursor properties, seed properties) | 0.94 | 0.73 | 159
Growth solution (age, stir rate, temperature, precursor properties) | 0.92 | 0.77 | 244
AuNR (aspect ratios, lengths, widths, TSPRs, and LSPRs) | 0.76 | 0.52 | 96
Aggregated scores by AuNR recipe component
Evaluated on 40 test paragraphs
Trained on 40 (manual annotation) and 200 (assisted) paragraphs
Entity detected = We correctly detected the types of synthesis information present
Exact match = The extracted synthesis information is an exact string match
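The two metrics can be illustrated on nested JSON like the seed-solution example. This sketch flattens each tree into key paths, treats a shared path as "entity detected", and requires identical strings for "exact match". The denominators chosen here are an assumption; the slides do not spell out the exact definitions.

```python
def flatten(tree, prefix=()):
    """Map every leaf of a nested dict to its key path, e.g. ('seed', 'temp') -> '25 degC'."""
    flat = {}
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def score(gold, predicted):
    """Return (detection F1 over key paths, exact-match accuracy over gold leaves)."""
    g, p = flatten(gold), flatten(predicted)
    shared = g.keys() & p.keys()
    if not shared:
        return 0.0, 0.0
    precision = len(shared) / len(p)
    recall = len(shared) / len(g)
    detection_f1 = 2 * precision * recall / (precision + recall)
    exact_match = sum(g[k] == p[k] for k in shared) / len(g)
    return detection_f1, exact_match


gold = {"seed": {"prec": {"HAuCl4": {"vol": "5 mL", "concn": "0.25 mM"}},
                 "temp": "25 degC", "age": "5 min"}}
# Hypothetical model output: "age" is missing and "concn" differs by one space.
predicted = {"seed": {"prec": {"HAuCl4": {"vol": "5 mL", "concn": "0.25mM"}},
                      "temp": "25 degC"}}
detection_f1, exact_match = score(gold, predicted)
```

The missing space in "0.25mM" shows why exact-match accuracy sits well below the detection score: the right information was found, but the string is not identical.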
The types of features we want to enable
51
???
52
Note –
we are creating open-source libraries to help with NLP tasks
https://github.com/lbnlp
• There exists a lot of data and knowledge in the
historical corpus of scientific journal articles, but
extracting that knowledge at a large scale has been
difficult
• Machine learning presents a new frontier for
being able to make use of this information
53
Conclusion
54
The Matscholar team
Funding from:
Slides (already) posted to
hackingmaterials.lbl.gov
John Dagdelen
Alex Dunn
Viktoriia Baibakova
Nick Walker
Kristin Persson
Anubhav Jain
Gerbrand Ceder
Leigh Weston
Vahe Tshitoyan
Amalie Trewartha
alumni
Sanghoon Lee

More Related Content

What's hot

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Anubhav Jain
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
Anubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
Anubhav Jain
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
BrianDeCost
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
Shintaro Fukushima
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processing
Anubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
Anubhav Jain
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
Anubhav Jain
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and Analytics
Anubhav Jain
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
aimsnist
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
aimsnist
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
Jason Hattrick-Simpers
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
PyData
 
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...
Punit Sharnagat
 
A Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge SystemsA Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge Systems
aimsnist
 
A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...
aimsnist
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
Ian Foster
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
aimsnist
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
Ian Foster
 

What's hot (20)

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processing
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and Analytics
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
 
Hattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop SlidesHattrick Simpers TMS Machine Learning Workshop Slides
Hattrick Simpers TMS Machine Learning Workshop Slides
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit...
 
A Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge SystemsA Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge Systems
 
A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 

Applications of Natural Language Processing to Materials Design

  • 1. Applications of Natural Language Processing to Materials Design Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA UCB MSE Seminar, March 31 2022 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Can ML help us work through our backlog of information we need to assimilate from text sources? Figure labels: flood of information; important things get missed; useful data, but unstructured; NLP algorithms.
  • 3. Some ways in which existing tools for searching the literature fall short
      • Small things – search is not chemistry-aware
          – a search for “TiNiSn” will give different results than “NiTiSn”
          – a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7 (X=S, Se, Te)”.
      • Medium things – it is difficult to ask questions or compile summaries, e.g.:
          – What is the band gap of “Si”?
          – What are all the known dopants into GaAs?
          – What are all materials studied as thermoelectrics?
      • Big things – one can’t make predictive use of information in text
          – Based on all that is known, what materials should be studied as thermoelectrics?
          – Given a synthesis target of a novel compound (composition + structure), what kind of synthesis protocol should be followed to realize the compound?
  • 4. The types of features we want to enable (figure; labels recovered from the slide)
      • Chemistry aware search (same input, same results): Zinc oxide, ZnO, OZn
      • Summary data: physical properties, synthesis information, known applications
      • ferroelectrics: all known compositions (PbTiO3, BaTiO3, etc.)
      • Links to computational databases
      • Zn0.5O0.5
      • Composition A synthesis / Composition B synthesis: known (summary of all previous syntheses) or unknown (suggested synthesis protocol)
      • new thermoelectrics
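One way to get "same input, same results" for formula queries is to reduce every formula to a canonical string before indexing or searching. The following is a minimal sketch under stated assumptions, not the Matscholar implementation: the function name `normalize_formula` is hypothetical, only flat formulas (no parentheses, no variables such as X) are handled, and treating a fractional form such as Zn0.5O0.5 as equivalent to ZnO is an assumption of this sketch.

```python
import re
from collections import Counter
from fractions import Fraction
from functools import reduce
from math import gcd

# One element symbol followed by an optional integer or decimal amount.
_ELEMENT_AMOUNT = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")
_WHOLE_FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")


def normalize_formula(formula: str) -> str:
    """Return a canonical form: elements sorted alphabetically, amounts
    reduced to the smallest integer ratio. Hypothetical helper."""
    if not _WHOLE_FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")

    # Sum the amount of each element exactly, using fractions.
    amounts = Counter()
    for element, amount in _ELEMENT_AMOUNT.findall(formula):
        amounts[element] += Fraction(amount) if amount else Fraction(1)
    if any(value == 0 for value in amounts.values()):
        raise ValueError(f"zero amount in formula: {formula!r}")

    # Scale to integers with the least common multiple of the denominators.
    common = reduce(lambda a, b: a * b // gcd(a, b),
                    (f.denominator for f in amounts.values()), 1)
    scaled = {el: int(f * common) for el, f in amounts.items()}

    # Divide out the greatest common divisor to get the reduced ratio.
    divisor = reduce(gcd, scaled.values())
    parts = []
    for el in sorted(scaled):
        n = scaled[el] // divisor
        parts.append(el if n == 1 else f"{el}{n}")
    return "".join(parts)


if __name__ == "__main__":
    for query in ["ZnO", "OZn", "Zn0.5O0.5", "TiNiSn", "NiTiSn"]:
        print(query, "->", normalize_formula(query))
```

Reducing to the smallest integer ratio is a deliberate simplification: it also merges formulas that differ only by a common multiple, which a real search index may or may not want.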
  • 5. What is matscholar? • Matscholar is an attempt to organize the world’s information on materials science • It is an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles
  • 6. Today, this is usually done manually or (recently) semi-automatically with custom rules. Figure labels: data extracted manually; data extracted semi-automatically; largely rule-based, not example-based (ML).
  • 7. With Matscholar, we are engaged in two primary efforts
      1. Collect raw information from the research literature to serve as a source for text mining
      2. Develop machine learning models that can be applied to text sources (like the research literature) to extract useful information
  • 8. One of our main machine learning projects concerns named entity recognition, or automatically labeling text 8 This allows for search and is crucial to downstream tasks
  • 9. 9 How does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 10. 10 Data collection is a multi-step process Currently, ~4 million entries (article abstracts) have been parsed. Separately, a full-text database of comparable size is being compiled via publisher negotiation (Berkeley - Ceder group)
  • 11. 11 How does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 12. • First split the text into sentences – Seems simple, but remember edge cases: the period in “et al.” or “etc.” does not necessarily mark the end of a sentence • Then split the sentences into words – The tricky parts are detecting and normalizing chemical formulas, selective lowercasing (“Battery” vs “battery” or “BaS” vs “BAs”), homogenizing numbers, etc. • Historically done with ChemDataExtractor* with some custom improvements – We are moving towards a fully custom tokenizer 12 Step 2 - tokenization *http://chemdataextractor.org
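The two edge cases named on this slide (abbreviation periods and selective lowercasing) can be sketched in a few lines. This is an illustrative heuristic only, not ChemDataExtractor or the custom Matscholar tokenizer; the abbreviation list and both function names are assumptions made for the example.

```python
import re

# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.")

def split_sentences(text):
    """Split on a period followed by whitespace, unless the period ends an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        candidate = text[start:match.start() + 1]  # include the period
        if candidate.endswith(ABBREVIATIONS):
            continue  # the period belongs to an abbreviation, keep reading
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def selective_lowercase(token):
    """Lowercase ordinary capitalized words, but leave formula-like tokens alone."""
    if token.isalpha() and len(token) > 2 and token[0].isupper() and token[1:].islower():
        return token.lower()
    return token

text = "We follow Weston et al. for labeling. Oxides, sulfides, etc. were tagged. The band gap is 4.5 eV."
print(split_sentences(text))
print([selective_lowercase(t) for t in ["Battery", "BaS", "BAs", "Si"]])
```

The sketch never splits after a listed abbreviation, so it will miss sentences that genuinely end in "etc."; a real tokenizer needs more context than this.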
  • 13. 13 How does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 14. • Part A is marking abstracts as relevant / non-relevant to inorganic materials science • Part B is tediously labeling ~600 abstracts – Largely done by one person – Spot-check of 25 abstracts by a second person gave 87.4% agreement 14 Step 3 – hand label abstracts
  • 15. 15 How does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 16. • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector • These vectors encode the meaning of each word by training a model to predict the context words around the target 16 Step 4a: the word2vec algorithm is used to “featurize” words Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
  • 17. • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector • These vectors encode the meaning of each word by training a model to predict the context words around the target 17 Step 4a: the word2vec algorithm is used to “featurize” words Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017 “You shall know a word by the company it keeps” - John Rupert Firth (1957)
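To make "predict context words around the target" concrete, the sketch below generates the (target, context) training pairs that a skip-gram model is trained on. It shows only the pair generation, not the neural network or its training; the function name and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "band", "gap", "of", "ZnO"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

Words that appear in similar contexts produce similar training pairs, which is why their learned vectors end up close together.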
  • 18. • The classic example is: – “king” - “man” + “woman” = ? → “queen” 18 Word embeddings trained on “normal” text learn relationships between words
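The analogy arithmetic can be reproduced with hand-made vectors. The two-dimensional vectors below are invented for the example (first axis roughly "royalty", second axis roughly "gender") and are not real word2vec output; the function names are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors, made up for illustration only.
TOY = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "castle": (1.0, 0.0),
}

print(analogy(TOY, "king", "man", "woman"))  # queen
```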
  • 19. 19 For scientific text, it learns scientific concepts as well crystal structures of the elements Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). When we train word2vec on inorganic materials science abstracts, we get representations in-line with chemical knowledge
  • 20. 20 There seems to be materials knowledge encoded in the word vectors Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 21. 21 Word embeddings also have the periodic table encoded in them with no prior knowledge “word embedding” periodic table Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 22. 22 Side note: the learned element embeddings from text mining are now used in various state-of-the-art ML models Currently, the two best-performing ML models for predicting various materials properties from a chemical composition make use of mat2vec embeddings! ”Crabnet” https://www.nature.com/articles/s41524-021-00545-1 ”RooST” https://www.nature.com/articles/s41467-020-19964-7 (figure labels: uses mat2vec embeddings vs. uses 1-hot encoded embeddings)
  • 23. 23 Ok so how does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 24. • If you read this sentence: “The band gap of ___ is 4.5 eV” It is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.) How do we get a neural network to take into account context (as well as properties of the word itself)? 24 Step 4b: How do we train a model to recognize context?
  • 25. 25 Step 4b.An LSTM neural net classifies words by reading word sequences Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 26. 26 Ok so how does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 27. 27 Step 5. Let the model label things for you! Named Entity Recognition X • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. • f1 scores of ~0.9. f1 score for inorganic materials extraction is >0.9. Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
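For readers unfamiliar with the f1 score quoted above, the sketch below computes it for one entity label from token-level predictions. The label names and the example sequences are invented for illustration, and the exact evaluation protocol used in the cited paper is not reproduced here.

```python
def f1_for_label(true_labels, pred_labels, label):
    """Precision/recall-based F1 for a single label over aligned token sequences."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical hand labels vs. model output ("MAT" = material, "PRO" = property, "O" = other).
true_labels = ["MAT", "O", "PRO", "PRO", "O", "MAT"]
pred_labels = ["MAT", "O", "PRO", "O", "O", "O"]
print(round(f1_for_label(true_labels, pred_labels, "MAT"), 3))
```

Here the model finds one of the two materials and makes no false material predictions, so precision is 1.0, recall is 0.5, and F1 is 2/3.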
  • 28. 28 Now we can search! Live on www.matscholar.com
  • 29. 29 We are also integrating matscholar tools with the Materials Project database www.materialsproject.org is a free database of computed materials properties with over 200K registered users
  • 30. 30 Adding generic search capabilities to MP database Currently, you need to type a very strict search format into MP search bar – either a list of elements or specific chemical formulas Can’t search “ferroelectric” for example, just “BaTiO3”
  • 31. 31 Prototype integration with Materials Project is already underway * Working out some kinks that lead to LiCoO2, LiFePO4, etc not being sorted correctly
  • 32. The types of features we want to enable 32 Zinc oxide ZnO OZn Chemistry aware search (same input, same results) Summary data • Physical properties • Synthesis information • Known applications ferroelectrics All known compositions (PbTiO3, BaTiO3, etc.) Links to computational databases Zn0.5O0.5 new thermoelectrics Composition A Composition B Composition A synthesis Composition B synthesis Known; summary of all previous syntheses Unknown; suggested synthesis protocol
  • 33. • The publication data set is not complete • Currently analyzing abstracts only • The algorithms are not perfect • The search interface could be improved further • We would like to hear from you if you try this! 33 Limitations (it is not perfect)
  • 34. 34 Could these techniques also be used to predict which materials we might want to screen for an application? papers to read “someday” NLP algorithms
  • 35. • Dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric • Compositions with high dot products are typically known thermoelectrics • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric • These compositions usually have high computed power factors! (DFT+BoltzTraP) 35 Making predictions: dot products measure likelihood for words to co-occur Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
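The ranking idea on this slide can be sketched as follows. The three-dimensional vectors are made-up numbers, not trained embeddings (the slides use 200 dimensions), and the set of "known" thermoelectrics is supplied by hand for the example; function names are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_by_dot_product(embeddings, query, candidates):
    """Sort candidate words by their dot product with the query word, highest first."""
    q = embeddings[query]
    return sorted(candidates, key=lambda w: dot(embeddings[w], q), reverse=True)

# Toy embeddings, invented for illustration only.
embeddings = {
    "thermoelectric": (0.9, 0.1, 0.3),
    "Bi2Te3": (0.8, 0.2, 0.4),
    "PbTe": (0.7, 0.1, 0.5),
    "Li3Sb": (0.6, 0.3, 0.2),
    "BaTiO3": (0.1, 0.9, 0.1),
}
known_thermoelectrics = {"Bi2Te3", "PbTe"}  # assumed literature knowledge

ranking = rank_by_dot_product(
    embeddings, "thermoelectric", ["BaTiO3", "Li3Sb", "PbTe", "Bi2Te3"]
)
unexplored = [m for m in ranking if m not in known_thermoelectrics]
print(ranking)
print(unexplored)
```

Compositions that rank highly but are absent from the known set are the ones the slide describes as prediction candidates.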
  • 36. 36 Try “going back in time” and ranking materials, and follow what happens in later years Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 37. – For every year since 2001, see which compounds we would have predicted using only literature data until that point in time – Make predictions of what materials are the most promising thermoelectrics for data until that year – See if those materials were actually studied as thermoelectrics in subsequent years 37 A more comprehensive “back in time” test Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
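The bookkeeping behind such a "back in time" test might look like the sketch below: predictions are made with literature up to a cutoff year, then checked against the year each material was first reported as a thermoelectric. The material names, years, and function name are placeholders invented for the example, not data from the study.

```python
def later_study_rate(predictions, first_studied_year, cutoff_year, horizon):
    """Fraction of predictions first studied within `horizon` years after `cutoff_year`."""
    if not predictions:
        return 0.0
    hits = [
        m for m in predictions
        if cutoff_year < first_studied_year.get(m, float("inf")) <= cutoff_year + horizon
    ]
    return len(hits) / len(predictions)

# Placeholder data: materials predicted from literature up to 2001, and the
# (hypothetical) year each was first studied as a thermoelectric, if ever.
predictions_2001 = ["material_A", "material_B", "material_C"]
first_studied_year = {"material_A": 2003, "material_B": 2010}

print(later_study_rate(predictions_2001, first_studied_year, cutoff_year=2001, horizon=3))
```

Repeating this for every cutoff year and comparing against randomly chosen materials gives a baseline for how much signal the embeddings carry.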
  • 38. 38 We also published a list of potential new thermoelectrics Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). It is one thing to retroactively test, but perhaps another to see how things go after publication
  • 39. 39 Overall: ~33% of predictions were studied as thermoelectrics within 3 years Investigated as thermoelectrics (independently of our study) • About 1/3 of predicted compounds have been studied within 3 years – better than we expected • However, almost all studies were computational explorations of thermoelectricity / first principles calculations and not experiments • 3 compounds had zT measured experimentally: • Li3Sb reached a peak zT ~ 0.3 • Cu7Te5 reached a peak zT ~ 0.14 • CsGeI3 (after further doping) reached a peak zT ~ 0.12 • Overall – the forward prediction of materials that are likely to be studied as thermoelectrics seems to mostly work • However, they are not particularly good thermoelectrics. Investigated by our own collaborators (as a result of our study)
  • 40. 40 How is this working? “Context words” link together information from different sources
  • 41. The types of features we want to enable 41 Zinc oxide ZnO OZn Chemistry aware search (same input, same results) Summary data • Physical properties • Synthesis information • Known applications ferroelectrics All known compositions (PbTiO3, BaTiO3, etc.) Links to computational databases Zn0.5O0.5 new thermoelectrics Composition A Composition B Composition A synthesis Composition B synthesis Known; summary of all previous syntheses Unknown; suggested synthesis protocol
  • 43. 43 Improving the accuracy of the model: training a BERT-based model The BERT model is more advanced than word2vec and better takes into account context. Performance on all tasks is improved; we are currently investigating other models that may have even easier annotation and better performance. Walker, Nicholas, et al. "The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science." Available at SSRN 3950755 (2021).
  • 44. • For some tasks/domains, extracting entities is sufficient • For others, we need to relate them! NER does not tell us enough. 44 Improving the capabilities of extraction: relating entities to one another for complex information Dopants Transition metals Sm Sn Base materials ZnO ZnS Dopant quantities 5 at. % ? ? ? What was doped with Sn?? Doping of transition metals into ZnS and ZnO nanoparticles . . . The ZnO:Sm system was formed at 5 at.% . . . The ZnS sample was also doped with Sn . . .
  • 45. • Our goal is to extract structured graphs of entities rather than just the entities themselves • Structured acyclic entity graphs give complete information for extraction and analysis 45 Doping of transition metals into ZnS and ZnO nanoparticles . . . The ZnO:Sm system was formed at 5 at.% . . . The ZnS sample was also doped with Sn . . . ZnS ZnO Transition metals Sm Sn 5 at. % “was doped with” “to the amount of” By relating entities, we get much more powerful and useful information extraction
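One simple way to hold the entity graph from this example is a list of (head, relation, tail) edges. The representation and helper name below are assumptions for illustration; the edges themselves are taken from the ZnO/ZnS example text on the slide.

```python
# Edges read off the example sentences: (head entity, relation, tail entity).
edges = [
    ("ZnO", "was doped with", "Sm"),
    ("Sm", "to the amount of", "5 at. %"),
    ("ZnS", "was doped with", "Sn"),
]

def heads_for(edges, relation, tail):
    """Return every head entity linked to `tail` by `relation`."""
    return [h for h, r, t in edges if r == relation and t == tail]

# "What was doped with Sn?" becomes a lookup instead of a guess.
print(heads_for(edges, "was doped with", "Sn"))  # ['ZnS']
```

With entities alone (ZnO, ZnS, Sm, Sn, 5 at. %) the same question cannot be answered, which is the limitation of plain NER that the previous slide points out.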
  • 46. • Earlier, dependency extraction was done using grammar rules (e.g. dependency trees) but it was not particularly successful • We have been experimenting with large seq2seq transformer models • These can take in an unstructured text sequence and output a structured text sequence (e.g., OpenAI Codex, which solves programming tasks) • Can be trained with few (<50) examples due to few-shot capability 46 Utilizing large seq2seq models for entity relationship modeling (ERM) Raw scientific text (example): “Transition metal doping is an effective tool for controlling optical absorption in ZnS and hence the number of photons absorbed by photovoltaic devices. By using first principle density functional calculations, we compute the change in number of photons absorbed upon doping with a selected transition metal and found that Ni offers the best chance to improve the performance. This is attributed to the formation of defect states in the band gap of the host ZnS which give rise to additional dipole-allowed optical transition pathways between the conduction and valence band. Analysis of the defect level in the band gap shows that TM dopants do not pin Fermi levels in ZnS and hence the host can be made n- or p- type with other suitable dopants. The measured optical spectra from the doped solution processed ZnS nanocrystal supports our theoretical finding that Ni doping enhances optical absorption the most compared to Co and Mn doping.” Figure labels: Raw scientific text; Seq2seq Model; Trained on intermediate reps.; Entity Relationships; Output sequence; Input seq; Output seq; Deterministic decoding
  • 47. • Previous NER experiments can be extended with ERM to include much more information 47 Applying ERM to Dopant/Host extraction Seq2seq model (unstructured to structured) output: “CaCu3Ti4-xCoxO12 is a doped result with descriptor ceramic and phase cubic from base material CaCu3Ti4O12 (AKA calcium copper titanate) and dopant Co + 2 (AKA cobalt).” Manual parser output: { "basemats": { 0: { "aliases": ["CaCuTi4O12", "calcium copper titanate"], "descriptor": null, ...}}, "dopants": { 0: { "aliases": ["Co+2", "cobalt"], ...}}, "results": { 0: { "aliases": "CaCu3Ti$_{\bf 4-\emph{x}}$Co$_{\bf\emph{x}}$O12", "linked_basemats": [0], "linked_dopants": [0], "descriptors": ["ceramics"], ...}} }
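The "manual parser" step, which turns the model's templated sentence into a structured record, could be sketched with a regular expression. The pattern below is an assumption built only from the single example sentence on this slide; the real intermediate grammar and parser are not shown in the slides, and the function name is hypothetical.

```python
import re

# Pattern inferred from the one example sentence on the slide (an assumption).
PATTERN = re.compile(
    r"^(?P<result>\S+) is a doped result with descriptor (?P<descriptor>\w+) "
    r"and phase (?P<phase>\w+) from base material (?P<base>\S+) "
    r"\(AKA (?P<base_alias>[^)]+)\) and dopant (?P<dopant>[^(]+?) "
    r"\(AKA (?P<dopant_alias>[^)]+)\)\.?$"
)

def parse_doping_sentence(sentence):
    """Deterministically decode one templated sentence into a nested dict, or None."""
    m = PATTERN.match(sentence.strip())
    if m is None:
        return None
    return {
        "basemats": {0: {"aliases": [m["base"], m["base_alias"]]}},
        "dopants": {0: {"aliases": [m["dopant"], m["dopant_alias"]]}},
        "results": {0: {
            "aliases": [m["result"]],
            "linked_basemats": [0],
            "linked_dopants": [0],
            "descriptors": [m["descriptor"]],
            "phase": m["phase"],
        }},
    }

sentence = (
    "CaCu3Ti4-xCoxO12 is a doped result with descriptor ceramic and phase cubic "
    "from base material CaCu3Ti4O12 (AKA calcium copper titanate) and dopant Co + 2 (AKA cobalt)."
)
print(parse_doping_sentence(sentence))
```

Because the decoding is a fixed pattern match, any sentence the model emits outside the template returns None and can be flagged rather than silently mis-parsed.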
  • 48. 48 With this capability, we plan to release structured materials properties databases based on NLP parsing of literature. For example, we hope to parse a literature-derived database of dopants and dopability:
    Sentence | Base Material | Dopant | Doping Concentr.
    …the influence of yttrium doping (0-10mol%) on BSCF… | BSCF | Yttrium | 0-10 mol%
    undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln. of Mg10Si2Sn3… | Mg10Si2Sn3 | Sb, Bi, Ca, Zn | (not given)
    The zT of As2Cd3 with electron doping is found to be ~ with n=10^20cm-3 | As2Cd3 | electron | n=10^20cm-3
    This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type As2Cd3T | As2Cd3 | p-type | p=10^20cm-3
    The undoped and 0.25wt% La doped CdO films show 111… …however, …. for doping concentrations greater than 0.50wt%. | CdO | La | 0.25wt%, >0.5%
    Which elements are commonly doped into the same materials (i.e., co-occur as dopants)?
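Once host and dopant pairs are in structured form, the co-occurrence question on this slide reduces to counting. The sketch below uses the elemental dopants from the example rows; the record format and function name are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def dopant_cooccurrence(records):
    """Count how often each pair of dopants is reported for the same base material."""
    by_host = defaultdict(set)
    for host, dopant in records:
        by_host[host].add(dopant)
    pairs = Counter()
    for dopants in by_host.values():
        # Sorting makes (Bi, Sb) and (Sb, Bi) the same key.
        pairs.update(combinations(sorted(dopants), 2))
    return pairs

# (base material, dopant) records taken from the example table rows.
records = [
    ("BSCF", "Yttrium"),
    ("Mg10Si2Sn3", "Sb"), ("Mg10Si2Sn3", "Bi"),
    ("Mg10Si2Sn3", "Ca"), ("Mg10Si2Sn3", "Zn"),
    ("CdO", "La"),
]
for pair, count in sorted(dopant_cooccurrence(records).items()):
    print(pair, count)
```

With only a handful of rows every pair occurs once; the counts become informative when the same aggregation runs over a literature-scale table.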
  • 49. • Experimentalists identified relevant factors for gold nanorod dimensions – Experimental temperatures – Solution ages/timing – Precursor amounts 49 We can also tackle complex syntheses if we can do entity relationship modeling Seq2Seq model outputs JSON (form of entity graph) "seed": { "prec": { "HAuCl4": { "vol": "5 mL", "concn": "0.25 mM" }, "CTAB": { "vol": "HAuCl4", "concn": "0.1 M" }, "NaBH4": { "vol": "0.3 mL", "concn": "10 mM" } }, "seed": { "size": "3 nm" }, "temp": "25 degC", "age": "5 min" }, Types of factors important in synthesis Values as extracted from raw text
  • 50. 50 Tests on Au Nanorod Synthesis indicate it is working. Aggregated scores by AuNR recipe component:
    Metric | Seed Solution (age, stir rate, temperature, precursor properties, seed properties) | Growth Solution (age, stir rate, temperature, precursor properties) | AuNR (aspect ratios, lengths, widths, TSPRs, and LSPRs)
    Entity detected (F1 score) | 0.94 | 0.92 | 0.76
    Exact match to entity (accuracy) | 0.73 | 0.77 | 0.52
    Support | 159 | 244 | 96
    Evaluated on 40 test paragraphs. Trained on 40 (manual annotation) and 200 (assisted) paragraphs. Entity detected = we correctly detected the types of synthesis information present. Exact match = the extracted synthesis information is an exact string match.
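The difference between "entity detected" and "exact match" can be shown on a single extracted record. This is a sketch under assumptions: the slides do not give the exact scoring procedure, so here detection is scored on which fields are present and exact match on string equality over the fields both records share. The field names and values are illustrative.

```python
def detection_f1(gold, predicted):
    """F1 over field names: did we detect the right types of synthesis information?"""
    found = gold.keys() & predicted.keys()
    if not found:
        return 0.0
    precision = len(found) / len(predicted)
    recall = len(found) / len(gold)
    return 2 * precision * recall / (precision + recall)

def exact_match_accuracy(gold, predicted):
    """Share of detected fields whose extracted string equals the annotated string."""
    found = gold.keys() & predicted.keys()
    if not found:
        return 0.0
    return sum(gold[k] == predicted[k] for k in found) / len(found)

# Hypothetical annotation vs. model output for one seed-solution paragraph.
gold = {"temp": "25 degC", "age": "5 min", "seed_size": "3 nm"}
predicted = {"temp": "25 C", "age": "5 min"}

print(round(detection_f1(gold, predicted), 2))          # 0.8
print(round(exact_match_accuracy(gold, predicted), 2))  # 0.5
```

A value such as "25 C" counts as a detection but not as an exact match, which is why the exact-match row in the table is lower than the F1 row.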
  • 51. The types of features we want to enable 51 Zinc oxide ZnO OZn Chemistry aware search (same input, same results) Summary data • Physical properties • Synthesis information • Known applications ferroelectrics All known compositions (PbTiO3, BaTiO3, etc.) Links to computational databases Zn0.5O0.5 new thermoelectrics Composition A Composition B Composition A synthesis Composition B synthesis Known; summary of all previous syntheses Unknown; suggested synthesis protocol ???
  • 52. 52 Note – we are creating open-source libraries to help with NLP tasks https://github.com/lbnlp
  • 53. • There exists a lot of data and knowledge in the historical corpus of scientific journal articles, but getting the knowledge has been difficult to do on a large scale • Machine learning presents a new frontier for being able to make use of this information 53 Conclusion
  • 54. 54 The Matscholar team Funding from: Slides (already) posted to hackingmaterials.lbl.gov John Dagdelen Alex Dunn Viktoriia Baibakova Nick Walker Kristin Persson Anubhav Jain Gerbrand Ceder Leigh Weston Vahe Tshitoyan Amalie Trewartha alumni Sanghoon Lee