Natural Language Processing for Data Extraction
and Synthesizability Prediction from the Energy
Materials Literature
Anubhav Jain
Lawrence Berkeley National Laboratory
MRS Fall meeting, Nov 2022
Slides (already) posted to hackingmaterials.lbl.gov
Literature data can be a key source of materials learning
[Figure: three plan → synthesize → characterize → analyze loops: Automated Lab A (local db + ML), Conventional Lab B, Automated Lab C (local db + ML)]
Literature data
+ broad coverage
– difficult to parse
– lack negative examples
– reproducibility
Other A-lab data
+ structured data formats
+ negative examples
– not much out there …
Theory data
+ readily available
– difficult to establish relevance to synthesis
– computation time
Several research groups are now attempting to
collect data sets from the research literature
Weston, L. et al. Named Entity Recognition
and Normalization Applied to Large-Scale
Information Extraction from the Materials
Science Literature. J. Chem. Inf. Model.
(2019)
Recently, we also tried BERT variants
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
Models were good for labeling entities, but
didn’t understand relationships
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.;
Cruse, K.; Dagdelen, J.; Dunn, A.; Persson,
K. A.; Ceder, G.; Jain, A. Quantifying the
Advantage of Domain-Specific Pre-Training
on Named Entity Recognition Tasks in
Materials Science. Patterns 2022, 3 (4),
100488.
Relationships have usually been extracted
via either manual or semi-automated
regular expression construction along
with grammar tree analysis, e.g.
ChemDataExtractor – can be tedious!
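To illustrate why hand-built regular expressions become tedious, here is a minimal Python sketch. It does not use ChemDataExtractor's API; the pattern, the function name `extract_temperatures`, and both example sentences are made up for illustration. One pattern catches one phrasing and misses a simple paraphrase:

```python
import re

# Hypothetical pattern: "<material> was <verb>ed at <number> °C"
PATTERN = re.compile(
    r"(?P<material>[A-Z][A-Za-z0-9]*) was (?P<operation>\w+ed) at "
    r"(?P<value>\d+(?:\.\d+)?)\s*°C"
)

def extract_temperatures(sentence):
    """Return (material, operation, temperature in °C) tuples found by the pattern."""
    return [
        (m.group("material"), m.group("operation"), float(m.group("value")))
        for m in PATTERN.finditer(sentence)
    ]

matched = extract_temperatures("BiFeO3 was annealed at 550 °C for 2 h.")
missed = extract_temperatures("Annealing of BiFeO3 at 550 °C gave a pure phase.")
print(matched)  # [('BiFeO3', 'annealed', 550.0)]
print(missed)   # []
```

Covering the second phrasing would need another pattern, and so on for every way authors describe the same relationship.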
Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
A Sequence-to-Sequence Approach
• Language model takes a sequence of tokens
as input and outputs a sequence of tokens
• Maximizes the likelihood of the output
conditioned on the input
• Additionally includes task conditioning, which can
learn the desired format for outputs
• We have explored this extensively with
OpenAI’s GPT-3, which has 175 billion
parameters
• Interact with the model through their (paid) API,
although costs are relatively modest
• Capacity for “understanding” language as well
as “world knowledge”
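In standard notation (not taken from the slides), the likelihood objective for an autoregressive sequence-to-sequence model can be written as follows, where x is the input token sequence, y = (y_1, ..., y_T) is the output token sequence, and θ denotes the model parameters:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\, y)} \log p_{\theta}(y \mid x),
\qquad
p_{\theta}(y \mid x) = \prod_{t=1}^{T} p_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
```

The sum runs over the (input, output) pairs in the training set; each output token is predicted from the input and the output tokens before it.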
How a sequence-to-sequence approach works
[Figure: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]
Another example
[Figure: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]
Structured data
[Figure: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]
But it’s not perfect for technical data
[Figure: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]
A workflow for fine-tuning GPT-3
1. Fill an initial training set of templates mostly by hand, since zero-shot GPT-3 is often poor at technical tasks
2. Fine-tune the model to fill templates, then use the model to assist in annotation
3. Repeat as necessary until the desired inference accuracy is achieved
This procedure can extract complex,
hierarchical relationships between entities
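As a rough sketch of the first step, annotated paragraphs can be serialized as prompt/completion pairs, one JSON object per line (JSONL), which is the layout the legacy GPT-3 fine-tuning endpoint accepted. The separator strings, the helper name `to_finetune_record`, and the example annotation are assumptions for illustration, not the authors' actual settings:

```python
import json

# Assumed markers: a separator ending each prompt, a stop token ending each completion.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = " END"

def to_finetune_record(paragraph, annotation):
    """Pair a text paragraph with its filled template as one training record."""
    return {
        "prompt": paragraph.strip() + PROMPT_END,
        "completion": " " + json.dumps(annotation, sort_keys=True) + COMPLETION_END,
    }

# Hypothetical hand-annotated example (not from the talk's data set).
examples = [
    ("Gold seeds were capped with CTAB.",
     {"seed": {"capping_agent": "CTAB", "metal": "Au"}}),
]

# One JSON object per line; newlines inside strings are escaped by json.dumps.
jsonl = "\n".join(json.dumps(to_finetune_record(p, a)) for p, a in examples)
print(jsonl)
```

Model-assisted annotation in the second step would then amount to correcting predicted completions and appending them to the same file.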
Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature data
Templated extraction of synthesis recipes
• Annotate paragraphs to output
structured recipe templates
• JSON-format
• Designed using domain knowledge
from experimentalists
• Template is relation graph to be
filled in by model
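A minimal sketch of what such a JSON template and a structural check on a model completion might look like. The field names are hypothetical placeholders, not the schema used in the talk, and the completion string is made up:

```python
import json

# Hypothetical recipe template: nested keys encode the relations the model must fill.
TEMPLATE = {
    "seed_solution": {"precursors": [], "capping_agent": None},
    "growth_solution": {"precursors": [], "surfactant": None},
    "product": {"shape": None},
}

def matches_template(filled, template):
    """Check that `filled` has exactly the template's keys at every nesting level."""
    if not isinstance(template, dict):
        return True  # leaf values (strings, lists, None) are left unconstrained
    if not isinstance(filled, dict) or set(filled) != set(template):
        return False
    return all(matches_template(filled[k], template[k]) for k in template)

# A made-up completion string, standing in for model output.
completion = """{
  "seed_solution": {"precursors": ["HAuCl4", "NaBH4"], "capping_agent": "CTAB"},
  "growth_solution": {"precursors": ["HAuCl4", "AgNO3", "AA"], "surfactant": "CTAB"},
  "product": {"shape": "rod"}
}"""

recipe = json.loads(completion)
print(matches_template(recipe, TEMPLATE))           # True
print(matches_template({"product": {}}, TEMPLATE))  # False
```

Because the relations live in the nesting itself, a completion that parses and matches the template already says which precursor belongs to which solution.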
Example Extraction for Au nanorod synthesis
Note: we are still formally evaluating performance; there are various
issues in getting an accurate evaluation, e.g., predictions
that are functionally correct but written differently
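One way to handle predictions that are correct but written differently is to normalize predicted and reference values before comparing them. The sketch below is an assumption about how such a check could look, not the evaluation used in the talk; the synonym table and unit factors are illustrative and far from exhaustive:

```python
import re

# Illustrative synonym map (assumed; a real one would be much larger).
SYNONYMS = {
    "chloroauric acid": "HAuCl4",
    "ascorbic acid": "AA",
    "cetyltrimethylammonium bromide": "CTAB",
}
# Concentration units converted to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6}

def normalize_name(name):
    """Map a chemical name to a canonical label, ignoring case and extra spaces."""
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, name.strip())

def normalize_concentration(text):
    """Parse strings like '0.1 M' or '100 mM' into a value in mol/L."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(M|mM|uM)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized concentration: {text!r}")
    return float(match.group(1)) * UNIT_TO_MOLAR[match.group(2)]

def same_entry(pred, ref, rel_tol=1e-6):
    """Compare (name, concentration) pairs after normalization."""
    same_name = normalize_name(pred[0]) == normalize_name(ref[0])
    a, b = normalize_concentration(pred[1]), normalize_concentration(ref[1])
    return same_name and abs(a - b) <= rel_tol * max(abs(a), abs(b))

print(same_entry(("Chloroauric acid", "100 mM"), ("HAuCl4", "0.1 M")))  # True
print(same_entry(("CTAB", "0.1 M"), ("CTAB", "0.2 M")))                 # False
```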
Analyzing AuNR synthesis data set
Note that this data set was collected manually via hand-tuned
regular expressions, not NLP or GPT-3, as it was done in parallel
to that work.
We are currently comparing the pros and cons of the manual approach
and the GPT-3 approach.
Representing recipes as precursor vectors for machine learning
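The slide's figure is not reproduced in this transcript, so the following is only a hedged illustration of the general idea: each recipe is mapped to a fixed-length vector with one slot per known precursor, holding its concentration (0 if absent). The vocabulary, the units, and the example values are invented for illustration:

```python
# Assumed fixed vocabulary; slot order defines the vector layout.
PRECURSORS = ["HAuCl4", "CTAB", "AA", "AgNO3", "HCl", "NaBH4"]
INDEX = {name: i for i, name in enumerate(PRECURSORS)}

def recipe_to_vector(recipe):
    """Turn {precursor: concentration in mM} into a fixed-length list of floats."""
    vector = [0.0] * len(PRECURSORS)
    for name, conc_mm in recipe.items():
        if name not in INDEX:
            raise KeyError(f"precursor not in vocabulary: {name}")
        vector[INDEX[name]] = float(conc_mm)
    return vector

# Made-up growth solution, concentrations in mM.
growth = {"HAuCl4": 0.5, "CTAB": 100, "AA": 0.8, "AgNO3": 0.1}
print(recipe_to_vector(growth))
# [0.5, 100.0, 0.8, 0.1, 0.0, 0.0]
```

Vectors of this kind can be fed directly to standard models such as decision trees.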
Training a decision tree to predict AuNR
shape shows similar conclusions as literature
[Figure: decision tree for AuNR shape; leaf labels include Rod, Cube, Bipyramid, Star, and None]
• Decision tree shows seed capping
agent type as first decision
boundary for shape determination
• “Citrate-capped gold seeds form
penta-twinned structure, while
CTAB-capped seeds are single
crystalline, hence former leads to
bipyramids and latter leads to
rods” [1, 2]
[1] Liu and Guyot-Sionnest, J. Phys. Chem. B 2005, 109 (47), 22192-22200
[2] Grzelczak et al., Chem. Soc. Rev. 2008, 37, 1783-1791
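To make the idea of a first decision boundary concrete, the sketch below picks the categorical feature whose split gives the lowest weighted Gini impurity, the criterion CART-style decision trees commonly use at the root. It groups rows by every category value, which is a simplification of CART's binary splits. The six toy recipes are invented to mirror the quoted heuristic and are not drawn from the talk's data set:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after grouping rows by a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def best_first_split(rows, features, target):
    """Return the feature giving the purest groups, i.e. the root split."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Toy, invented data: seed capping agent separates shapes; AgNO3 use does not.
rows = [
    {"seed_capping": "CTAB", "uses_AgNO3": True, "shape": "rod"},
    {"seed_capping": "CTAB", "uses_AgNO3": False, "shape": "rod"},
    {"seed_capping": "CTAB", "uses_AgNO3": True, "shape": "rod"},
    {"seed_capping": "citrate", "uses_AgNO3": True, "shape": "bipyramid"},
    {"seed_capping": "citrate", "uses_AgNO3": False, "shape": "bipyramid"},
    {"seed_capping": "citrate", "uses_AgNO3": True, "shape": "bipyramid"},
]

print(best_first_split(rows, ["seed_capping", "uses_AgNO3"], "shape"))
# seed_capping
```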
We also see some effect of AgNO3
concentration on AuNR size, but data is noisy
N. D. Burrows et al., Langmuir 2017 33 (8), 1891-1907
[Figure: three panels, labeled by growth solution]
growth: HAuCl4, CTAB, AA, AgNO3
growth: HAuCl4, CTAB, AA, AgNO3, with HAuCl4/CTAB < 0.01 filter
growth: HAuCl4, CTAB, AA, AgNO3 + HCl
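The HAuCl4/CTAB ratio filter named in the second panel can be sketched as a simple predicate over recipes. The field names and concentrations below are hypothetical; only the threshold (HAuCl4/CTAB < 0.01) comes from the slide:

```python
def passes_ratio_filter(recipe, threshold=0.01):
    """Keep recipes whose HAuCl4/CTAB concentration ratio is below the threshold."""
    ctab = recipe.get("CTAB_mM", 0.0)
    if ctab <= 0:
        return False  # ratio undefined without CTAB
    return recipe.get("HAuCl4_mM", 0.0) / ctab < threshold

# Invented example recipes (concentrations in mM).
recipes = [
    {"id": "a", "HAuCl4_mM": 0.5, "CTAB_mM": 100.0},  # ratio 0.005
    {"id": "b", "HAuCl4_mM": 2.0, "CTAB_mM": 100.0},  # ratio 0.02
    {"id": "c", "HAuCl4_mM": 0.5},                    # no CTAB reported
]

kept = [r["id"] for r in recipes if passes_ratio_filter(r)]
print(kept)  # ['a']
```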
Overall thoughts on AuNR data set
• The seq2seq method shows good capability for
extracting complex nanorod synthesis data
• We plan to integrate it into our own pipeline to replace
manual regex for relationship extraction
• Using machine learning to generate hypotheses about
AuNR shape and size is messy
• Data sets are messy, and not particularly large
• Nevertheless, it is encouraging that conclusions from the
literature can be automatically found by machine learning
Outline
• Using sequence-to-sequence models for combined entity
detection and relationship extraction
• Analyzing synthesis of Au nanorods using literature data
• Analyzing synthesis of phase-pure BiFeO3 using literature
data
Seq2Seq approach for solid state synthesis
Initial tests of the seq2seq method on solid-state synthesis show encouraging results, but further testing is needed
For now, we use manual data extraction to
tackle the problem of BiFeO3 synthesis
340 total synthesis recipes (from 178 articles); 57 features per recipe
Machine learning (decision tree) predictions
are in line with common knowledge
Missing synthesis information – can it be
recovered / reproduced easily?
[Figure legend: could not reproduce, partially reproducible, reproducible]
Exploring unexplored portions of synthesis
space
These decision trees are interpretable, but are they physical?
Conclusions
• As large language models grow larger and more capable, they are able to parse
increasingly complex scientific text into structured formats
• Applying NLP + ML on synthesis data sets shows that scientific heuristics can be
automatically uncovered, which is promising
• Nevertheless, issues remain in applying NLP to predictive synthesis
• Reproducibility / missing information / conflicting information
• General lack of negative examples
• Unknown data quality
• Thus, results from such techniques will likely need to be treated as initial
hypotheses to be complemented by further experiments
Acknowledgements
NLP (seq2seq)
• Alex Dunn
• John Dagdelen
• Nick Walker
• Sanghoon Lee
• Amalie Trewartha
Funding provided by:
• U.S. Department of Energy, Basic Energy Sciences, “D2S2” program
• Toyota Research Institute, Accelerated Materials Design program
AuNR analysis
• Sanghoon Lee
• Sam Gleason
• Kevin Cruse
BiFeO3 analysis
• Kevin Cruse
• Viktoriia Baibakova
• Maged Abdelsamie
• Kootak Hong
• Carolin Sutter-Fella
• Gerbrand Ceder
Sol-gel synthesis of BiFeO3
28

More Related Content

What's hot

Graphs, Environments, and Machine Learning for Materials Science
Graphs, Environments, and Machine Learning for Materials ScienceGraphs, Environments, and Machine Learning for Materials Science
Graphs, Environments, and Machine Learning for Materials Science
aimsnist
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
Anubhav Jain
 
The Materials Project: An Electronic Structure Database for Community-Based M...
The Materials Project: An Electronic Structure Database for Community-Based M...The Materials Project: An Electronic Structure Database for Community-Based M...
The Materials Project: An Electronic Structure Database for Community-Based M...
Anubhav Jain
 
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
aimsnist
 
【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...
【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...
【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...
Deep Learning JP
 
Machine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry
Machine Learning for Molecules: Lessons and Challenges of Data-Centric ChemistryMachine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry
Machine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry
Ichigaku Takigawa
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
Anubhav Jain
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Anubhav Jain
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
jeykottalam
 
[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...
[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...
[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...
Deep Learning JP
 
PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...
PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...
PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...
Preferred Networks
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
Anubhav Jain
 
230309_LoRa
230309_LoRa230309_LoRa
230309_LoRa
YongSang Yoo
 
IPAB2017 深層学習を使った新薬の探索から創造へ
IPAB2017 深層学習を使った新薬の探索から創造へIPAB2017 深層学習を使った新薬の探索から創造へ
IPAB2017 深層学習を使った新薬の探索から創造へ
Preferred Networks
 
Materials Design in the Age of Deep Learning and Quantum Computation
Materials Design in the Age of Deep Learning and Quantum ComputationMaterials Design in the Age of Deep Learning and Quantum Computation
Materials Design in the Age of Deep Learning and Quantum Computation
KAMAL CHOUDHARY
 
SIGNATE オフロードコンペ 精度認識部門 3rd Place Solution
SIGNATE オフロードコンペ 精度認識部門 3rd Place SolutionSIGNATE オフロードコンペ 精度認識部門 3rd Place Solution
SIGNATE オフロードコンペ 精度認識部門 3rd Place Solution
Yusuke Uchida
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
Anubhav Jain
 
Design pattern cheat sheet
Design pattern cheat sheetDesign pattern cheat sheet
Design pattern cheat sheet
Rachanee Saengkrajai
 
Simple Programme Gantt Chart with RAG Status
Simple Programme Gantt Chart with RAG StatusSimple Programme Gantt Chart with RAG Status
Simple Programme Gantt Chart with RAG Status
Mark Ritchie
 
[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...
[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...
[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...
Deep Learning JP
 

What's hot (20)

Graphs, Environments, and Machine Learning for Materials Science
Graphs, Environments, and Machine Learning for Materials ScienceGraphs, Environments, and Machine Learning for Materials Science
Graphs, Environments, and Machine Learning for Materials Science
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
The Materials Project: An Electronic Structure Database for Community-Based M...
The Materials Project: An Electronic Structure Database for Community-Based M...The Materials Project: An Electronic Structure Database for Community-Based M...
The Materials Project: An Electronic Structure Database for Community-Based M...
 
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i...
 
【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...
【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...
【DL輪読会】Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space...
 
Machine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry
Machine Learning for Molecules: Lessons and Challenges of Data-Centric ChemistryMachine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry
Machine Learning for Molecules: Lessons and Challenges of Data-Centric Chemistry
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...
[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...
[DL輪読会]Invariance Principle Meets Information Bottleneck for Out-of-Distribut...
 
PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...
PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...
PFN Summer Internship 2021 / Kohei Shinohara: Charge Transfer Modeling in Neu...
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
230309_LoRa
230309_LoRa230309_LoRa
230309_LoRa
 
IPAB2017 深層学習を使った新薬の探索から創造へ
IPAB2017 深層学習を使った新薬の探索から創造へIPAB2017 深層学習を使った新薬の探索から創造へ
IPAB2017 深層学習を使った新薬の探索から創造へ
 
Materials Design in the Age of Deep Learning and Quantum Computation
Materials Design in the Age of Deep Learning and Quantum ComputationMaterials Design in the Age of Deep Learning and Quantum Computation
Materials Design in the Age of Deep Learning and Quantum Computation
 
SIGNATE オフロードコンペ 精度認識部門 3rd Place Solution
SIGNATE オフロードコンペ 精度認識部門 3rd Place SolutionSIGNATE オフロードコンペ 精度認識部門 3rd Place Solution
SIGNATE オフロードコンペ 精度認識部門 3rd Place Solution
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Design pattern cheat sheet
Design pattern cheat sheetDesign pattern cheat sheet
Design pattern cheat sheet
 
Simple Programme Gantt Chart with RAG Status
Simple Programme Gantt Chart with RAG StatusSimple Programme Gantt Chart with RAG Status
Simple Programme Gantt Chart with RAG Status
 
[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...
[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...
[DL輪読会]Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Ima...
 

Similar to Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature

Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...
Anubhav Jain
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Anubhav Jain
 
Thesis def
Thesis defThesis def
Thesis def
Jay Vyas
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
c.titus.brown
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Databricks
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningjaumebp
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
Yannick Wurm
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscoverygwprice
 
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Masahito Ohue
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
DNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachDNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachEditor IJCATR
 
An interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patternsAn interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patterns
Ravi Kumar
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..butest
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
David Gleich
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
c.titus.brown
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
Mark Gerstein
 
The Smart Way To Invest in AI and ML_SFStartupDay
The Smart Way To Invest in AI and ML_SFStartupDayThe Smart Way To Invest in AI and ML_SFStartupDay
The Smart Way To Invest in AI and ML_SFStartupDay
Amazon Web Services
 

Similar to Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature (20)

Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
 
Thesis def
Thesis defThesis def
Thesis def
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learning
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
 
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
Link Mining for Kernel-based Compound-Protein Interaction Predictions Using a...
 
ADPosterFinal
ADPosterFinalADPosterFinal
ADPosterFinal
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
DNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel ApproachDNA Query Language DNAQL: A Novel Approach
DNA Query Language DNAQL: A Novel Approach
 
An interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patternsAn interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patterns
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
 
The Smart Way To Invest in AI and ML_SFStartupDay
The Smart Way To Invest in AI and ML_SFStartupDayThe Smart Way To Invest in AI and ML_SFStartupDay
The Smart Way To Invest in AI and ML_SFStartupDay
 

More from Anubhav Jain

Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...
Anubhav Jain
 
Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
Anubhav Jain
 
An AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesisAn AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesis
Anubhav Jain
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
Anubhav Jain
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
Anubhav Jain
 
Machine Learning for Catalyst Design
Machine Learning for Catalyst DesignMachine Learning for Catalyst Design
Machine Learning for Catalyst Design
Anubhav Jain
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAccelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine Learning
Anubhav Jain
 
DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …
Anubhav Jain
 
The Materials Project
The Materials ProjectThe Materials Project
The Materials Project
Anubhav Jain
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...
Anubhav Jain
 
Perspectives on chemical composition and crystal structure representations fr...
Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature

  • 1. Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature. Anubhav Jain, Lawrence Berkeley National Laboratory. MRS Fall Meeting, Nov 2022. Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Literature data can be a key source of materials learning. [Diagram: three Plan, Synthesize, Characterize, Analyze loops: Automated Lab A (local db + ML), Conventional Lab B, and Automated Lab C (local db + ML).] Literature data: + broad coverage; – difficult to parse; – lack negative examples; – reproducibility. Other A-lab data: + structured data formats; + negative examples; – not much out there … Theory data: + readily available; – difficult to establish relevance to synthesis; – computation time
  • 3. Several research groups are now attempting to collect data sets from the research literature. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019). Recently, we also tried BERT variants: Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  • 4. Models were good for labeling entities, but didn’t understand relationships. Named Entity Recognition: • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. (Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.) Relationships have usually been extracted via either manual or semi-automated regular expression construction along with grammar tree analysis, e.g. ChemDataExtractor, which can be tedious!
  • 5. Outline • Using sequence-to-sequence models for combined entity detection and relationship extraction • Analyzing synthesis of Au nanorods using literature data • Analyzing synthesis of phase-pure BiFeO3 using literature data
  • 6. A Sequence-to-Sequence Approach • Language model takes a sequence of tokens as input and outputs a sequence of tokens • Maximizes the likelihood of the output conditioned on the input • Additionally includes task conditioning, which can learn the desired format for outputs • We’ve done many explorations now with OpenAI’s GPT-3, which has 175 billion parameters • We interact with the model through their (paid) API, although costs are relatively modest • Capacity for “understanding” language as well as “world knowledge”
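The likelihood objective named on slide 6 can be written out explicitly. This is the standard autoregressive form for a language model, not a formula taken from the slides; the symbols x (prompt tokens), y (completion tokens), θ (model parameters), and D (the set of training pairs) are notation introduced here for illustration.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{(x,\,y)\,\in\,\mathcal{D}} \;\sum_{t=1}^{|y|} \log p_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
```

Each completion token is scored given the prompt and the completion tokens before it, which is why a fine-tuned model can learn an output format as well as the content.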
  • 7. How a sequence-to-sequence approach works. [Figure: text in (“prompt”), Seq2Seq model (GPT-3), text out (“completion”)]
  • 8. Another example. [Figure: text in (“prompt”), Seq2Seq model (GPT-3), text out (“completion”)]
  • 9. Structured data. [Figure: text in (“prompt”), Seq2Seq model (GPT-3), text out (“completion”)]
  • 10. But it’s not perfect for technical data. [Figure: text in (“prompt”), Seq2Seq model (GPT-3), text out (“completion”)]
  • 11. A workflow for fine-tuning GPT-3: 1. Initial training set of templates filled mostly manually, as zero-shot GPT is often poor for technical tasks. 2. Fine-tune model to fill templates, use the model to assist in annotation. 3. Repeat as necessary until desired inference accuracy is achieved.
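Step 2 of the workflow on slide 11 (the model drafts a template, a person corrects it, and the corrected pair becomes training data) might be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the "prompt"/"completion" record keys, the "###" separator, the " END" stop marker, and the functions `fake_predict` and `fake_review` are all hypothetical, and the example sentence is made up.

```python
import json


def make_record(paragraph, filled_template):
    """Pair one paragraph with its filled template as a training example."""
    return {
        # Separator and stop marker are assumed conventions, not from the slides.
        "prompt": paragraph.strip() + "\n\n###\n\n",
        "completion": " " + json.dumps(filled_template, sort_keys=True) + " END",
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r) for r in records)


def annotation_round(paragraphs, predict, review):
    """One pass of step 2: the model drafts a template, a human corrects it."""
    records = []
    for paragraph in paragraphs:
        draft = predict(paragraph)            # model-assisted first guess
        corrected = review(paragraph, draft)  # human fixes the draft
        records.append(make_record(paragraph, corrected))
    return records


# Hypothetical stand-ins for the fine-tuned model and the human annotator.
def fake_predict(paragraph):
    return {"material": None}


def fake_review(paragraph, draft):
    fixed = dict(draft)
    fixed["material"] = "BiFeO3" if "BiFeO3" in paragraph else "unknown"
    return fixed


records = annotation_round(
    ["BiFeO3 was prepared by a sol-gel route."], fake_predict, fake_review
)
jsonl = to_jsonl(records)
```

Repeating `annotation_round` with a model retrained on the accumulated records corresponds to step 3.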
  • 12. This procedure can extract complex, hierarchical relationships between entities
  • 13. Outline • Using sequence-to-sequence models for combined entity detection and relationship extraction • Analyzing synthesis of Au nanorods using literature data • Analyzing synthesis of phase-pure BiFeO3 using literature data
  • 14. Templated extraction of synthesis recipes • Annotate paragraphs to output structured recipe templates • JSON format • Designed using domain knowledge from experimentalists • The template is a relation graph to be filled in by the model
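The idea on slide 14 of a JSON template that the model fills in can be illustrated with a small shape check on model output. This is a sketch only: the `TEMPLATE` keys below are hypothetical and are not the schema designed with the experimentalists, and the `good` and `bad` strings are made-up completions.

```python
import json

# Hypothetical template: the keys are illustrative, not the schema from the talk.
TEMPLATE = {
    "target": "",
    "precursors": [{"name": "", "amount": ""}],
    "conditions": {"temperature": "", "time": ""},
}


def matches_template(value, template):
    """Return True if `value` has the same nested shape as `template`."""
    if isinstance(template, dict):
        return (
            isinstance(value, dict)
            and set(value) == set(template)
            and all(matches_template(value[k], template[k]) for k in template)
        )
    if isinstance(template, list):
        # A template list holds one example item that every entry must match.
        return isinstance(value, list) and all(
            matches_template(item, template[0]) for item in value
        )
    return isinstance(value, str)


def parse_completion(text):
    """Parse a model completion; return the dict, or None if it is unusable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data if matches_template(data, TEMPLATE) else None


good = (
    '{"target": "Au nanorods", '
    '"precursors": [{"name": "HAuCl4", "amount": ""}], '
    '"conditions": {"temperature": "", "time": ""}}'
)
bad = '{"target": "Au nanorods", "precursors": "HAuCl4"}'

parsed = parse_completion(good)
rejected = parse_completion(bad)
```

Rejecting completions that do not match the template keeps malformed output from entering the data set, at the cost of discarding answers that are functionally correct but shaped differently.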
  • 15. Example extraction for Au nanorod synthesis. Note: we are still formally evaluating performance; there are various issues in getting an accurate evaluation, e.g., predictions that are functionally correct but written differently
  • 16. Analyzing AuNR synthesis data set. Representing recipes as precursor vectors for machine learning. Note that this data set was collected manually via hand-tuned regular expressions, not NLP or GPT-3, as it was done in parallel to that work. We are currently looking at pros/cons of the manual approach vs. the GPT-3 approach.
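Representing recipes as precursor vectors, as mentioned on slide 16, can be sketched as mapping each recipe onto a fixed-length list with one slot per precursor. The slides do not specify the encoding, so the choice of amounts as entries and zero for absent precursors is an assumption; the numbers below are made up and in arbitrary units.

```python
def build_vocabulary(recipes):
    """Sorted list of every precursor name seen across the recipes."""
    return sorted({name for recipe in recipes for name in recipe})


def to_vector(recipe, vocabulary):
    """Map one recipe (name -> amount) to a fixed-length list; absent = 0.0."""
    return [float(recipe.get(name, 0.0)) for name in vocabulary]


# Made-up amounts in arbitrary units, for illustration only.
recipes = [
    {"HAuCl4": 0.5, "CTAB": 100.0, "AA": 0.8, "AgNO3": 0.1},
    {"HAuCl4": 0.5, "CTAB": 100.0, "AA": 0.8},
]

vocabulary = build_vocabulary(recipes)
vectors = [to_vector(recipe, vocabulary) for recipe in recipes]
```

Every recipe ends up with the same length and column order, which is what a tabular learner such as a decision tree needs.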
  • 17. Training a decision tree to predict AuNR shape shows similar conclusions as literature. [Figure: decision tree whose leaves are labeled Rod, Cube, Bipyramid, Star, or None] • Decision tree shows seed capping agent type as first decision boundary for shape determination • “Citrate-capped gold seeds form penta-twinned structure, while CTAB-capped seeds are single crystalline, hence former leads to bipyramids and latter leads to rods” [1,2] [1] Liu and Guyot-Sionnest, J. Phys. Chem. B, 2005, 109 (47), 22192-22200. [2] Grzelczak et al., Chem. Soc. Rev., 2008, 37, 1783-1791.
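How a decision tree arrives at a "first decision boundary", as described on slide 17, can be sketched with a single split chosen by weighted Gini impurity. The slides do not say which splitting criterion or library was used, so Gini is an assumption here. The four rows are hand-made to mirror the quoted literature statement and are not taken from the AuNR data set; the feature names `seed_capping` and `has_AgNO3` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_first_split(rows, labels):
    """Return (feature, value, impurity) for the best 'feature == value' split."""
    best = None
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] == value]
            right = [y for row, y in zip(rows, labels) if row[feature] != value]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, value, score)
    return best


# Made-up toy rows for illustration; not the data set from the talk.
rows = [
    {"seed_capping": "CTAB", "has_AgNO3": "yes"},
    {"seed_capping": "CTAB", "has_AgNO3": "no"},
    {"seed_capping": "citrate", "has_AgNO3": "yes"},
    {"seed_capping": "citrate", "has_AgNO3": "no"},
]
labels = ["rod", "rod", "bipyramid", "bipyramid"]

feature, value, impurity = best_first_split(rows, labels)
```

On these toy rows the capping-agent feature separates the two shapes completely, so it is chosen as the root split; real literature data would be far noisier.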
  • 18. We also see some effect of AgNO3 concentration on AuNR size, but data is noisy. [Figure panels: (a) growth: HAuCl4, CTAB, AA, AgNO3; (b) growth: HAuCl4, CTAB, AA, AgNO3 with HAuCl4/CTAB < 0.01 filter; (c) growth: HAuCl4, CTAB, AA, AgNO3 + HCl] N. D. Burrows et al., Langmuir 2017, 33 (8), 1891-1907
  • 19. Overall thoughts on AuNR data set • The seq2seq method is showing good capabilities in terms of extracting complex nanorod synthesis data • We are going to start integrating this into our own pipeline to replace manual regex for relationship extraction • Using machine learning to generate hypotheses about AuNR shape and size is messy • Data sets are messy, and not particularly large • Nevertheless, it is encouraging that conclusions from the literature can be automatically found by machine learning
  • 20. Outline • Using sequence-to-sequence models for combined entity detection and relationship extraction • Analyzing synthesis of Au nanorods using literature data • Analyzing synthesis of phase-pure BiFeO3 using literature data
  • 21. Seq2Seq approach for solid-state synthesis. Initial tests of the seq2seq method on solid-state synthesis show encouraging results, but further testing is needed
  • 22. For now, we use manual data extraction to tackle the problem of BiFeO3 synthesis. 340 total synthesis recipes (from 178 articles); 57 features per recipe
  • 23. Machine learning (decision tree) predictions are in line with common knowledge
  • 24. Missing synthesis information: can it be recovered / reproduced easily? [Figure legend: Could not reproduce; Partially reproducible; Reproducible]
  • 25. Exploring unexplored portions of synthesis space. These decision trees are interpretable, but are they physical?
  • 26. Conclusions • As large language models grow larger and more capable, they are able to parse increasingly complex scientific text into structured formats • Applying NLP + ML on synthesis data sets shows that scientific heuristics can be automatically uncovered, which is promising • Nevertheless, issues remain in applying NLP to predictive synthesis: reproducibility / missing information / conflicting information; general lack of negative examples; unknown data quality • Thus, results from such techniques will likely need to be treated as initial hypotheses to be complemented by further experiments
  • 27. Acknowledgements. NLP (seq2seq): Alex Dunn, John Dagdelen, Nick Walker, Sanghoon Lee, Amalie Trewartha. AuNR analysis: Sanghoon Lee, Sam Gleason, Kevin Cruse. BiFeO3 analysis: Kevin Cruse, Viktoriia Baibakova, Maged Abdelsamie, Kootak Hong, Carolin Sutter-Fella, Gerbrand Ceder. Funding provided by: U.S. Department of Energy, Basic Energy Science, “D2S2” program; Toyota Research Institutes, Accelerated Materials Design program. Slides (already) posted to hackingmaterials.lbl.gov
  • 28. Sol-gel synthesis of BiFeO3