This document discusses using natural language processing (NLP) techniques to extract useful information from unstructured text sources in materials science literature. It describes how NLP models can be trained on large datasets of materials science publications to perform tasks like chemistry-aware search, summarizing material properties, and suggesting synthesis methods. The models are developed using techniques like word embeddings, LSTM networks, and named entity recognition. The goal is to organize materials science knowledge from text into a database called Matscholar to enable new applications of the information.
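As a hedged illustration of the word-embedding step (a minimal sketch using gensim, not the actual Matscholar training code; the tiny corpus here is a stand-in):

```python
from gensim.models import Word2Vec

# Stand-in corpus: in practice this would be millions of tokenized
# materials-science abstracts, not two toy sentences.
abstracts = [
    ["LiFePO4", "is", "a", "promising", "cathode", "material"],
    ["Bi2Te3", "exhibits", "excellent", "thermoelectric", "performance"],
]

model = Word2Vec(sentences=abstracts, vector_size=100, window=8, min_count=1, seed=0)

# Terms whose vectors lie nearest to an application keyword act as
# candidate materials/concepts for that application.
print(model.wv.most_similar("thermoelectric", topn=3))
```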
Assessing Factors Underpinning PV Degradation through Data Analysis - Anubhav Jain
The document discusses using PVPRO methods and large-scale data analysis to distinguish system and module degradation in PV systems. It involves 3 main tasks: 1) Developing an algorithm to detect off-maximum power point operation and compare it to existing tools. 2) Applying PVPRO to additional datasets to refine methods and perform degradation analysis on 25 large PV systems. 3) Connecting bill-of-materials data to degradation results from accelerated stress tests through data-driven analysis and publishing findings while anonymizing data.
Extracting and Making Use of Materials Data from Millions of Journal Articles... - Anubhav Jain
- The document discusses using natural language processing techniques to extract materials data from millions of journal articles.
- It aims to organize the world's information on materials science by using NLP models to extract useful data from unstructured text sources like research literature in an automated manner.
- The process involves collecting raw text data, developing machine learning models to extract entities and relationships, and building search interfaces to make the extracted data accessible.
Progress Towards Leveraging Natural Language Processing for Collecting Experi... - Anubhav Jain
1. The document discusses using natural language processing (NLP) algorithms to extract useful information from unstructured text sources in materials science literature to help organize the world's materials science information and enable new search and analysis capabilities.
2. It describes a project called Matscholar that applies NLP techniques like named entity recognition and relation extraction to millions of article abstracts to build a searchable database with summarized materials property and application data.
3. The approach involves collecting text sources, developing machine learning models trained on annotated examples to extract entities and relations, and integrating the extracted structured data with materials property databases to enable new search and analysis functions.
Evaluating Machine Learning Algorithms for Materials Science using the Matben... - Anubhav Jain
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
Open Source Tools for Materials Informatics - Anubhav Jain
This document discusses open source tools for materials informatics, including Matminer and Matscholar. Matminer is a library of descriptors for materials science data that can generate features for machine learning models. It includes over 60 featurizer classes and supports scikit-learn. Matscholar applies natural language processing to over 2 million materials science abstracts to extract keywords and enable improved literature searching. The document argues that open datasets like Matbench and automated tools like Automatminer could help lower barriers for developing machine learning models in materials science by making it easier to obtain training data and evaluate model performance.
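For context, matminer's featurizer API looks roughly like the following sketch (the Magpie preset and the Fe2O3 composition are just illustrative choices):

```python
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

# The "magpie" preset computes statistics of elemental properties
# (mean, range, etc.) over a composition.
featurizer = ElementProperty.from_preset("magpie")

comp = Composition("Fe2O3")
features = featurizer.featurize(comp)   # plain list of floats, scikit-learn ready
labels = featurizer.feature_labels()
print(len(features), labels[:3])
```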
The document provides an overview of materials informatics and the Materials Genome Initiative. It discusses how materials informatics uses data-driven approaches and techniques from fields like signal processing, machine learning and statistics to generate structure-property-processing linkages from materials science data and improve understanding of materials behavior. This includes extracting features from materials microstructure, using statistical analysis and data mining to discover relationships and create predictive models, and evaluating how knowledge has improved.
Natural Language Processing for Materials Design - What Can We Extract From t... - Anubhav Jain
This document discusses using natural language processing (NLP) techniques to extract and organize information from the materials science literature. It describes how NLP models can recognize named entities like materials, properties, and methods in text. Word embedding algorithms represent words as vectors to encode semantic relationships. Long short-term memory networks then classify words by context. The resulting models can automatically label millions of papers, enabling new search and predictive applications. Predictions based on word co-occurrence have inspired further experimental study of promising materials. The Matscholar team is developing comprehensive NLP tools to advance materials science research.
The Status of ML Algorithms for Structure-property Relationships Using Matb... - Anubhav Jain
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
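A minimal sketch of the Matbench submission loop, using a trivial scikit-learn baseline in place of a real model (the chosen task subset is illustrative; see the matbench docs for the authoritative API):

```python
from matbench.bench import MatbenchBenchmark
from sklearn.dummy import DummyRegressor

mb = MatbenchBenchmark(autoload=False, subset=["matbench_expt_gap"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:  # the nested cross-validation folds
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        # Trivial baseline: always predict the training mean.
        model = DummyRegressor().fit([[0]] * len(train_outputs), train_outputs)
        test_inputs = task.get_test_data(fold, include_target=False)
        task.record(fold, model.predict([[0]] * len(test_inputs)))

# Serialized results of this form are what get submitted to the leaderboard.
mb.to_file("results.json.gz")
```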
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu... - Anubhav Jain
- The document describes a computational materials design pipeline that uses theory, optimization, and natural language processing (NLP) to accelerate materials discovery.
- Key components of the pipeline include optimization algorithms like Rocketsled to find best materials solutions with fewer calculations, and NLP tools to extract and analyze knowledge from literature to predict promising new materials and benchmarks.
- The pipeline has shown speedups of 15-30x over random searches and has successfully predicted new thermoelectric materials discoveries 1-2 years before their reporting in literature.
Materials design using knowledge from millions of journal articles via natura... - Anubhav Jain
This document discusses natural language processing (NLP) techniques for materials design using information from millions of journal articles. It begins with an overview of how materials are typically discovered and optimized over decades before discussing how NLP could help address this challenge. The document then provides a high-level view of how NLP is used to extract and analyze information from millions of materials science abstracts, including data collection, tokenization, training machine learning models on labeled text, and using the models to automatically extract entities. Examples are given of how word embeddings can encode scientific concepts and relationships in ways that allow predicting promising new materials for applications like thermoelectrics. The talk concludes by discussing future directions for the NLP work.
Discovering advanced materials for energy applications by mining the scientif... - Anubhav Jain
This document discusses natural language processing (NLP) techniques for extracting materials-related information from scientific literature. It describes how Matscholar uses NLP to analyze over 4 million paper abstracts, identifying entities like materials, properties, and methods. Key steps include tokenizing text, training word embeddings, and using an LSTM neural network to recognize entities in context. Applications include searching materials by property and predicting promising new materials for applications based on word vector relationships. Future work aims to improve predictions for new compositions and automatically generate databases of materials properties from literature.
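To make the LSTM entity-recognition step concrete, here is a minimal bidirectional LSTM tagger sketched in PyTorch (an assumption-laden toy, not the Matscholar model; vocabulary size and tag count are made up):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence tagger: token ids -> per-token entity logits."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_tags=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # could be seeded from word2vec
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)  # e.g., BIO tags for MAT/PRO/...

    def forward(self, token_ids):                 # (batch, seq_len)
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        context, _ = self.lstm(embedded)          # (batch, seq_len, 2*hidden_dim)
        return self.classifier(context)           # (batch, seq_len, num_tags)

# Smoke test with random token ids.
model = BiLSTMTagger(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 7])
```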
TMS workshop on machine learning in materials science: Intro to deep learning... - BrianDeCost
This presentation is intended as a high-level introduction to deep learning and its applications in materials science. The intended audience is materials scientists and engineers.
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
1. Materials Informatics uses Python tools like RDKit for analyzing molecular structures and properties.
2. ORGAN and MolGAN are two generative models that use GANs to generate novel molecular structures based on SMILES strings, with ORGAN incorporating reinforcement learning to optimize for desired properties.
3. Tools like RDKit enable analyzing molecular fingerprints and descriptors that can be used for machine learning applications in materials informatics.
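A short sketch of the RDKit descriptor/fingerprint workflow mentioned above (ethanol is an arbitrary example molecule):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Scalar descriptors usable directly as ML features.
print(Descriptors.MolWt(mol), Descriptors.MolLogP(mol))

# Morgan (circular) fingerprint as a fixed-length bit vector.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits())
```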
Accelerating materials design through natural language processing - Anubhav Jain
This document discusses using natural language processing (NLP) to accelerate materials design. It describes how NLP techniques are being used to analyze over 4 million materials science papers to extract entities like materials, characterization methods, and properties. Word embedding algorithms represent words as vectors to capture relationships between words. NLP models are then trained on labeled text to recognize these entities. This allows automated searching of literature and predicting promising new materials for applications like thermoelectrics based on co-occurrence patterns in text. Future work includes developing structured materials databases from literature and learning embeddings to describe arbitrary materials.
Open-source tools for generating and analyzing large materials data sets - Anubhav Jain
This document discusses open-source software tools for generating and analyzing large materials data sets developed by Anubhav Jain and collaborators. It summarizes several software packages including pymatgen for materials analysis, FireWorks for scientific workflows, custodian for error recovery in calculations, and matminer for data mining. Applications of the tools include generating the Materials Project database containing properties of over 65,000 materials compounds calculated using high-performance computing resources. The document emphasizes the importance of open-source collaborative software development and automation to accelerate materials discovery.
Materials discovery through theory, computation, and machine learning - Anubhav Jain
The document discusses using theory, computation, and machine learning to discover new materials. It summarizes that density functional theory (DFT) can model material properties from first principles, and how DFT calculations have been automated and run on supercomputers to enable high-throughput screening of materials. Examples are given of computations predicting new materials that were later experimentally confirmed, like sidorenkite cathodes for sodium ion batteries. Related projects are outlined like the open-source Materials Project database of DFT data on over 85,000 materials and software libraries to support high-throughput computation and materials science. Text mining of scientific literature is also discussed to help predict new materials in advance.
This document summarizes work on developing clear sky detection methods and photovoltaic data analytics tools. It describes collaborating with NREL and kWh Analytics to build a robust clear sky detection method for the RdTools software. The goal is to automatically learn the best parameters for the PVLib clear sky model by comparing its labels to known clear sky labels from satellite data. It also discusses developing open-source software to analyze string-level I-V curves collected by Sandia National Labs to detect mismatching and extract IV parameters. The work aims to help researchers by providing data management, analytics and predictive modeling through a DuraMat Data Hub.
Smart Metrics for High Performance Material Design - aimsnist
This document discusses smart metrics for high-performance material design using density functional theory (DFT), classical force fields (FF), and machine learning (ML). It provides an overview of the JARVIS database and tools containing over 35,000 materials and classical properties calculated using DFT, FF, and ML methods. Metrics discussed include formation energy, exfoliation energy, elastic constants, surface energy, vacancy energy, grain boundary energy, bandgaps, and other electronic and optical properties important for applications like solar cells. ML models are developed to predict these properties with mean absolute errors within chemical accuracy compared to DFT benchmarks.
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional... - aimsnist
This document discusses how high-throughput experimentation (HTE) and machine learning (ML) can accelerate materials discovery for functional metallic glasses (MGs). It describes a round robin experiment between NIST and NREL to synthesize and characterize composition spread samples to test data sharing standards. General trends predicted by ML models often correlate within a given synthesis method but systematic differences can occur between methods. While ML is not a replacement for physics, the combination of HTE and ML can identify promising new materials faster than traditional experimentation alone. Autonomous research platforms may enable an even greater acceleration of the materials discovery process.
The document discusses the Materials Genome Initiative (MGI) and the High-Throughput Experimental Materials Collaboratory (HTE-MC). It describes NIST's role in supporting MGI through developing a materials innovation infrastructure. It outlines the vision for HTE-MC, which would integrate high-throughput synthesis and characterization tools across multiple institutions through a shared network and data management platform. This would provide broader access to experimental facilities and materials data to support accelerated materials discovery. A workshop was held in 2018 to discuss establishing the HTE-MC concept and defining its technical, operational and business models.
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... - PyData
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Graph Centric Analysis of Road Network Patterns for CBD’s of Metropolitan Cit... - Punit Sharnagat
OSMnx is a Python package to retrieve, model, analyze, and visualize street networks from OpenStreetMap.
OpenStreetMap (OSM) is a collaborative mapping project that provides a free and publicly editable map of the world.
OpenStreetMap provides a valuable crowd-sourced database of raw geospatial data for constructing models of urban street networks for scientific analysis.
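A minimal sketch of this workflow with OSMnx (the place name is an arbitrary example, and API details vary slightly across osmnx versions):

```python
import networkx as nx
import osmnx as ox

# Download and model the drivable street network for a named place.
G = ox.graph_from_place("Nagpur, India", network_type="drive")

# Basic graph-centric statistics: node/edge counts, average circuity, etc.
stats = ox.basic_stats(G)
print(stats["n"], stats["m"], stats["circuity_avg"])

# Approximate betweenness centrality via sampled shortest paths
# (exact centrality on a full city network can be slow).
centrality = nx.betweenness_centrality(nx.DiGraph(G), k=100, seed=0)
```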
A Machine Learning Framework for Materials Knowledge Systems - aimsnist
- The document describes a machine learning framework for developing artificial intelligence-based materials knowledge systems (MKS) to support accelerated materials discovery and development.
- The MKS would have main functions of diagnosing materials problems, predicting materials behaviors, and recommending materials selections or process adjustments.
- It would utilize a Bayesian statistical approach to curate process-structure-property linkages for all materials classes and length scales, accounting for uncertainty in the knowledge, and allow continuous updates from new information sources.
A Framework and Infrastructure for Uncertainty Quantification and Management ... - aimsnist
QuesTek Innovations presented a framework to incorporate materials genome initiatives (MGI) and artificial intelligence (AI) into their integrated computational materials engineering (ICME) practice. They discussed three key aspects: (1) MaGICMaT, a materials genome and ICME toolkit to manage data and property-structure-performance linkages, (2) an uncertainty quantification framework for CALPHAD modeling, and (3) a cloud-based platform to enable rapid development and deployment of ICME models with an HPC backend. The presentation provided details on their approaches for each aspect and highlighted opportunities to further enhance ICME with MGI and AI.
Going Smart and Deep on Materials at ALCF - Ian Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
2D/3D Materials screening and genetic algorithm with ML model - aimsnist
JARVIS-ML provides concise summaries of materials properties using machine learning models trained on the extensive data in the JARVIS repositories. It has developed regression and classification models that can predict formation energies, bandgaps, and other material properties in seconds, much faster than traditional DFT calculations. The models use gradient boosting decision trees and feature importance analysis to provide explanations. JARVIS-ML is available as a public web app and API for rapid screening and discovery of new materials.
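As a rough analogue of such property models (scikit-learn's gradient boosting on random stand-in data, not the actual JARVIS-ML code or descriptors):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X (e.g., composition/structure descriptors)
# and target y (e.g., DFT formation energies); random stand-ins here.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
# Feature importances offer a rough explanation of the model.
print("Most important feature index:", int(np.argmax(model.feature_importances_)))
```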
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications - Anubhav Jain
This document summarizes several projects from Anubhav Jain at Lawrence Berkeley National Laboratory related to using artificial intelligence and data mining for materials science. It discusses (1) developing interpretable descriptors of crystal structure based on local environments, (2) the matminer toolkit for connecting materials data to machine learning algorithms, and (3) the atomate/Rocketsled software for running high-throughput density functional theory calculations on supercomputers. It also briefly outlines a project to develop a text mining database for materials science literature.
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications - aimsnist
This document summarizes four projects from Lawrence Berkeley National Laboratory related to using artificial intelligence and data mining for materials science:
1) Interpretable descriptors of crystal structure that describe local environments as fingerprints to distinguish structures.
2) The matminer toolkit which connects materials data to machine learning algorithms and data visualization.
3) The atomate and Rocketsled software for running high-throughput density functional theory calculations and building a computational optimizer.
4) A text mining approach to label the content of materials science abstracts to build a revised materials search engine and identify related materials.
Classification of News and Research Articles Using Text Pattern Mining - IOSR Journals
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
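For flavor, the frequent-pattern step can be sketched with mlxtend's apriori over a one-hot document-term matrix (toy terms and support threshold; the paper's closed-pattern filtering and taxonomy deployment are omitted):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Hypothetical one-hot document-term matrix: rows are documents,
# columns indicate whether a (stemmed) term occurs.
df = pd.DataFrame({
    "polit": [1, 1, 0, 1], "elect": [1, 1, 0, 0],
    "genom": [0, 0, 1, 0], "market": [0, 1, 1, 1],
}).astype(bool)

# Termsets appearing in at least half of the documents.
frequent = apriori(df, min_support=0.5, use_colnames=True)
print(frequent)
```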
This document proposes using Word2Vec and decision trees to extract keywords from textual documents and classify the documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach involves preprocessing text, representing words as vectors with Word2Vec, calculating frequently occurring keywords for each category, and using decision trees to classify documents based on keyword similarity. Experiments using different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
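A compact sketch of the proposed pipeline under toy assumptions (four hand-made documents; mean word vectors as document features in place of the paper's exact keyword-similarity scheme):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical corpus: tokenized documents with class labels.
docs = [["battery", "cathode", "capacity"],
        ["thermoelectric", "seebeck", "figure", "merit"],
        ["anode", "lithium", "battery"],
        ["seebeck", "thermal", "conductivity"]]
labels = [0, 1, 0, 1]

w2v = Word2Vec(sentences=docs, vector_size=32, min_count=1, seed=0)

def doc_vector(tokens):
    # Represent a document as the mean of its word vectors.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X = np.stack([doc_vector(d) for d in docs])
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
print(clf.predict(X))
```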
Using a keyword extraction pipeline to understand concepts in future work sec... - Kai Li
This document describes a study that uses natural language processing and text mining techniques to identify future work statements in scientific papers and extract keywords from those statements. The researchers developed a multi-step pipeline to first identify the future work section, then select future work sentences within that section. They used rules and algorithms to identify sentences discussing future work. Keywords were then extracted from the selected sentences using the RAKE algorithm. An analysis found that 31.4% of papers contained future work statements, with medical science papers having the highest overlap between future work and title-abstract keywords. The researchers hope this work is a first step toward predicting future research topics.
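The RAKE keyword-extraction step might look like the following with the rake_nltk package (the sample sentence is invented):

```python
import nltk
from rake_nltk import Rake

# rake_nltk relies on NLTK's stopword list and punkt tokenizer.
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

sentence = ("In future work, we plan to extend the pipeline to full-text articles "
            "and evaluate keyword overlap across disciplines.")

r = Rake()
r.extract_keywords_from_text(sentence)
print(r.get_ranked_phrases()[:5])  # highest-scoring candidate keywords first
```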
Discovering new functional materials for clean energy and beyond using high-t... - Anubhav Jain
- The research group develops computational methods and machine learning models to design new functional materials using high-throughput computing. This includes developing databases of materials properties, benchmarking machine learning algorithms, and applying natural language processing to materials design. Recent work also involves automating materials synthesis and characterization. The group maintains several open-source software packages that power their research.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Automatically Generating Wikipedia Articles: A Structure-Aware Approach - George Ang
The document describes an approach for automatically generating Wikipedia-style articles by using the structure of existing human-authored articles as templates. It involves inducing templates by analyzing section headings across documents, retrieving relevant excerpts from the internet for each template topic, and jointly training extractors to select excerpts that optimize both local relevance and global coherence across the entire article. The results confirm the benefits of incorporating structural information into the content selection process.
Software tools for high-throughput materials data generation and data mining - Anubhav Jain
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
Post 1: What is text analytics? How does it differ from text mini.docx - stilliegeorgiana
Post 1:
What is text analytics? How does it differ from text mining?
Text analytics is the application of statistical and machine learning techniques to predict, prescribe, or infer information from text-mined data; text mining is a tool that helps clean the data up.
Differences between Text Mining and Text Analytics:
• Text Mining and Text Analytics solve the same problems, but use different techniques and are complementary ways to automatically extract meaning from text.
• Text Analytics was developed within the field of computational linguistics. It encodes human understanding into a series of linguistic rules. Rules generated by humans are high in precision, but they do not automatically adapt and are usually fragile when tried in new situations.
• Text mining is a newer discipline arising out of the fields of statistics, data mining, and machine learning. Its strength is the ability to inductively create models from collections of historical data. Because statistical models are learned from training data, they are adaptive and can identify “unknown unknowns”, leading to better recall. Still, they can be prone to missing something that would seem obvious to a human.
• Text analytics and text mining approaches have essentially equivalent performance. Text analytics requires an expert linguist to produce complex rule sets, whereas text mining requires the analyst to hand-label cases with outcomes or classes to create training data.
• Due to their different perspectives and strengths, combining text analytics with text mining often leads to better performance than either approach alone.
2. What technologies were used in building Watson (both hardware and software)?
Watson is an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. It is an artificially intelligent system developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci, and it was named after IBM's first CEO, industrialist Thomas J. Watson. The system was specifically developed to answer questions on the quiz show Jeopardy! In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings.
Watson received the first prize of $1 million. The goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. IBM undertook a challenge to build a computer system that could compete at the human-champion level in real time on the American TV quiz show Jeopardy! The extent of the challenge in ...
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
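The part-of-speech tagging step can be sketched with NLTK; treating nouns and adjectives as concept candidates is a simplification of the paper's approach:

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The proposed model extracts semantic concepts from text using part-of-speech tagging."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Keep noun and adjective tokens as rough "concept" candidates.
concepts = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]
print(concepts)
```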
This document discusses developing an ontology-based semantic web application for the biological domain. It introduces the need for semantic technologies to help machines better understand and combine biological information from different sources. The document outlines the methodology, which involves defining concepts, properties, and relations in the biological domain to create an ontology. It also discusses implementing a semantic web application using the Jena framework to retrieve and manipulate biological data modeled with ontologies and RDF. The goal is to build a semantic search framework to improve information retrieval for biologists.
The document discusses two NSF-funded research projects on intelligence and security informatics:
1. A project to filter and monitor message streams to detect "new events" and changes in topics or activity levels. It describes the technical challenges and components of automatic message processing.
2. A project called HITIQA to develop high-quality interactive question answering. It describes the team members and key research issues like question semantics, human-computer dialogue, and information quality metrics.
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F... - ijaia
Regression models and their statistical analyses are among the most important tools used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. The true regression is unknown, and specific methods are created and used strictly pertaining to the problem. For pioneering work on procedures for fitting functions, we refer to the methods of least absolute deviations, least squares deviations, and minimax absolute deviations. Today's widely celebrated procedure of the method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, least-squares-based models in practice may fail to provide optimal results in non-Gaussian situations, especially when the errors follow distributions with fat tails. This paper explores an unorthodox method of estimating linear regression coefficients by minimising the GMSE (geometric mean of squared errors). Though GMSE is commonly used to compare models, it is rarely used to obtain the coefficients; such a method is tedious to handle due to the large number of roots obtained by minimisation of the loss function, and this paper offers a way to tackle that problem. Application is illustrated with the 'Advertising' dataset from ISLR, and the obtained results are compared with the results of the method of least squares for a single-index linear regression model.
This document discusses using knowledge graphs to promote cognitive computing. It begins with an introduction to the presenter and their background and research interests. It then outlines the advantages of using knowledge graphs for question answering, machine learning, natural language processing, and information retrieval. Key applications of knowledge graphs at Google, IBM Watson, and on smart phones are also mentioned. The document dives deeper into two of the presenter's research projects on rewriting natural language queries on knowledge graphs and extracting triples from news headlines on Twitter.
Topic detecton by clustering and text miningIRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
The ability to construct domain specific knowledge graphs (KG) and perform question-answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge that is beyond the reach for most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge graph driven analytics for arbitrary application domains. The uniqueness of our system lies A) in its combination of curated KGs along with knowledge extracted from unstructured text, B) support for advanced trending and explanatory questions on a dynamic KG, and C) the ability to answer queries where the answer is embedded across multiple data sources.
Interlinking educational data to Web of Data (Thesis presentation)Enayat Rajabi
This is a thesis presentation about interlinking educational data to Web of Data. I explain how I used the Linked Data approach to expose and interlink educational data to the Linked Open Data cloud
Similar to Applications of Natural Language Processing to Materials Design (20)
Discovering advanced materials for energy applications: theory, high-throughp...Anubhav Jain
Anubhav Jain presented on using density functional theory and high-throughput calculations to design advanced materials for energy applications. Key points included:
1) Density functional theory can be used to model materials physics and properties by approximating many-body quantum mechanics.
2) Thermoelectric materials were discussed as an example application, where the goal is to optimize the figure of merit which depends on conductivity, Seebeck coefficient, and thermal conductivity.
3) High-throughput calculations were performed on over 50,000 materials to efficiently screen for promising thermoelectric candidates like TmAgTe2, though experimental validation is still needed due to approximations.
Applications of Large Language Models in Materials Discovery and DesignAnubhav Jain
The document discusses applications of large language models (LLMs) in materials discovery and design. It describes how LLMs have improved natural language processing tasks related to materials science literature by requiring less custom model training and fine-tuning. As an example, the document discusses how LLMs were used to extract doping information from scientific papers and create a database of over 200,000 doped material compositions. The document suggests LLMs will continue enhancing materials databases and interfaces by integrating search and question-answering capabilities.
An AI-driven closed-loop facility for materials synthesisAnubhav Jain
The document summarizes an AI-driven closed-loop facility for materials synthesis using robotics, machine learning, and optimization algorithms. The facility aims to close the loop on rapid synthesis of new materials by using automated systems to synthesize compounds predicted by algorithms, characterize the results, and feed the data back to improve predictions. In less than 3 weeks, the facility synthesized 41 new chemical compositions out of 58 computationally predicted stable compounds. The facility is now collaborating with other groups to synthesize more complex materials, with the goal of accelerating the discovery of new materials through fully automated closed-loop synthesis and characterization.
Best practices for DuraMat software disseminationAnubhav Jain
The document provides best practices for disseminating software produced by DuraMat-funded projects. It discusses establishing standards and guidance for software produced by DuraMat to save time and effort in development and dissemination, and to provide consistency. The document outlines three levels of dissemination depending on the software's purpose and maturity. Level 1 is for one-off scripts, level 2 is for software used over a project's lifetime, and level 3 is for ongoing, community-maintained projects. Recommendations include documentation, licensing, and use of services like GitHub, Zenodo, and continuous integration tools.
Best practices for DuraMat software disseminationAnubhav Jain
The document provides best practices for disseminating software produced by DuraMat-funded projects. It discusses three levels of dissemination depending on the software's purpose and maturity. For all levels, it recommends documenting code, adding licenses, and hosting on GitHub. For more mature software, it suggests continuous integration, documentation, releases on Zenodo, and submitting to journals. The goal is to effectively share software, establish consistency, and give proper credit for products.
Available methods for predicting materials synthesizability using computation...Anubhav Jain
This document summarizes a talk about computational and machine learning approaches for predicting materials synthesizability. It discusses how machine learning algorithms are generating millions of potential stable compound predictions, far more than can be experimentally tested. It also examines ways to better prioritize candidate materials for synthesis, such as by assessing their likelihood of dynamical stability and calculating their finite-temperature Gibbs free energies more efficiently using machine-learned interatomic force constants. Finally, it describes efforts to integrate literature knowledge using natural language processing to further guide experimental exploration and reduce the number of experiments needed to synthesize predicted materials.
Efficient methods for accurately calculating thermoelectric properties – elec...Anubhav Jain
1) AMSET is a new method for efficiently calculating electronic transport properties from first principles that provides accurate results comparable to more computationally expensive methods.
2) HiPhive uses a data fitting approach to extract interatomic force constants from a small number of non-systematic displacement calculations, avoiding the need for many systematic calculations required by traditional methods to obtain phonon and thermal properties.
3) These new efficient methods enable high-throughput screening of thermoelectric materials by providing accurate transport properties while being computationally feasible for large numbers of materials.
Natural Language Processing for Data Extraction and Synthesizability Predicti...Anubhav Jain
This document discusses using natural language processing and machine learning techniques to extract and analyze synthesis recipes from materials science literature. It presents work using sequence-to-sequence models to extract entities and relationships for the synthesis of gold nanorods and bismuth ferrite from research papers. Decision trees trained on the extracted data are able to reproduce conclusions about the effects of synthesis parameters from literature. However, applying these techniques to predictive synthesis still faces challenges regarding reproducibility, missing information, and lack of negative examples in literature datasets.
This document summarizes a presentation on developing an electrochemical system for selenium removal from water. The project aims to apply machine learning and automated synthesis techniques to accelerate materials development timelines. Initial calculations have reproduced experimental trends for nitrate reduction and screened candidate materials from databases. Procedures have also been established for electrode preparation, testing, and using robots to synthesize predicted candidates. While still early, progress has been made on computational screening, mitigating competing reactions, and testing baseline cathode materials for selenium removal performance and energy efficiency. The remainder of the first project year will focus on refining methods before demonstrating a commercially viable selenium removal system in years two and three.
Accelerating New Materials Design with Supercomputing and Machine LearningAnubhav Jain
Anubhav Jain gave a presentation summarizing his career in materials science research from high school internships through his current role leading the Materials Project. During his PhD and Alvarez fellowship, he developed high-throughput workflows and open source software like FireWorks to automate materials calculations. This allowed him to launch the Materials Project database and scale it up over time with a growing team. The Materials Project has now screened over 180,000 materials and led to successful experimental validations of computational predictions.
DuraMat CO1 Central Data Resource: How it started, how it’s going …Anubhav Jain
The document summarizes several projects developed as part of the DuraMat CO1 Central Data Resource initiative to analyze photovoltaic performance and degradation data. A secure data portal was developed that currently hosts data from 239 users and 271 datasets. Software tools were also created, such as pvAnalytics for data cleaning and filtering, pvOps for operational and maintenance data analysis, and pv-vision for electroluminescence image analysis. These open source tools are publicly available and have helped advance the analysis of PV degradation through access to larger datasets. Overall, the projects have established a foundation for ongoing collaborative research on PV performance and lifetime under DuraMat 2.0.
The Materials Project is a multidisciplinary project with over 250,000 registered users that accelerates materials design. A small team generates data on specific materials using advanced computations and provides organization and dissemination of the data. Over 260,000 registered users can access the data for research and contribute their own experimental or theoretical data. The project continues to deliver new calculated data and works on improving accuracy, modeling magnetic orderings, vibrational properties, and non-ordered compounds. The Materials Project allows users to contribute their own data sets and integrate them with the core data through a new MPContribs capability.
Evaluating Chemical Composition and Crystal Structure Representations using t...Anubhav Jain
This document discusses the Matbench testing protocol for evaluating machine learning models for materials property prediction. Matbench contains 13 standardized tasks to compare different models. Several existing models have been tested, including those using composition features and graph neural networks using structural representations. While some tasks have seen significant improvement, others have seen little progress. The document suggests ways to improve Matbench, such as adding new materials classes, properties, and evaluation metrics to further benchmark progress and encourage development of better models.
Perspectives on chemical composition and crystal structure representations fr...Anubhav Jain
The document discusses the Matbench testing protocol for evaluating machine learning models for materials property prediction. It summarizes the 13 different machine learning tasks in Matbench and the various models that have been tested, including Magpie, Automatminer, MODNet, CGCNN, ALIGNN, and CRABNet. The document outlines ways Matbench could be further improved, such as including a greater diversity of tasks, changing the data splitting methodology, and incorporating active learning into the scoring. The overall goal of Matbench is to provide a standard way to evaluate new machine learning algorithms for materials property prediction and measure progress in the field.
The Materials Project: Applications to energy storage and functional materia...Anubhav Jain
The Materials Project is a free online database containing calculated properties of over 150,000 materials designed to help researchers discover new functional materials. It has been used extensively in academia and industry to identify novel battery electrode materials and solid electrolytes through high-throughput computational screening. Researchers are now using the Materials Project dataset to train machine learning models to predict battery properties and screen for new materials. Related efforts aim to bridge the gap between computational design and physical synthesis by developing an automated synthesis lab to experimentally validate candidate materials identified from the database.
The Materials Project: A Community Data Resource for Accelerating New Materia...Anubhav Jain
The Materials Project is a free online database containing calculated properties of over 150,000 materials designed to accelerate materials design. It contains electronic, thermal, mechanical, magnetic, and other properties powered by hundreds of millions of CPU hours. Users can access core data, tools for analysis, and open-source simulation code. The Materials Project has been used to computationally design new materials that were then experimentally confirmed, such as transparent conductors and thermoelectrics. The project seeks to engage the community through contributions of experimental data, benchmarking of machine learning methods, and disseminating discoveries.
Machine Learning Platform for Catalyst DesignAnubhav Jain
This project aims to develop new electrocatalyst materials for nitrate removal from water using machine learning and computational screening. The team performed calculations on over 1,000 potential compositions to identify promising catalysts with low costs. Experimental synthesis of candidates such as ZnNi and Zn3Co was attempted but it is unclear if the desired alloys were produced. The screening approach is now being applied to identify materials for selenium removal. If successful, low-cost catalysts could be developed to reduce the costs of electrocatalytic water treatment.
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxRASHMI M G
Abnormal or anomalous secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes
Applications of Natural Language Processing to Materials Design
1. Applications of Natural Language Processing to
Materials Design
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
UCB MSE Seminar, March 31 2022
Slides (already) posted to hackingmaterials.lbl.gov
2. 2
Can ML help us work through our backlog of information we
need to assimilate from text sources?
Flood of information
Important things get missed
Useful data, but unstructured
NLP algorithms
3. • Small things – search is not chemistry-aware
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7
(X=S, Se, Te)”.
• Medium things – it is difficult to ask questions or compile
summaries, e.g.:
– What is the band gap of “Si”?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
• Big things – one can’t make predictive use of information in text
– Based on all that is known, what materials should be studied as
thermoelectrics?
– Given a synthesis target of a novel compound (composition + structure),
what kind of synthesis protocol should be followed to realize the compound?
3
Some ways in which existing tools for
searching the literature fall short
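As an illustration of how chemistry-aware matching can work, formulas can be normalized to a canonical composition before indexing. A minimal sketch using pymatgen (an illustration of the idea, not Matscholar's actual search code):

from pymatgen.core import Composition

# Different orderings and fractional formulas reduce to the same composition
print(Composition("TiNiSn") == Composition("NiTiSn"))   # True
print(Composition("Zn0.5O0.5").reduced_formula)         # "ZnO"

# Indexing documents by reduced formula makes "TiNiSn" and "NiTiSn"
# hit the same entries
index_key = Composition("NiTiSn").reduced_formula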
4. The types of features we want to enable
4
(Schematic of desired features:)
• Chemistry-aware search (same input, same results): "Zinc oxide", "ZnO", "OZn", and "Zn0.5O0.5" all map to the same material
• Summary data per material: physical properties, synthesis information, known applications
• Category queries with links to computational databases: "ferroelectrics" returns all known compositions (PbTiO3, BaTiO3, etc.)
• Synthesis planning for a query like "new thermoelectrics": for a known Composition A, a summary of all previous syntheses; for an unknown Composition B, a suggested synthesis protocol
5. What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science
• It is an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
6. Today, this is usually done manually or
(recently) semi-automatically with custom rules
6
(Figure: examples of data extracted manually vs. semi-automatically; the semi-automatic extraction is largely rule-based, not example-based (ML))
7. With Matscholar, we are engaged in two primary efforts
1. Collect raw information from the research
literature to serve as a source for text mining
2. Develop machine learning models that can be
applied to text sources (like the research
literature) to extract useful information
7
8. One of our main machine learning projects concerns
named entity recognition, or automatically labeling text
8
This allows for search
and is crucial to
downstream tasks
9. 9
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
10. 10
Data collection is a multi-step process
Currently, ~4 million
entries (article abstracts)
have been parsed.
Separately, a full-text
database of comparable
size is being compiled via
publisher negotiation
(Berkeley – Ceder group)
11. 11
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
12. • First split the text into sentences
– Seems simple, but remember edge cases: "et al." or
"etc." do not necessarily signify the end of a sentence
despite the period
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Historically done with ChemDataExtractor* with
some custom improvements
– We are moving towards a fully custom tokenizer
12
Step 2 - tokenization
*http://chemdataextractor.org
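To make the edge cases concrete, here is a toy sentence splitter that refuses to break after common abbreviations. The actual tokenizer (and ChemDataExtractor) handles many more cases; treat this only as an illustration:

import re

ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.", "Fig.", "vs.")

def split_sentences(text):
    # Split on a period followed by whitespace, unless the period
    # belongs to a known abbreviation
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        end = match.start() + 1
        if any(text[:end].endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Weston et al. studied ZnO. The band gap is 3.3 eV."))
# ['Weston et al. studied ZnO.', 'The band gap is 3.3 eV.']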
13. 13
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
14. • Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
14
Step 3 – hand label abstracts
15. 15
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
16. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word,
learned by trying to
predict context words
around the target
16
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
17. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word,
learned by trying to
predict context words
around the target
17
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
18. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
18
Word embeddings trained on "normal" text learn
relationships between words
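The analogy can be reproduced with off-the-shelf tools. A sketch using gensim, where pretrained general-English GloVe vectors stand in for a skip-gram word2vec model trained on materials abstracts:

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # general-English word vectors

# "king" - "man" + "woman" -> nearest word is "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Training a domain-specific model follows the same API, e.g. (hypothetical
# variable `tokenized_abstracts`, a list of token lists):
# from gensim.models import Word2Vec
# model = Word2Vec(tokenized_abstracts, vector_size=200, sg=1, window=8)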
19. 19
For scientific text, it learns scientific concepts as well
(Figure: crystal structures of the elements)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
When we train
word2vec on inorganic
materials science
abstracts, we get
representations in-line
with chemical
knowledge
20. 20
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
21. 21
Word embeddings also have the periodic table encoded in them
with no prior knowledge
(Figure: the "word embedding" periodic table)
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
22. 22
Side note: the learned element embeddings from text
mining are now used in various state-of-the-art ML models
(Figure: model comparison contrasting "uses mat2vec embeddings" with "uses 1-hot encoded embeddings")
Currently, the two best-performing ML
models for predicting various materials
properties from a chemical
composition make use of mat2vec
embeddings!
"CrabNet"
https://www.nature.com/articles/s41524-021-00545-1
"Roost"
https://www.nature.com/articles/s41467-020-19964-7
23. 23
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
24. • If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
24
Step 4b: How do we train a model to recognize context?
25. 25
Step 4b. An LSTM neural net classifies words by reading
word sequences
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
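A minimal sketch of such a sequence labeler in Keras, assuming padded integer token ids with one tag id per token. The published model differs in its details (e.g. it builds on the word2vec features described above), and the sizes here are illustrative; this only shows the core BiLSTM-tagger idea:

from tensorflow.keras import layers, models

VOCAB, EMB_DIM, NUM_TAGS = 50_000, 200, 15   # illustrative sizes

model = models.Sequential([
    layers.Embedding(VOCAB, EMB_DIM),                              # word vectors (could be word2vec-initialized)
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # read context in both directions
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),  # one label per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# X: (n_sentences, seq_len) padded token ids; y: (n_sentences, seq_len) tag ids
# model.fit(X, y, epochs=5)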
26. 26
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
27. 27
Step 5. Let the model label things for you!
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• F1 scores of ~0.9; the F1 score for inorganic
materials extraction is >0.9.
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
29. 29
We are also integrating Matscholar tools with the
Materials Project database
www.materialsproject.org is a free database of computed
materials properties with over 200K registered users
30. 30
Adding generic search capabilities to MP database
Currently, you need to type a very
strict search format into the MP search
bar – either a list of elements or
specific chemical formulas
You can't search "ferroelectric", for
example – just "BaTiO3"
31. 31
Prototype integration with Materials Project
is already underway
* Working out some kinks that lead to LiCoO2, LiFePO4, etc. not being sorted correctly
32. The types of features we want to enable
32
(Recap of the feature schematic from slide 4: chemistry-aware search, summary data, links to computational databases, and known/suggested synthesis information.)
33. • The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!
33
Limitations (it is not perfect)
34. 34
Could these techniques also be used to predict which
materials we might want to screen for an application?
(Figure: the backlog of papers to read "someday", feeding into NLP algorithms)
35. • Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
35
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
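Concretely, the ranking reduces to dot products over unit-normalized word vectors. In this sketch, `embeddings` and `candidate_compositions` are hypothetical stand-ins for a trained mat2vec-style model and a list of composition tokens:

import numpy as np

def rank_by_similarity(embeddings, candidates, query="thermoelectric", topn=10):
    # embeddings: dict word -> unit-normalized vector, so dot product = cosine similarity
    q = embeddings[query]
    scores = {c: float(np.dot(embeddings[c], q)) for c in candidates if c in embeddings}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# rank_by_similarity(embeddings, candidate_compositions) surfaces known
# thermoelectrics and, more interestingly, never-studied high-scoring compositions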
36. 36
Try ”going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
37. – For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
what materials are the
most promising
thermoelectrics using
data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years 37
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
38. 38
We also published a list of potential new thermoelectrics
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
It is one thing to
retroactively test, but
perhaps another to see
how things go after
publication
39. 39
Overall: ~33% of predictions were studied as
thermoelectrics within 3 years
(Chart legend: "investigated as thermoelectrics (independently of our study)" and "investigated by our own collaborators (as a result of our study)")
• About 1/3 of predicted compounds have been
studied within 3 years – better than we expect
• However, almost all studies were computational
explorations of thermoelectricity / first principles
calculations and not experiments
• 3 compounds had zT measured experimentally:
• Li3Sb reached a peak zT ~ 0.3
• Cu7Te5 reached a peak zT ~ 0.14
• CsGeI3 (after further doping) reached a peak
zT ~ 0.12
• Overall – the forward prediction of materials that are
likely to be studied as thermoelectrics seems to
mostly work
• However, they are not particularly good
thermoelectrics.
40. 40
How is this working?
“Context
words” link
together
information
from different
sources
41. The types of features we want to enable
41
(Recap of the feature schematic from slide 4: chemistry-aware search, summary data, links to computational databases, and known/suggested synthesis information.)
43. 43
Improving the accuracy of the model:
training a BERT-based model
The BERT model is more advanced than word2vec and takes context into account better.
Performance on all tasks is improved; we are currently investigating other models that may
allow even easier annotation and better performance.
Walker, Nicholas, et al. "The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science." Available at SSRN 3950755 (2021).
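With the transformers library, the switch mostly amounts to fine-tuning a token-classification head on the same annotated abstracts. A sketch with a generic checkpoint as a placeholder (a materials-domain BERT would be substituted, and the label count is illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "bert-base-cased"   # placeholder; a domain-specific checkpoint would go here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=15)  # 15 is illustrative

inputs = tok("We studied Sn-doped ZnO thin films.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits     # (1, seq_len, num_labels); head is untrained here
pred_ids = logits.argmax(dim=-1)        # per-token label ids (meaningful after fine-tuning)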
44. • For some tasks/domains, extracting entities is
sufficient
• For others, we need to relate them! NER does not
tell us enough.
44
Improving the capabilities of extraction:
relating entities to one another for complex information
(Figure: NER alone yields unlinked entity lists – Dopants: transition metals, Sm, Sn; Base materials: ZnO, ZnS; Dopant quantities: 5 at.% – with unresolved links marked "?", e.g. "what was doped with Sn?")
Doping of transition metals into ZnS and ZnO nanoparticles . . .
The ZnO:Sm system was formed at 5 at.% . . .
The ZnS sample was also doped with Sn . . .
45. • Our goal is to extract structured graphs of entities
rather than just the entities themselves
• Structured acyclic entity graphs give complete
information for extraction and analysis
45
Doping of transition metals into ZnS and ZnO nanoparticles . . .
The ZnO:Sm system was formed at 5 at.% . . .
The ZnS sample was also doped with Sn . . .
(Entity graph: ZnO – "was doped with" → Sm – "to the amount of" → 5 at.%; ZnS – "was doped with" → Sn; both hosts linked to "transition metals")
By relating entities, we get much more
powerful and useful information extraction
46. • Earlier, dependency extraction was done using grammar rules (e.g.
dependency trees) but it was not particularly successful
• We have been experimenting with large seq2seq transformer models
• These can take in an unstructured text sequence and output a structured
text sequence (e.g., OpenAI Codex that solves programming tasks)
• Can be trained with few (<50) examples due to few-shot capability
46
Utilizing large seq2seq models for ERM
Transition metal doping is an effective tool for controlling optical
absorption in ZnS and hence the number of photons absorbed by
photovoltaic devices. By using first principle density functional
calculations, we compute the change in number of photons absorbed
upon doping with a selected transition metal and found that Ni
offers the best chance to improve the performance. This is
attributed to the formation of defect states in the band gap of the
host ZnS which give rise to additional dipole-allowed optical
transition pathways between the conduction and valence band.
Analysis of the defect level in the band gap shows that TM dopants
do not pin Fermi levels in ZnS and hence the host can be made n- or
p- type with other suitable dopants. The measured optical spectra
from the doped solution processed ZnS nanocrystal supports our
theoretical finding that Ni doping enhances optical absorption the
most compared to Co and Mn doping.
(Pipeline: raw scientific text (input sequence) → seq2seq model, trained on intermediate representations → output sequence → deterministic decoding → entity relationships)
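The prompting pattern looks roughly like this; t5-small is only a small runnable stand-in for the much larger few-shot models the slide refers to, and will not produce reliable extractions without fine-tuning:

from transformers import pipeline

extractor = pipeline("text2text-generation", model="t5-small")  # stand-in model

# One worked example in the prompt ("few-shot"), then the new sentence
prompt = (
    "Extract doping relations as 'host | dopant'.\n"
    "Text: The ZnO:Sm system was formed at 5 at.%.\n"
    "Relations: ZnO | Sm\n"
    "Text: The ZnS sample was also doped with Sn.\n"
    "Relations:"
)
print(extractor(prompt, max_new_tokens=32)[0]["generated_text"])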
47. • Previous NER experiments can be extended with
ERM to include much more information
47
Applying ERM to Dopant/Host extraction
CaCu3Ti4-xCoxO12 is a doped result with
descriptor ceramic and phase cubic from base
material CaCu3Ti4O12 (AKA calcium copper
titanate) and dopant Co + 2 (AKA cobalt).
{
  "basemats": {
    0: {
      "aliases": ["CaCu3Ti4O12", "calcium copper titanate"],
      "descriptor": null,
      ...}},
  "dopants": {
    0: {
      "aliases": ["Co+2", "cobalt"],
      ...}},
  "results": {
    0: {
      "aliases": ["CaCu3Ti$_{bf 4-emph{x}}$Co$_{bfemph{x}}$O12"],
      "linked_basemats": [0],
      "linked_dopants": [0],
      "descriptors": ["ceramics"],
      ...}}
}
(Seq2seq model: unstructured text → structured intermediate sentence; a manual parser then converts it to JSON.)
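The "manual parser" step can be as simple as pattern-matching the intermediate sentence into JSON. A simplified sketch (the real schema above tracks aliases and cross-links; this toy version pulls out just three fields):

import json, re

sentence = ("CaCu3Ti4-xCoxO12 is a doped result with descriptor ceramic and "
            "phase cubic from base material CaCu3Ti4O12 and dopant Co + 2.")

m = re.search(r"^(?P<result>\S+) is a doped result .*?"
              r"from base material (?P<base>\S+) .*?dopant (?P<dopant>.+?)\.$",
              sentence)
record = {"result": m["result"], "basemat": m["base"], "dopant": m["dopant"]}
print(json.dumps(record, indent=2))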
48. For example, we hope to parse a literature-derived
database of dopants and dopability
48
With this capability, we plan to release structured materials
properties databases based on NLP parsing of literature
Sentence | Base material | Dopant | Doping concentration
"…the influence of yttrium doping (0-10mol%) on BSCF…" | BSCF | Yttrium | 0-10 mol%
"undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln. of Mg10Si2Sn3…" | Mg10Si2Sn3 | Sb, Bi, Ca, Zn |
"The zT of As2Cd3 with electron doping is found to be ~ with n=10^20cm-3" | As2Cd3 | electron | n=10^20cm-3
"This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type As2Cd3" | As2Cd3 | p-type | p=10^20cm-3
"The undoped and 0.25wt% La doped CdO films show 111… …however, … for doping concentrations greater than 0.50wt%." | CdO | La | 0.25wt%, >0.5wt%
Which elements are commonly doped
into the same materials (i.e., co-occur
as dopants)?
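Given such records, the dopant co-occurrence question reduces to counting pairs. A sketch over hypothetical extracted records mirroring the rows above:

from collections import Counter
from itertools import combinations

records = [                                  # one record per base material
    {"base": "BSCF", "dopants": ["Y"]},
    {"base": "Mg10Si2Sn3", "dopants": ["Sb", "Bi", "Ca", "Zn"]},
    {"base": "CdO", "dopants": ["La"]},
]

pair_counts = Counter()
for rec in records:
    for a, b in combinations(sorted(set(rec["dopants"])), 2):
        pair_counts[(a, b)] += 1             # dopant pairs that share a host

print(pair_counts.most_common(5))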
49. • Experimentalists identified relevant factors for gold
nanorod dimensions
– Experimental temperatures
– Solution ages/timing
– Precursor amounts
49
We can also tackle complex syntheses if we can do
entity relationship modeling
Seq2Seq model outputs JSON
(form of entity graph)
"seed": {
"prec": {
"HAuCl4": {
"vol": "5 mL",
"concn": "0.25 mM"
},
"CTAB": {
"vol": "HAuCl4",
"concn": "0.1 M"
},
"NaBH4": {
"vol": "0.3 mL",
"concn": "10 mM"
}
},
"seed": {
"size": "3 nm"
},
"temp": "25 degC",
"age": "5 min"
},
Types of factors important in synthesis
Values as extracted from raw text
50. 50
Tests on Au Nanorod Synthesis indicate it is working
Aggregated scores by AuNR recipe component:
                                 | Seed Solution | Growth Solution | AuNR
Entity detected (F1 score)       | 0.94          | 0.92            | 0.76
Exact match to entity (accuracy) | 0.73          | 0.77            | 0.52
Support                          | 159           | 244             | 96
(Seed Solution: age, stir rate, temperature, precursor properties, seed properties. Growth Solution: age, stir rate, temperature, precursor properties. AuNR: aspect ratios, lengths, widths, TSPRs, and LSPRs.)
Evaluated on 40 test paragraphs
Trained on 40 (manual annotation) and 200 (assisted) paragraphs
Entity detected = We correctly detected the types of synthesis information present
Exact match = The extracted synthesis information is an exact string match
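For clarity, the exact-match score can be computed along these lines (a toy version that ignores the per-component aggregation used in the table):

def exact_match_accuracy(gold, pred):
    # Fraction of gold entity strings reproduced verbatim by the extractor
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["0.25 mM", "5 mL", "3 nm"]
pred = ["0.25 mM", "5mL", "3 nm"]            # "5mL" misses a space -> not exact
print(round(exact_match_accuracy(gold, pred), 2))   # 0.67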
51. The types of features we want to enable
51
(Recap of the feature schematic from slide 4; the suggested-synthesis-protocol capability is marked "???" as still open.)
52. 52
Note –
we are creating open-source libraries to help with NLP tasks
https://github.com/lbnlp
53. • There exists a lot of data and knowledge in the
historical corpus of scientific journal articles, but
getting the knowledge has been difficult to do on
a large scale
• Machine learning presents a new frontier for
being able to make use of this information
53
Conclusion
54. 54
The Matscholar team
Funding from:
Slides (already) posted to
hackingmaterials.lbl.gov
John Dagdelen
Alex Dunn
Viktoriia Baibakova
Nick Walker
Sanghoon Lee
Kristin Persson
Anubhav Jain
Gerbrand Ceder
Alumni: Leigh Weston, Vahe Tshitoyan, Amalie Trewartha