This document discusses natural language processing (NLP) techniques for materials design using information from millions of journal articles. It begins with an overview of how materials are typically discovered and optimized over decades before discussing how NLP could help address this challenge. The document then provides a high-level view of how NLP is used to extract and analyze information from millions of materials science abstracts, including data collection, tokenization, training machine learning models on labeled text, and using the models to automatically extract entities. Examples are given of how word embeddings can encode scientific concepts and relationships in ways that allow predicting promising new materials for applications like thermoelectrics. The talk concludes by discussing future directions for the NLP work.
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
This presentation is intended as a high-level introduction for to deep learning and its applications in materials science. The intended audience is materials scientists and engineers
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
This presentation is intended as a high-level introduction for to deep learning and its applications in materials science. The intended audience is materials scientists and engineers
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasAngelo Salatino
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasAngelo Salatino
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
Applying machine learning techniques to big data in the scholarly domainAngelo Salatino
Slides of the Lecture at the 5th International School on Applied Probability Theory,Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
The Royal Society of Chemistry has an archive of published journals and books stretching back to 1841. In the past decade we have digitized this archive and semantically enriched our frontfile data with chemical structures linked to our free online chemical compound database, ChemSpider. In this talk we will survey our recent efforts to extract all kinds of data – chemical structures, experimental and bibliographic data – from both our backfile and frontfile. We will also discuss our future work to extract chemical reactions to host in our ChemSpider Reactions database and will discuss the potential applications of optical structure recognition technologies for converting structure images to structures as well as using similar techniques to convert experimental spectral data into interactive data formats. A key aspect of this project is the delivery of a crowdsourcing platform for the interactive annotation and validation of the extracted data.
In this talk we will summarise some of the detectable trends on AI beyond deep learning. We will focus on the current transition from deep learning to deep semantics, describing the enabling infrastructures, challenges and opportunities in the construction of the next generation AI systems. The talk will focus on Natural Language Processing (NLP) as an AI sub-domain and will link to the research at the AI Systems Lab at the University of Manchester.
The volume of the data is directly proportional to the model's accuracy in data analytics for any particular domain. Once a developing field or discipline becomes apparent, the scarcity of the data volume becomes a challenging proponent for the correctness of a model and prediction. In the proposed state-of-the-art, a transitive empirical method has been used within the same contextual domain to extract features from a low-resource part via a heterogeneous field with factual data. Even though an example of text processing has been used for brevity, it is not limited. The success rate of the proposed model is 78.37%, considering model performance. But when considering human subject matter experts, the accuracy is 81.2%.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
This pdf is about the Schizophrenia.
For more details visit on YouTube; @SELF-EXPLANATORY;
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes
on Io’s surface have been monitored from both spacecraft and ground-based telescopes.
Here, we present the highest spatial resolution images of Io ever obtained from a groundbased telescope. These images, acquired by the SHARK-VIS instrument on the Large
Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images
show that a plume deposit from a powerful eruption at Pillan Patera has covered part
of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive
optics at visible wavelengths.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Kreb's cycle. The Kreb's cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Cancer cell metabolism: special Reference to Lactate Pathway
Materials design using knowledge from millions of journal articles via natural language processing techniques
1. Materials design using knowledge from millions of
journal articles via natural language processing
techniques
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
IMX Virtual Seminar, April 11 2021
Slides (already) posted to hackingmaterials.lbl.gov
2. • Often, materials are known for several decades
before their functional applications are known
– MgB2 sitting on lab shelves for 50 years before its
identification as a superconductor in 2001
– LiFePO4 known since 1938, only identified as a Li-ion
battery cathode in 1997
• Even after discovery, optimization and
commercialization still take decades
• How is this typically done?
2
Typically, both new materials discovery and optimization
take decades
3. What constrains traditional approaches to materials design?
3
“[The Chevrel] discovery resulted from a lot of
unsuccessful experiments of Mg ions insertion
into well-known hosts for Li+ ions insertion, as
well as from the thorough literature analysis
concerning the possibility of divalent ions
intercalation into inorganic materials.”
-Aurbach group, on discovery of Chevrel cathode
for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid, Aurbach
J. Electroceramics (2009)
4. 4
Researchers are starting to fundamentally re-think how we
invent the materials that make up our devices
Next-
generation
materials
design
Computer-
aided
materials
design
Natural
language
processing
“Self-driving
laboratories”
6. 6
Can ML help us work through our backlog of information we
need to assimilate from text sources?
papers to read “someday”
NLP algorithms
7. • It is difficult to look up all information any given material
due to the many different ways chemical compositions
are written
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
– a search for “SnBi4Te7” won’t match text that reads “we studied
SnBi4X7 (X=S, Se, Te)”.
– a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest
“CuCrSe2” as a similar result
• It is difficult to ask questions or compile summaries, e.g.:
– What is the band gap of “Si”?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
7
Traditional search doesn’t answer the questions we want
8. What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science, connecting
together topics of study, synthesis and
characterization methods, and specific materials
compositions
• It is also an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
9. One of our main projects concerns named entity
recognition, or automatically labeling text
9
This allows for search
and is crucial to
downstream tasks
10. 1
0
> 4 million
Papers Collected
31 million
Properties
19 million
Materials Mentions
8.8 million
Characterization Methods
7.5 million
Applications
5 million
Synthesis Methods
•Data Collection: Over 4 million papers
collected from more than 2100 journals.
Note – entities are currently extracted only from the abstracts of the papers
11. 11
Now we can search!
Live on www.matscholar.com
Live demo
12. • The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!
12
Limitations (it is not perfect)
13. 13
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
14. Extracted 4 million abstracts
of relevant scientific articles
using various APIs from
journal publishers
Some are more difficult than
others to obtain.
Data cleaning is often
needed (e.g., stray HTML
tags, copyright statements)
Abstract collection
continues …
14
Step 1 – data collection
15. 15
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
16. • First split the text into sentences
– Seems simple, but remember edge cases like ”et al.” or
“etc.” does not necessarily signify end of sentence despite
the period
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Historically done with ChemDataExtractor* with
some custom improvements
– We are moving towards a fully custom tokenizer
16
Step 2 - tokenization
*http://chemdataextractor.org
17. 17
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
18. • Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
18
Step 3 – hand label abstracts
19. 19
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
20. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word
meaning based on trying to
predict context words
around the target
20
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
21. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word
meaning based on trying to
predict context words
around the target
21
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
22. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
22
Word embeddings trained on ”normal” text learns
relationships between words
23. 23
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
24. • If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
24
Step 4b: How do we train a model to recognize context?
25. 25
Step 4b.An LSTM neural net classifies words by reading
word sequences
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
26. 26
Ok so how does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
27. 27
Step 5. Let the model label things for you!
Named Entity Recognition
X
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• f1 scores of ~0.9. f1 score for inorganic
materials extraction is >0.9.
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
28. 28
Could these techniques also be used to predict which
materials we might want to screen for an application?
papers to read “someday”
NLP algorithms
29. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
29
Remember that word embeddings seem to learn
relationships in text
30. 30
For scientific text, it learns scientific concepts as well
crystal structures of the elements
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
31. 31
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
32. 32
Note that more data is not always better!
We want relevance
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
33. 33
Word embeddings also have the periodic table encoded in it
with no prior knowledge
“word embedding”
periodic table
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
34. • Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
34
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
35. 35
Try ”going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
36. – For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
what materials are the
most promising
thermoelectrics for
data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years 36
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
37. 37
We also published a list of potential new thermoelectrics
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
It is one thing to
retroactively test, but
perhaps another to see
how things go after
publication
38. 38
Two were studied between submission and publication of
manuscript
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
39. 39
More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
40. 40
More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
41. 41
More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
https://arxiv.org/abs/2010.08461
42. 42
Our collaborators also synthesized a prediction, finding a
moderate zT of 0.14
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
43. 43
How is this working?
“Context
words” link
together
information
from different
sources
45. 45
1.Automatic creation of structured materials databases from
the literature, e.g. doping database
Sentence Base
Material
Dopant Doping
Concentr.
…the influence of yttrium doping (0-10mol%) on BSCF… BSCF Yttrium 0-10 mol%
undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln.
of Mg10Si2Sn3…
Mg10Si2Sn3 Sb, Bi, Ca, Zn
The zT of As2Cd3 with electron doping is found to be ~ with
n=10^20cm-3
As2Cd3 electron n=10^20cm-3
This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type
As2Cd3T
As2Cd3 p-type p=10^20cm-3
The undoped and 0.25wt% La doped CdO films show 111…
…however, …. for doping concentrations greater than 0.50wt%.
CdO La 0.25wt%,
>0.5%
Will allow you to answer questions like “what
are all the materials known to be doped with
Eu3+” ?
47. 47
2. Learning representations of materials
● Mat2vec suggested that embeddings contain chemical information
● Can we make embeddings for arbitrary materials as material descriptors?
● i.e., word embeddings for materials not in the literature
● Descriptors could be used for direct classification for application (link
prediction), or quantitative property prediction (regression features)
49. 49
Initial results – predicting experimental band gap from
composition (~3000 data points)
50. 50
3. Creating a comprehensive software library for materials
science NLP research (multiple LBNL research groups)
https://github.com/lbnlp
51. 51
4. Getting data from figures
Original figure Data snippet
extracted fully
automatically
Replotted data
Can we automatically extract structured information from figures?
52. 52
4a. Detecting the various regions of the plot
(a) & (b): Human
labeling of axes and
legend regions (141
training figures)
(c) – (f): Model
predictions based on
“faster_cnn_inception”
model
53. 53
4b.Automatically reading the axis scales
1. Detect numbers using
a custom configuration
of the EasyOCR package
2. Develop algorithm to
detect exponents based
on size and position
3. Set tick
marks at
center of text
height
Number
of
entries
POWER
BASE
a. Digit detection
(previous step)
b. Compile
coordinates of
detected numbers
c. Separate into groups of
low height and high height
54. 54
4c. Getting the data curves (color-based)
Starting image
1. Automatically detect distinct
colors using iterative k-means
clustering
2. Decide which color
channels contain
relevant data
56. • There is a lot of data and knowledge in the
historical corpus of scientific journal articles, but
getting the knowledge has been difficult to do on
a large scale
• Machine learning presents a new frontier for
being able to make use of this information
56
Conclusion
57. 57
The Matscholar team
Kristin Persson
Anubhav Jain
Gerbrand Ceder
John
Dagdelen
Leigh
Weston
Vahe
Tshitoyan
Amalie
Trewartha
Alex
Dunn
Viktoriia
Baibakova
Funding from
(now at Google) (now at Medium)
Slides (already) posted to
hackingmaterials.lbl.gov
+ DOE ARPA-e (figure extraction)