Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical data repository that provides information on various chemical entities, including small molecules, siRNA, miRNA, peptides, lipids, carbohydrates, chemically modified biologics, etc. One of the most commonly requested tasks in PubChem is to search for a compound by chemical name (also commonly called “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. These name-structure associations are used to create links between chemicals and Medical Subject Headings (MeSH) terms, which in turn are used to generate associations between chemicals and PubMed articles. The accuracy of these depositor-provided synonym-structure associations is dependent upon two important quality control methods used in PubChem: (1) chemical structure standardization and (2) synonym filtering based on crowd voting. In this presentation, we will discuss the two quality control methods and their effects on the chemical synonym-structure associations.
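The idea of filtering depositor-provided names by crowd voting can be illustrated with a minimal sketch. This is not PubChem's actual algorithm: the function name `filter_synonyms`, the `(depositor, name, structure_id)` tuple layout, and the `min_share` majority threshold are all assumptions made for illustration. It assumes each structure identifier already refers to a standardized structure, so that votes for the same chemical drawn in different ways are pooled.

```python
from collections import Counter, defaultdict


def filter_synonyms(depositions, min_share=0.5):
    """Return {name: structure_id} for names whose top structure wins a majority.

    depositions: iterable of (depositor, name, structure_id) tuples, where
    structure_id identifies the structure after standardization (assumed).
    """
    votes = defaultdict(Counter)
    seen = set()
    for depositor, name, structure_id in depositions:
        key = (depositor, name.lower(), structure_id)
        if key in seen:
            # Each depositor gets one vote per name-structure association.
            continue
        seen.add(key)
        votes[name.lower()][structure_id] += 1

    accepted = {}
    for name, counter in votes.items():
        winner, count = counter.most_common(1)[0]
        # Keep the association only if the winner holds more than min_share
        # of all votes cast for this name; otherwise the name is ambiguous.
        if count / sum(counter.values()) > min_share:
            accepted[name] = winner
    return accepted
```

With this hypothetical rule, a name deposited by three sources with two of them pointing at the same standardized structure is kept, while a name split evenly between two structures is dropped as ambiguous.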
Automated Extraction of Reactions from the Patent Literature (dan2097)
We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full text chemical literature. OSCAR4 is used to recognise chemical entities and resolve to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. Chemical Tagger performs part of speech tagging allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases (NextMove Software)
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reaction details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross-concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leaves in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
Need and benefits for structure standardization to facilitate integration and... (Valery Tkachenko)
There are a large number of US government databases housing diverse collections of chemical data including bioassay data (PubChem), toxicity data (CompTox Chemistry Dashboard) and environmental data (a large collection of EPA databases), to name just a few. In many cases integration between the databases, at the chemical structure level, is via alphanumeric text identifiers such as CAS Numbers, or via InChIs (International Chemical Identifiers). Structure-based integration is highly dependent on the initial inputs providing the chemical structures to the InChI generation algorithm. To ensure optimal integration between various databases, community standards and agreement regarding standardization of chemical structures would be beneficial, not only to integration of US government databases and resources but also to the international scientific community and hosts of online databases. This presentation will discuss our progress to deliver a fully Open Source chemical standardization platform as an exemplar for the community to build on and enhance. The system utilizes the CDK (Chemistry Development Kit), RDKit and other open source components. The resource expands on our previous work regarding the Chemical Validation and Standardization Platform and has been tested using the open data collection provided by the EPA CompTox Chemistry Dashboard.
An overview of what we do to curate and annotate small molecules and how it's the basis of Chemmantis. A presentation given to the PDB team at Rutgers University.
CINF 170: Regioselectivity: An application of expert systems and ontologies t... (NextMove Software)
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts: a technical part describing the recent algorithmic and methodological improvements to the namerxn software, including some of the more challenging of the 1000+ reactions it currently identifies, and a scientific part that investigates the regioselective preferences of some of these reactions.
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS (SimBioSys_Inc)
Underpinning the computer-aided synthesis design system, ARChem, are algorithms that extract synthetic knowledge from large reaction databases. The generation of reaction rules that facilitate retrosynthetic analysis, as well as the extraction of information about expected yields, regioselectivity, functional group compatibility, and stereochemistry, are discussed in these slides.
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert... (NextMove Software)
Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education help ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick's Handbook or the internet, is unavailable, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.
In this talk, we describe our attempts to encode the Environmental Protection Agency's (EPA's) guidance entitled 'A Method for Determining Compatibility of Hazardous Waste', 1980, in an XML file format. Typical current state-of-the-art methods for alerting potential chemical safety hazards, for example in ELNs, simply annotate reactants with codes extracted from their MSDS/SDS data sheets, such as Globally Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules) to capture known incompatibilities between reagents that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2) while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical, allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format include the ability to specify reactant quantities and the use of predicted properties (such as of products) in rules.
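A hierarchical alert encoding of this kind might look like the following sketch. The XML element and attribute names (`alert`, `partner`, `kind`, `overrides`) are invented for illustration and are not the format described in the talk. To stay self-contained, the sketch also assumes that chemical classes have already been assigned to each reactant by matching the SMARTS patterns with a cheminformatics toolkit, so only class labels and InChI strings are compared here.

```python
import xml.etree.ElementTree as ET
from itertools import permutations

# Hypothetical alert file: one general (class-level) alert and one specific
# (compound-level) alert that overrides it when both would fire.
ALERTS_XML = """
<alerts>
  <alert id="ketone-peroxide">
    <partner kind="class" smarts="[#6][CX3](=O)[#6]">ketone</partner>
    <partner kind="class" smarts="[OX2][OX2]">peroxide</partner>
    <message>Ketones may form explosive peroxides with peroxides.</message>
  </alert>
  <alert id="acetone-h2o2" overrides="ketone-peroxide">
    <partner kind="inchi">InChI=1S/C3H6O/c1-3(2)4/h1-2H3</partner>
    <partner kind="inchi">InChI=1S/H2O2/c1-2/h1-2H</partner>
    <message>Acetone and hydrogen peroxide can form acetone peroxide.</message>
  </alert>
</alerts>
"""


def load_alerts(xml_text):
    """Parse the hypothetical alert XML into a list of plain dictionaries."""
    alerts = []
    for node in ET.fromstring(xml_text.strip()).iter("alert"):
        alerts.append({
            "id": node.get("id"),
            "overrides": node.get("overrides"),
            "partners": [(p.get("kind"), p.text.strip())
                         for p in node.findall("partner")],
            "message": node.findtext("message").strip(),
        })
    return alerts


def matches(partner, reactant):
    """Compare one alert partner against one pre-annotated reactant."""
    kind, value = partner
    if kind == "inchi":
        return reactant["inchi"] == value
    return value in reactant["classes"]  # class labels assigned upstream


def fires(alert, reactants):
    """An alert fires when its two partners match two different reactants."""
    first, second = alert["partners"]
    return any(matches(first, r1) and matches(second, r2)
               for r1, r2 in permutations(reactants, 2))


def relevant_alerts(alerts, reactants):
    """Return ids of fired alerts, dropping those overridden by a specific one."""
    fired = [a for a in alerts if fires(a, reactants)]
    overridden = {a["overrides"] for a in fired if a["overrides"]}
    return [a["id"] for a in fired if a["id"] not in overridden]
```

In this sketch, a reaction containing acetone and hydrogen peroxide triggers only the compound-level alert, while any other ketone paired with a peroxide falls back to the class-level alert.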
CINF 35: Structure searching for patent information: The need for speed (NextMove Software)
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases is also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.
Communication of chemistry in the internet era, while it has improved, remains challenged in terms of the exchange of data in a lossless fashion. While there are moves afoot within the publishing industry to produce “data journals”, including embracing some of the new approaches for making data available to the community, many challenges remain. Chemistry data sharing, at even the most basic level, remains a challenge for many chemistry journals. The vast majority of chemistry data is provided as PDF files or trapped on webpages and therefore not available for reuse and repurposing without a significant amount of effort to extract the data. Some of the responsibility resides with the scientists who need to be educated and encouraged in the adoption of appropriate exchange formats and utilization of online platforms for data hosting and dissemination. There are certain practices which, if adopted, could increase both the availability and utility of data for the community. This includes recognition that data, in itself, has value above and beyond inclusion in peer-reviewed publications, the adoption of standard (not necessarily open) formats, clear data licensing, and distribution of the data across multiple platforms. This presentation will provide an overview of ongoing efforts within the National Center for Computational Toxicology to publish chemistry data, both in databases and associated with peer-reviewed publications, in a manner that makes our data and models consumable by the community.
This abstract does not reflect U.S. EPA policy.
Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continue to expand while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform offers the ability for crowdsourcing, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.
Chemical Health and Safety Information in PubChem (Sunghwan Kim)
Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
Risk assessment in laboratories requires ready access to health and safety (H&S) information for many different chemicals used in laboratory work. Because chemical H&S data in the public domain are scattered across many websites, it is essential to create a centralized data repository that collects, organizes, and disseminates these data. An example is PubChem (https://pubchem.ncbi.nlm.nih.gov), developed and maintained by the U.S. National Institutes of Health.
PubChem contains a substantial corpus of H&S information of chemicals collected from authoritative government agencies and international organizations. PubChem’s H&S data include flammability, toxicity, exposure limits, exposure symptoms, first aid, handling, clean-up procedure, GHS symbols, and more. In addition, for 100,000+ compounds, PubChem provides a tailored data view called the Laboratory Chemical Safety Summary (LCSS), which presents pertinent H&S data for a given compound. The complete list of chemicals with an LCSS can be accessed through the PubChem LCSS project webpage (https://pubchemdocs.ncbi.nlm.nih.gov/lcss/) or the PubChem Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72). If desired, LCSS data can be downloaded from the LCSS page for each compound, or in bulk from the PubChem LCSS project webpage, enabling local annotation of the data to support specific procedures in place at an institution. The LCSS page can be readily accessed from a mobile device using a chemical QR code.
ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) In-silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will and screening using a molecular surface descriptor model.
Presented at the Bioinformatics Seminar at the University of Arkansas, Little Rock on November 5, 2021.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical database at the National Library of Medicine, National Institutes of Health. Arguably, PubChem is one of the largest chemical information resources in the public domain, with 111 million unique chemical structures, 1.39 million biological assays, and 292 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem is a key resource for big data in chemistry and has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). It has also been used for cheminformatics education as well as chemical health and safety training. This presentation provides a high-level overview of PubChem’s data, tools, and services.
PubChem for chemical information literacy training (Sunghwan Kim)
Presented at the American Chemical Society Fall 2021 National Meeting (August 23, 2021; virtual).
==== Abstracts ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that collects chemical information from 780+ data sources. It is visited by millions of users every month, and many of them are undergraduate or graduate students at academic institutions. While PubChem has great potential as an online resource for chemical education, it also has important issues that are not familiar to students and educators, including data accuracy, data provenance, structure standardization, terminologies, etc. In this presentation, various aspects of PubChem as a chemical education resource will be discussed, with a special emphasis on how to help students develop chemical information literacy skills.
Similar to Chemical Structure Standardization and Synonym Filtering in PubChem
An overview of what we do to curate and annotate small molecules and how it's the basis of Chemmantis. A presentation given to the PDB team at Rutgers University
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts. A technical part describing the recent algorithmic and methodological improvements to the namerxn software, including describing some of the more challenging of the 1000+ reactions it currently identifies. And a scientific part that investigates the regioselective preferences of some of these reactions.
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACSSimBioSys_Inc
Underpinning the computer-aided synthesis design system, ARChem, are algorithms that extract synthetic knowledge from large reaction databases. The generation of reaction rules that facilitate retrosynthetic analysis, as well as the extraction of information about expected yields, regioselectivity, functional group compatibility, and stereo-chemistry are discussed in these slides.
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...NextMove Software
Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education helps ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick's Handbook or the internet, is readily available, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.
In this talk, we describe our attempts to encode the Environmental Protection Agency's (EPA's) guidance entitled 'A Method for Determining Compatibility of Hazardous Waste', 1980, in an XML file format. Typical current state-of-the-art methods for alerting potential chemical safety hazards, for example in ELNs, simply annotate reactants with codes extracted from their MSDS/SDS data sheets, such as Global Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules), to capture known reagent incompatibilities, that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2) while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format, include the ability to specify reactant quantities, and the use of predicted properties (such as of products) in rules.
CINF 35: Structure searching for patent information: The need for speedNextMove Software
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases are also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.
Communication of chemistry in the internet era, while it has improved, remains challenged in terms of the exchange of data in a lossless fashion. While there are moves afoot within the publishing industry to produce “data journals”, including embracing some of the new approaches for making data available to the community, many challenges remain. Chemistry data sharing, at even the most basic level, remains a challenge for many chemistry journals. The vast majority of chemistry data is provided as PDF files or trapped on webpages and therefore not available for reuse and repurposing without a significant amount of effort to extract the data. Some of the responsibility resides with the scientists who need to be educated and encouraged in the adoption of appropriate exchange formats and utilization of online platforms for data hosting and dissemination. There are certain practices which, if adopted, could increase both the availability and utility of data for the community. This includes recognition that data, in itself, has value above and beyond inclusion in peer-reviewed publications, the adoption of standard (not necessarily open) formats, clear data licensing, and distribution of the data across multiple platforms. This presentation will provide an overview of ongoing efforts within the National Center for Computational Toxicology to publish chemistry data, both in databases and associated with peer-reviewed publications, in a manner that makes our data and models consumable by the community.
This abstract does not reflect U.S. EPA policy.
Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze, the devices, and the shift in community participation continue to expand, while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform offers crowdsourcing capabilities, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining, and semantic markup. ChemSpider is limited in scope as a chemical compound database and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.
Chemical Health and Safety Information in PubChem
Sunghwan Kim
Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
Risk assessment in laboratories requires ready access to health and safety (H&S) information for many different chemicals used in laboratory work. Because chemical H&S data in the public domain are scattered across many websites, it is essential to create a centralized data repository that collects, organizes, and disseminates these data. An example is PubChem (https://pubchem.ncbi.nlm.nih.gov), developed and maintained by the U.S. National Institutes of Health.
PubChem contains a substantial corpus of H&S information of chemicals collected from authoritative government agencies and international organizations. PubChem’s H&S data include flammability, toxicity, exposure limits, exposure symptoms, first aid, handling, clean-up procedure, GHS symbols, and more. In addition, for 100,000+ compounds, PubChem provides a tailored data view called the Laboratory Chemical Safety Summary (LCSS), which presents pertinent H&S data for a given compound. The complete list of chemicals with an LCSS can be accessed through the PubChem LCSS project webpage (https://pubchemdocs.ncbi.nlm.nih.gov/lcss/) or the PubChem Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72). If desired, LCSS data can be downloaded from the LCSS page for each compound, or in bulk from the PubChem LCSS project webpage, enabling local annotation of the data to support specific procedures in place at an institution. The LCSS page can be readily accessed from a mobile device using a chemical QR code.
ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases, and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) in silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. These include QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships and screening using a molecular surface descriptor model.
Presented at the Bioinformatics Seminar at the University of Arkansas, Little Rock on November 5, 2021.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical database at the National Library of Medicine, National Institutes of Health. Arguably, PubChem is one of the largest chemical information resources in the public domain, with 111 million unique chemical structures, 1.39 million biological assays, and 292 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem is a key resource for big data in chemistry and has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). It has also been used for cheminformatics education as well as chemical health and safety training. This presentation provides a high-level overview of PubChem’s data, tools, and services.
PubChem for chemical information literacy training
Sunghwan Kim
Presented at the American Chemical Society Fall 2021 National Meeting (August 23, 2021; virtual).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that collects chemical information from 780+ data sources. It is visited by millions of users every month, many of whom are undergraduate or graduate students at academic institutions. While PubChem has great potential as an online resource for chemical education, it also has important issues that are not familiar to students and educators, including data accuracy, data provenance, structure standardization, terminologies, etc. In this presentation, various aspects of PubChem as a chemical education resource will be discussed, with a special emphasis on how to help students develop chemical information literacy skills.
PubChem: A Public Chemical Information Resource for Big Data Chemistry
Sunghwan Kim
A web-seminar jointly organized by KWSE (Korean Woman Scientists & Engineers) and KWiSE (Korean-American Women in Science and Engineering). Presented on July 27, 2021.
PubChem for drug discovery in the age of big data and artificial intelligence
Sunghwan Kim
Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 10, 2021).
==== Abstract ====
With the emergence of the age of big data and artificial intelligence, biomedical research communities have a great interest in exploiting the massive amount of chemical and biological data available in the public domain. PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the largest sources of publicly available chemical information, with more than 270 million substance descriptions, 110 million unique compounds, and 285 million bioactivity outcomes from more than one million biological assay experiments. PubChem provides a wide range of chemical information, including structure, pharmacology, toxicology, drug target, metabolism, chemical vendors, patents, regulations, clinical trials, and many others. These contents can be accessed interactively through web browsers as well as programmatically using computer scripts. They can also be downloaded in bulk through the PubChem File Transfer Protocol (FTP) site. PubChem data have been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of PubChem data, tools, and services useful for drug discovery.
PubChem and its application for cheminformatics education
Sunghwan Kim
Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 9, 2021).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a chemical information resource, developed and maintained by the U.S. National Institutes of Health. It contains a large corpus of publicly available chemical data collected from more than 700 data sources. Visited by millions of users every month, it serves a wide range of audiences, from scientific communities to the general public. Considering that many PubChem users are undergraduate and graduate students at academic institutions, it has great potential as a cheminformatics education resource. In this presentation, we will give a brief overview of PubChem’s data content, tools, and services. Important aspects of PubChem as a cheminformatics education resource will be discussed, including data quality and accuracy, data provenance and governance, and structure standardization. In addition, we will discuss PubChem’s education and outreach efforts, including the PubChem Laboratory Chemical Safety Summary (LCSS) and the Cheminformatics On-Line Chemistry Course (OLCC).
Cheminformatics Online Chemistry Course (OLCC): A Community Effort to Introdu...
Sunghwan Kim
Presented at the American Chemical Society (ACS) Spring 2021 National Meeting (Virtual, April 16, 2021).
==== Abstract ====
Computer and informatics skills to handle an ever-increasing amount of chemical information are considered important for students pursuing STEM careers in the age of big data. However, many schools do not offer a cheminformatics course or alternative training opportunities. The Cheminformatics Online Chemistry Course (OLCC) is a community effort to introduce cheminformatics content into the undergraduate chemistry curriculum. It is a highly collaborative teaching project involving instructors at multiple schools as well as external cheminformatics experts recruited across sectors, including academia, government, and industry. Three Cheminformatics OLCCs were offered in the Fall 2015, Spring 2017, and Fall 2019 semesters. In each OLCC, the instructors at participating schools would meet face-to-face with the students, while external cheminformatics experts engaged through online discussions across campuses with both the instructors and students. All the material created in the course has been made available at the open education repositories of LibreTexts and CCCE websites for other institutions to adapt to their future needs. This presentation describes the instructional approaches of the Cheminformatics OLCC project and the lessons learned from this community effort. We also discuss future directions for this project as well as cheminformatics education in general, including pedagogy, resources, and course content.
Cheminformatics Education with PubChem
Sunghwan Kim
Presented on November 13, 2020, as part of the "Integrating Bioinformatics Education Series" (https://ualr.edu/bioinformatics/education-series/), organized by the Arkansas IDeA Network of Biomedical Research Excellence (Arkansas INBRE) (https://inbre.uams.edu/).
Sunghwan Kim
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
PubChem as an Emerging Toxicological Information Resource
Sunghwan Kim
Presented on October 20, 2020 at the 9th American Society for Cellular and Computational Toxicology (ASCCT) National Meeting.
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource at the U.S. National Institutes of Health. It collects chemical information from 750+ data sources and disseminates it to the public free of charge. Arguably, PubChem contains the largest amount of chemical information available in the public domain, with more than 265 million depositor-provided substance descriptions, 100 million unique chemical structures, and 270 million bioactivity outcomes from one million assays covering around twenty thousand unique protein target sequences.
Included in the many types of content in PubChem is toxicological information about chemicals, e.g., human and animal toxicity, ecotoxicity, exposure limits, exposure symptoms, and antidote & emergency treatment. Notably, a substantial amount of toxicological information from resources formerly offered by the TOXicology data NETwork (TOXNET) is now integrated into PubChem, e.g., the Hazardous Substances Data Bank (HSDB), Genetic Toxicology Data Bank (Gene-Tox), Chemical Carcinogenesis Research Information System (CCRIS), LactMed, and LiverTox. In addition, PubChem contains a large amount of bioactivity and toxicity screening data that can be used to build toxicity prediction models based on statistical and machine-learning approaches. This presentation provides an overview of PubChem’s toxicological information and describes how open data in PubChem can be used to develop prediction models for chemical toxicity.
PubChem: a public chemical information resource for big data chemistry
Sunghwan Kim
Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.
==== Abstract ====
The idea of “big data” has recently been drawing much attention from the scientific community as well as the general public. An example of big data in chemistry is the data contained in PubChem, which is a public database of chemical substance descriptions and their biological activities at the National Institutes of Health. PubChem is a sizeable system with 235 million depositor-provided substance descriptions, 96 million unique chemical structures, 1.1 million biological assays, and 268 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering multi-target ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of how PubChem’s data, tools, and services can be used for bioassay data analysis and virtual screening (VS) and discusses important aspects of exploiting PubChem for drug discovery.
PubChem as a resource for chemical information education
Sunghwan Kim
Presented at the Fall 2020 American Chemical Society (ACS) National Meeting (Virtual) on August 20, 2020.
Sunghwan Kim & Evan Bolton
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that contains one of the largest corpora of publicly available chemical information. It is one of the top five most visited chemistry web sites in the world, with more than four million unique users per month (as of April 2020). Considering that many PubChem users are undergraduate students in academic institutions, PubChem has great potential as an online resource for chemical education. However, it also has some important issues with data accuracy, data provenance, structure standardization, terminologies, and so on, because PubChem is essentially a data aggregator that collects heterogeneous data from 700+ data sources in various domains. This presentation will discuss various aspects of PubChem as a chemical information education resource. In particular, it will focus on how to help students develop the ability to critically assess chemical information available in PubChem and other public databases.
Presented at the Fall 2020 American Chemical Society (ACS) National Meeting (Virtual) on August 20, 2020.
Sunghwan Kim, Jian Zhang, Paul Thiessen, Asta Gindulyte, Pertti J. Hakkinen & Evan Bolton
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource at the U.S. National Institutes of Health. It collects chemical information from 700+ data sources and disseminates the collected data to the public free of charge. Arguably, PubChem contains the largest amount of chemical information available in the public domain, with more than 250 million depositor-provided substance descriptions, 100 million unique chemical structures, and 265 million bioactivity outcomes from one million assays covering around twenty thousand unique protein target sequences.
Included in the many types of content in PubChem is toxicological information about chemicals, e.g., human and animal toxicity, ecotoxicity, exposure limits, exposure symptoms, and antidote & emergency treatment. Notably, a substantial amount of toxicological information from resources formerly offered by the TOXicology data NETwork (TOXNET) is now integrated into PubChem, e.g., the Hazardous Substances Data Bank (HSDB), LactMed, and LiverTox. In addition, PubChem contains a large amount of bioactivity and toxicity screening data that can be used to build toxicity prediction models based on statistical and machine-learning approaches. This presentation provides an overview of PubChem’s toxicological information as well as tools and services that help users exploit this information. It also describes how open data in PubChem can be used to develop prediction models for chemical toxicity.
Presented online at KSEA - Virginia Washington Metro Regional Conference 2020 (VWMRC 2020) (May 9, 2020)
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource, visited by millions of unique users per month. It contains chemical data from more than 700 data sources and disseminates these data to the public free of charge. Arguably, it is the largest source of publicly available chemical information, containing more than 250 million depositor-provided substance descriptions, 100 million unique chemical structures, and 260 million bioactivity outcomes from one million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery.
The immense quantity of bioactivity data in PubChem can be used to develop computational models to predict bioactivities of small molecules. While these data are primarily generated from high-throughput screening (HTS), they also include a substantial amount of bioactivity information extracted from peer-reviewed journal articles. In addition, through data integration with other databases, PubChem has a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity, 2-D and 3-D similarity, substructure, superstructure, and molecular formula. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem data with their own.
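As a rough illustration of the programmatic access mentioned above, the following minimal Python sketch builds a PUG-REST request URL for looking up properties of a compound by name. The helper name `property_url` is hypothetical and introduced only for this example; the sketch only constructs the URL and does not send a request, so rate limits and error handling are left out.

```python
from urllib.parse import quote

# Base URL of the PUG-REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties, output="JSON"):
    """Build a PUG-REST URL that requests compound properties by chemical name.

    `property_url` is a hypothetical helper written for this sketch.
    """
    # Percent-encode the name so that spaces and slashes are safe in the path.
    encoded_name = quote(name, safe="")
    prop_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{prop_list}/{output}"


if __name__ == "__main__":
    print(property_url("aspirin", ["MolecularFormula", "MolecularWeight"]))
```

The resulting URL can be opened in a browser or passed to any HTTP client as one step of an automated workflow.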
PubChem and Its Applications for Drug Discovery
Sunghwan Kim
Presentation delivered at Lehigh University (Bethlehem, PA) on Friday, April 26, 2019.
This presentation provides a brief introduction to PubChem and discusses how to use PubChem for drug discovery. More detailed information on this topic can be found in the following paper:
Getting the most out of PubChem for virtual screening.
Expert Opin Drug Discov. 2016 Aug 5; 11(9):843-55.
https://doi.org/10.1080/17460441.2016.1216967
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5045798/
Presentation delivered at Lehigh University (Bethlehem, PA) on Friday, April 26, 2019.
This presentation begins by discussing the history of the cheminformatics field. It also discusses the question "What makes cheminformatics different from bioinformatics?" (by comparing the ways in which molecules are described and compared in the two fields).
Searching for chemical information using PubChem
Sunghwan Kim
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (April 1, 2019). [CHED 303]
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical database, which provides information on a broad range of chemical entities, including small molecules, lipids, carbohydrates, and (chemically-modified) amino acid and nucleic acid sequences (including siRNA and miRNA). With three million unique users per month at peak, PubChem is ranked as one of the most visited chemistry websites in the world. A substantial number of PubChem users are between ages 18 and 24 and are likely to be undergraduate or graduate students at academic institutions. Therefore, PubChem has great potential as an online resource for chemical education. In this talk, we will present “PubChem Search”, a new web interface that allows users to quickly find desired chemical information. This interface supports chemical name search as well as various types of chemical structure search, including identity/similarity search, superstructure/substructure search, and molecular formula search. Using PubChem Search, it is also possible to search for journal articles or patent documents that mention a given chemical. The hits returned from a search can be downloaded to local machines or further refined or analyzed in conjunction with other PubChem tools and services. In this presentation, we will demonstrate how the PubChem Search interface can be used to search beyond Google for chemical information of interest.
PubChem as a resource for chemical information training
Sunghwan Kim
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (March 31, 2019). [CINF 13]
==== Abstract ====
Libraries at many large academic institutions provide chemical information training programs for students. However, these programs are based on commercial chemical information resources, which come with non-trivial subscription fees. These fees are often too expensive for small organizations, including many primarily undergraduate institutions (PUIs) and community colleges (CCs). This leads to disparities among students in access to chemical information as well as in learning opportunities. This issue may be addressed at least in part by developing free online training programs based on public chemical databases, such as PubChem (https://pubchem.ncbi.nlm.nih.gov). PubChem has great potential as an online resource for chemical education, but it also has important issues that students and teachers should keep in mind, such as data accuracy, data provenance, structure standardization, terminologies, and so on. In this presentation, we will discuss various aspects of PubChem as a resource for chemical information training.
Development of machine learning-based prediction models for chemical modulato...
Sunghwan Kim
Presented at the 2018 Research Festival at the National Institutes of Health (NIH) in Bethesda, MD (September 13, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, public-domain bioactivity data available in PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop machine learning-based prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using popular supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The general applicability of the developed models was evaluated with external data sets from ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for bioactivity of small molecules.
Using open bioactivity data for developing machine-learning prediction models...
Sunghwan Kim
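To make one of the listed methods concrete, here is a minimal pure-Python sketch of a k-nearest neighbors classifier. It is not the models from the study: the descriptor vectors and labels in `TRAIN` are hypothetical toy values rather than Tox21 data, and Euclidean distance is an assumption chosen for simplicity.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query vector, keep the closest k.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (descriptor vector, activity label). Not real screening results.
TRAIN = [
    ((0.1, 0.2), "inactive"),
    ((0.2, 0.1), "inactive"),
    ((0.0, 0.3), "inactive"),
    ((0.9, 0.8), "active"),
    ((0.8, 0.9), "active"),
    ((1.0, 0.7), "active"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (0.85, 0.75)))
    print(knn_predict(TRAIN, (0.15, 0.15)))
```

In practice, a model like this would be trained on one data set and then checked against an external set, mirroring the evaluation strategy described in the abstract.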
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 22, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, open bioactivity data available at PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using various supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The performance of the models was evaluated with an external data set containing bioactivity data submitted by ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for chemical toxicity.
Searching for patent information in PubChem
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource, containing more than 242 million chemical substance descriptions, 94 million unique compounds, and 234 million bioactivities determined from 1.25 million assay experiments. Importantly, data contribution from multiple sources, including IBM, SureChEMBL, ScripDB, NextMove, and BindingDB, allows PubChem to provide links to patent documents that mention chemicals. Currently, PubChem offers links between about 6.7 million patent documents and more than 20 million unique chemical structures, with over 137 million compound-patent links, covering primarily U.S. patents along with some European, World Intellectual Property Organization, and Japanese patent documents. This presentation will provide an overview of the patent information in PubChem as well as best practices for using it.
Exploiting PubChem for drug discovery based on natural products
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem is one of the largest sources of publicly available chemical information, with more than 242.3 million depositor-provided substance descriptions, 94.7 million unique chemical structures, and 234.8 million bioactivity outcomes from 1.25 million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery based on natural products.
PubChem contains a large amount of bioactivity data, most of which are generated from high-throughput screening (HTS). However, these data also include a substantial amount of bioactivity information extracted from scientific articles published in journals in the chemical biology, medicinal chemistry, and natural product domains, thanks to data contribution by other databases like ChEMBL, Guide to Pharmacology, BindingDB, and PDBbind. In addition, through data integration with other databases such as DrugBank, HSDB, and HMDB, PubChem contains a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity search, 2-D and 3-D similarity searches, substructure and superstructure searches, and molecular formula search. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem’s data with their own in-house data on a local computing machine.
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
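For readers unfamiliar with 2-D similarity searching, the sketch below ranks a small library against a query using the Tanimoto coefficient, a score commonly used to compare structural fingerprints. It is only an illustration under stated assumptions: the fingerprints are hypothetical sets of on-bit positions, the names `tanimoto`, `rank_by_similarity`, and `LIBRARY` are invented for this example, and nothing here reproduces PubChem's own search implementation.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen in this sketch for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(query, library, threshold=0.5):
    """Return (name, score) pairs scoring at or above `threshold`, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of on-bit positions), for illustration only.
LIBRARY = {
    "mol_A": {1, 2, 3, 4},
    "mol_B": {1, 2, 3, 9},
    "mol_C": {7, 8, 9},
}

if __name__ == "__main__":
    for name, score in rank_by_similarity({1, 2, 3, 4}, LIBRARY):
        print(name, round(score, 2))
```

Raising `threshold` narrows the hit list to closer analogs, while lowering it casts a wider net.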
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Phenomics assisted breeding in crop improvement
IshaGoswami9
The global population is increasing and is expected to reach about 9 billion by 2050, and climate change makes it even more difficult to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding the complex characteristics of multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
I Introduction
II Subalternation and Theology
III Theology and Dogmatic Declarations
IV The Mixed Principles of Theology
V Virtual Revelation: The Unity of Theology
VI Theology as a Natural Science
VII Theology’s Certitude
VIII Conclusion
Notes
Bibliography
All the contents are fully attributable to the author, Doctor Victor Salas. Should you wish to get this text republished, get in touch with the author or the editorial committee of the Studia Poinsotiana. Insofar as possible, we will be happy to broker your contact.
Toxic effects of heavy metals: Lead and Arsenic
sanjana502982
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g., arsenic, lead, mercury, cadmium, thallium, chromium, etc.
Thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4−0.9 µm) and novel JWST images with 14 filters spanning 0.8−5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at > 2.3 µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and 30.3−31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5−15. These objects show compact half-light radii of R1/2 ∼ 50−200 pc, stellar masses of M⋆ ∼ 10⁷−10⁸ M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilitySciAstra
The Indian Statistical Institute (ISI) has extended its application deadline for 2024 admissions to April 2. Known for its excellence in statistics and related fields, ISI offers a range of programs from Bachelor's to Junior Research Fellowships. The admission test is scheduled for May 12, 2024. Eligibility varies by program, generally requiring a background in Mathematics and English for undergraduate courses and specific degrees for postgraduate and research positions. Application fees are ₹1500 for male general category applicants and ₹1000 for females. Applications are open to Indian and OCI candidates.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
Chemical Structure Standardization and Synonym Filtering in PubChem
1. Chemical Structure Standardization and Synonym Filtering in PubChem
Sunghwan Kim, Ph.D., M.Sc.
ACS National Meeting in San Diego, CA
(August 26, 2019)
3. PubChem
Public chemical information resource
Collects data from 690+ sources
Disseminates data back to the public free of charge
Contains the largest amount of publicly available chemical information
Faces unique challenges in dealing with many big data issues on a daily basis:
• Chemical structure standardization
• Name-structure association clean up
4. Data Organization in PubChem
690+ Data Contributors
Substance deposition:
• Depositor-provided substance descriptions: Substance ID (SID)
• Unique chemical structure extraction through Standardization
• Unique chemical structures: Compound ID (CID)
Assay deposition:
• Depositor-provided Bioactivity test results: Assay ID (AID)
• Activity of tested “substances”
• Activity of “compounds” derived from associated “substances”
6. Data Organization in PubChem
690+ Data Contributors
Substance deposition:
• Depositor-provided substance descriptions: Substance ID (SID)
• Unique chemical structure extraction through Standardization
• Unique chemical structures: Compound ID (CID)
Individual data depositors provide PubChem with:
• Chemical structures
• Chemical names (synonyms)
They need to be organized/cleaned up through:
• Structure standardization
• Synonym filtering
15. Standardization Statistics
• ~90% of the substances are subject to standardization.
• Mostly organic compounds.
• Standardization success rate: 99.64%
• Modification rate: 44.43%
J. Cheminform. (2018) 10:36
16. Standardized Structures
It is not necessarily what one may expect.
Structure labels: Most stable in vacuum; Most stable in water; Standardized by PubChem
17. Standardized Structures
In most cases, tautomeric forms of a molecule are standardized into a single form.
There are a few exceptions.
CID 18630 and CID 31261 (tautomerization)
18. Standardization and Structure Identity Search
You can search PubChem using a structure as a query.
The input structure may be provided:
• using a line notation (e.g., SMILES, InChI)
• using the PubChem Sketcher.
The input structure for identity search is standardized before the search is performed.
Therefore, hits from identity search may have different structures from the original input structure.
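As an illustration of an identity search driven by a line notation, the sketch below only builds a request URL for PubChem's PUG REST service; it does not send anything. The endpoint path (compound/fastidentity/smiles/.../cids/JSON) and the identity_type argument are written from the PUG REST documentation as best recalled and should be checked against it before use. The helper name identity_search_url is hypothetical.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str, identity_type: str = "same_stereo_isotope") -> str:
    """Build (but do not send) an identity-search URL for a SMILES query.

    Assumption: the path and the identity_type argument follow the PUG REST
    documentation; verify both before relying on this.
    """
    # Percent-encode characters such as "=" and "#" that are legal in SMILES
    # but have their own meaning inside a URL.
    encoded = quote(smiles, safe="")
    return (
        f"{PUG_REST}/compound/fastidentity/smiles/{encoded}"
        f"/cids/JSON?identity_type={identity_type}"
    )


print(identity_search_url("CCO"))
```

Because the server standardizes the query first, the CIDs returned for such a request may be drawn differently from the SMILES that was submitted. SMILES strings containing "/" may need to be sent another way (for example in a POST body); the documentation covers those cases.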
25. Unfiltered Depositor-provided synonyms (page 1/3)
Synonym | # SIDs | # CIDs
N/A | 6,869 | 6,368
SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
NULL | 4,610 | 4,599
ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
NOT AVAILABLE | 1,867 | 1,816
LECITHIN | 1,157 | 1,142
DIACYLGLYCEROL | 847 | 842
DIGLYCERIDE | 841 | 841
MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
VITASMLAB | 461 | 461
MIXTURE NAME | 419 | 413
CLA | 770 | 394
CHLOROPHYLL A | 749 | 393
NA | 7,081 | 371
26. Unfiltered Depositor-provided synonyms (page 1/3)
Same table as slide 25, with the annotation:
• Various forms of “Not Available”
28. Unfiltered Depositor-provided synonyms (page 1/3)
Same table as slide 25, with the annotations:
• Various forms of “Not Available”
• Great reduction in the structure count after structure standardization
• SIDs are standardized to Na (sodium)
29. Unfiltered Depositor-provided synonyms (page 1/3)
Same table as slide 25, with the annotation:
• Error messages from name generation software
30. Unfiltered Depositor-provided synonyms (page 1/3)
Same table as slide 25, with the annotation:
• Names of chemical classes
31. Unfiltered Depositor-provided synonyms (page 2/3)
Synonym | # SIDs | # CIDs
(1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
ETHANONE,1- - | 328 | 328
CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
TRIACYLGLYCEROL | 286 | 285
TRIGLYCERIDE | 286 | 285
QUINOLONE DER. | 280 | 279
UNABLE TO GENERATE VALUE | 274 | 264
UNL | 656 | 255
UNKNOWN LIGAND | 615 | 235
HEPT DERIV. | 213 | 211
MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
ACHIRAL CENTER(S) | 187 | 187
32. Unfiltered Depositor-provided synonyms (page 2/3)
Same table as slide 31, with the annotation:
• “Derivative” of a chemical
34. Unfiltered Depositor-provided synonyms (page 3/3)
Synonym | # SIDs | # CIDs
C9H11NO2 | 179 | 174
HEM | 4,645 | 165
BCR | 290 | 160
C10H13NO2 | 161 | 154
BETA-CAROTENE | 298 | 147
C8H10N2O2 | 149 | 144
C10H10N2O2 | 149 | 143
-ACETICACID | 141 | 141
C9H8N2O2 | 143 | 141
PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
C8H9NO2 | 144 | 139
NAG | 9,599 | 130
METHANOL | 247 | 128
C8H9NO3 | 129 | 127
C10H9NO2 | 133 | 126
PYRIDINONE DERIV. | 130 | 126
N. A. | 128 | 125
Annotation:
• Molecular formula
35. Unfiltered Depositor-provided synonyms (page 3/3)
Same table as slide 34, with the annotation:
• Abbreviation for chemical names
36. Unfiltered Depositor-provided synonyms (page 3/3)
Same table as slide 34, with the annotations:
• Abbreviation for chemical names
• Description
37. Unfiltered Depositor-provided synonyms (page 3/3)
Same table as slide 34, with the annotations:
• Abbreviation for chemical names
• Description
• “Not available”
38. Unfiltered Depositor-provided synonyms
Depositor-provided synonyms include:
• Real chemical names
• Abbreviations for chemical names
• “Derivatives” of some chemicals
• Names of chemical classes
• Molecular formulas
• N/A, NULL, Not Available, NA, N.A., etc.
• Error messages or comments
Manual clean-up is not feasible.
PubChem uses crowd-voting-based synonym filtering.
40. PubChem Synonym filtering
Crowd-voting approach
Check for a consensus on the name-structure association between depositors.
Consensus threshold: >60% of the total votes
When a consensus is reached, the synonym is added to the “filtered” synonym list of the corresponding compound (standardized structure).
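The consensus rule can be sketched in a few lines. This is a minimal illustration, not PubChem's implementation: the function name consensus and the plain-string structure labels are assumptions, and only the ">60% of the total votes" threshold comes from the slide.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def consensus(votes: Iterable[Hashable], threshold: float = 0.60) -> Optional[Hashable]:
    """Return the structure holding more than `threshold` of all votes, else None."""
    tally = Counter(votes)
    total = sum(tally.values())
    if total == 0:
        return None
    winner, count = tally.most_common(1)[0]
    # Strictly greater than the threshold, matching ">60% of the total votes".
    return winner if count / total > threshold else None


print(consensus(["CID 2", "CID 2", "CID 1"]))  # 2 of 3 votes
print(consensus(["CID 1", "CID 2"]))           # 1 of 2 votes each
print(consensus(["CID 1"]))                    # a synonym that occurs only once
```

Under this rule a synonym with a single vote always reaches consensus, because nothing disagrees with it.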
41. Synonyms that occur only “once”
Depositor 1: SID 1 (Synonym A), associated with CID 1
No disagreement in the name-structure association.
Synonym A is considered to mean CID 1 (although this may not be correct).
42. Synonyms occurring multiple times
Synonym A is provided for ten substances by four depositors:
• Depositor 1: SID 1
• Depositor 2: SID 2, SID 3, SID 4, SID 5
• Depositor 3: SID 6, SID 7, SID 8
• Depositor 4: SID 9, SID 10
The substances are associated with CID 1, CID 2, and CID 3.
Which one is the best choice?
43. Synonym filtering using crowd voting
Two potential approaches
• Multiple-votes-per-depositor
• Single-vote-per-depositor
44. Multiple-Votes-per-Depositor Strategy
Synonym A is provided for ten substances by four depositors:
• Depositor 1: SID 1
• Depositor 2: SID 2, SID 3, SID 4, SID 5
• Depositor 3: SID 6, SID 7, SID 8
• Depositor 4: SID 9, SID 10
# votes
• CID 1: 3 (30%)
• CID 2: 5 (50%)
• CID 3: 2 (20%)
Consensus Threshold = 60%
No CID exceeds the threshold, so no consensus is reached.
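A small sketch of the multiple-votes-per-depositor tally follows. The slide gives only the totals per CID, so the SID-to-CID assignment in RECORDS is hypothetical, chosen to reproduce the 3/5/2 split; the function name is also an assumption.

```python
from collections import Counter

# (depositor, SID, CID) records for "Synonym A".
# Hypothetical assignment: only the per-CID totals are taken from the slide.
RECORDS = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 2"), ("Depositor 2", 3, "CID 2"),
    ("Depositor 2", 4, "CID 2"), ("Depositor 2", 5, "CID 1"),
    ("Depositor 3", 6, "CID 2"), ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 3"), ("Depositor 4", 10, "CID 1"),
]
THRESHOLD = 0.60


def multiple_votes_per_depositor(records):
    """Every substance (SID) counts as one vote for the CID it is associated with."""
    tally = Counter(cid for _depositor, _sid, cid in records)
    total = sum(tally.values())
    shares = {cid: n / total for cid, n in tally.items()}
    # A CID wins only if its share is strictly above the threshold.
    winner = next((cid for cid, share in shares.items() if share > THRESHOLD), None)
    return tally, shares, winner


tally, shares, winner = multiple_votes_per_depositor(RECORDS)
for cid in sorted(tally):
    print(f"{cid}: {tally[cid]} ({shares[cid]:.0%})")
print("consensus:", winner)
```

With this counting, a depositor that submits many substances carries more weight than one that submits few.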
45. Single-Vote-per-Depositor Strategy
Same diagram as slide 42: Synonym A, SID 1 to SID 10, Depositors 1 to 4, and CID 1 to CID 3 (the “# votes” column is not yet filled in).
Consensus Threshold = 60%
49. Single-Vote-per-Depositor Strategy
Synonym A is provided for ten substances by four depositors:
• Depositor 1: SID 1
• Depositor 2: SID 2, SID 3, SID 4, SID 5
• Depositor 3: SID 6, SID 7, SID 8
• Depositor 4: SID 9, SID 10
# votes
• CID 1: 1 (33%)
• CID 2: 2 (67%)
• CID 3: 0 (0%)
Consensus Threshold = 60%
Consensus has been reached!
Synonym A = CID 2
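The single-vote-per-depositor tally can be sketched as follows. Two things here are assumptions rather than facts from the slides: the SID-to-CID assignment in RECORDS (chosen to reproduce the 1/2/0 result) and the rule that a depositor casts its vote only when more than 60% of its own substances agree on one CID, abstaining otherwise.

```python
from collections import Counter, defaultdict

# (depositor, SID, CID) records for "Synonym A" (hypothetical assignment).
RECORDS = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 2"), ("Depositor 2", 3, "CID 2"),
    ("Depositor 2", 4, "CID 2"), ("Depositor 2", 5, "CID 1"),
    ("Depositor 3", 6, "CID 2"), ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 3"), ("Depositor 4", 10, "CID 1"),
]
THRESHOLD = 0.60


def majority(votes, threshold=THRESHOLD):
    """Return the option holding more than `threshold` of the votes, else None."""
    tally = Counter(votes)
    total = sum(tally.values())
    if total == 0:
        return None
    winner, count = tally.most_common(1)[0]
    return winner if count / total > threshold else None


def single_vote_per_depositor(records):
    by_depositor = defaultdict(list)
    for depositor, _sid, cid in records:
        by_depositor[depositor].append(cid)
    # Assumed rule: one ballot per depositor, cast only if the depositor's
    # own substances agree; a split depositor abstains.
    ballots = [cid for cid in map(majority, by_depositor.values()) if cid is not None]
    return Counter(ballots), majority(ballots)


tally, winner = single_vote_per_depositor(RECORDS)
total = sum(tally.values())
for cid in ("CID 1", "CID 2", "CID 3"):
    print(f"{cid}: {tally[cid]} ({tally[cid] / total:.0%})")
print("consensus:", winner)
```

In this assignment Depositor 4 is split evenly and abstains, so three ballots are cast and CID 2 receives two of them.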
51. Additional consideration: Different contexts of chemical sameness
Abbr. | CACTVS hash code used | Description
CID | CID hash code | Connectivity + isotopes + stereochemistry
STE | CID stereo hash code | Connectivity + stereochemistry
CON | CID connectivity hash code | Connectivity
PCID | Parent CID hash code | CID of the parent compound
PSTE | Parent CID stereo hash code | STE of the parent compound
PCON | Parent CID connectivity hash code | CON of the parent compound
In practice, synonym filtering uses CACTVS hash codes (instead of CID) to determine whether a consensus is reached or not.
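To show why the context of sameness matters, the sketch below votes on the same three records at three levels of detail. Real CACTVS hash codes are computed from the structure; here plain tuples of made-up labels stand in for them, and the Structure type, the KEYS mapping, and the example votes are all hypothetical. The parent-compound variants (PCID, PSTE, PCON) are left out.

```python
from collections import Counter
from typing import NamedTuple


class Structure(NamedTuple):
    connectivity: str  # stand-in for the atom/bond graph
    stereo: str        # stand-in for stereo descriptors
    isotopes: str      # stand-in for isotopic labels


# Which layers each context compares, following the table on this slide.
KEYS = {
    "CID": lambda s: (s.connectivity, s.isotopes, s.stereo),
    "STE": lambda s: (s.connectivity, s.stereo),
    "CON": lambda s: (s.connectivity,),
}


def consensus_at(level, structures, threshold=0.60):
    """Vote on structures compared at the given level of sameness."""
    if not structures:
        return None
    tally = Counter(KEYS[level](s) for s in structures)
    key, count = tally.most_common(1)[0]
    return key if count / len(structures) > threshold else None


# Hypothetical votes for one synonym: one skeleton, two stereo forms,
# and one isotopically labelled variant.
votes = [
    Structure("skeleton-X", "R", ""),
    Structure("skeleton-X", "S", ""),
    Structure("skeleton-X", "R", "2H"),
]
for level in ("CID", "STE", "CON"):
    print(level, consensus_at(level, votes))
```

In this example the votes disagree at the full CID level, agree by two to one once isotopes are ignored, and agree unanimously on connectivity alone.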
55. Limitations of Synonym Filtering
1. Synonym filtering focuses on consistency, not correctness.
• It resolves the discrepancies in name-structure associations within & between depositors.
• It does not mean that filtered synonyms are correct.
Example: Fentin acetate (CID 16682804)
Its filtered synonyms include:
• m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone
• Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone
58. Limitations of Synonym Filtering
1. Synonym filtering focuses on consistency, not correctness.
• Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources).
• Erroneous data in one source propagate into other sources.
• This practice lets incorrect name-chemical associations get more votes than they should during the synonym filtering process.
59. Limitations of Synonym Filtering
2. More than 90% of depositor-provided synonyms occur only once.
• They are automatically assigned to the structures represented by their corresponding CIDs.
63. Summary
PubChem contains a large amount of chemical information provided by 690+ data sources.
Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures.
PubChem uses crowd-voting-based synonym filtering to clean up name-structure associations provided by depositors.
64. Acknowledgements
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
The PubChem Team
PubChem depositors, users, and collaborators
Funded by the National Library of Medicine