SureChem ACS 2012. Presented by Nico on behalf of all three authors. The data is searchable at https://open.surechem.com/login. Related information includes recent posts at http://cdsouthan.blogspot.se/
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
Dr. Haxel Consult
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
As of August 2017, the major automated patent chemistry extractions (in ascending size: NextMove, SCRIPDB, IBM and SureChEMBL) are submitters for 21.5 million CIDs out of the PubChem total of 93.8 million. The following aspects will be expanded in this presentation, starting with the advantages: a) while the relative coverage between open and commercial sources is difficult to determine (PMID 26457120), it is clear that the majority of patent-exemplified structures of medicinal chemistry interest (i.e. from C07 plus A61) are now in PubChem; b) this allows most first-filings of lead series and clinical candidates to be tracked; c) the PubChem toolbox has query, analysis, clustering and linking features that are difficult to match in commercial sources; d) many structures can be associated with bioactivity data; e) connections between manually curated papers and patents can be made via the 0.48 million CID intersects with ChEMBL.
A closer look also reveals disadvantages: a) extraction coverage is compromised by dense image tables and the poor OCR quality of WO documents; b) SureChEMBL is the only major open pipeline running continuously in situ, but it has a PubChem updating lag; c) automated extraction generates structural “noise” that degrades chemistry quality; d) PubChem patent document metadata indexing is patchy (although better for SureChEMBL in situ); e) nothing in the records indicates IP status; f) continual re-extraction of common chemistry results in over-mapping (e.g. 126,949 patents for aspirin and 14,294 for atorvastatin); g) authentic compounds are contaminated with spurious mixtures and never-made virtuals, including thousands of deuterated drugs; h) linking between assay data and targets is still a manual exercise.
However, all things considered, the PubChem patent “big bang” presents users with the best of both worlds (PMID 26194581). Academics or smaller enterprises who cannot afford commercial solutions can now patent-mine extensively. For those who do have such subscriptions, PubChem has become an essential complementary source for the analysis of patent chemistry and associated bio entities such as diseases and drug targets.
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
Dr. Haxel Consult
Fernando Huerta (RISE Bioscience & Materials, SE)
Alexander Minidis (Collaborative Drug Discovery - CDD VAULT, Sweden)
How much information do scientists need to design new potential drugs?
A thorough overview of public (open access) scientific information sources and of methods to collect, process, analyse and visualize this information will be presented. A direct application of such freely available information in conjunction with freeware will be described in relation to the efforts of the scientific community to find effective medicines for the Zika virus.
Exploiting PubChem for drug discovery based on natural products
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem is one of the largest sources of publicly available chemical information, with more than 242.3 million depositor-provided substance descriptions, 94.7 million unique chemical structures, and 234.8 million bioactivity outcomes from 1.25 million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery based on natural products.
PubChem contains a large amount of bioactivity data, most of which are generated from high-throughput screening (HTS). However, these data also include a substantial amount of bioactivity information extracted from scientific articles published in journals in the chemical biology, medicinal chemistry, and natural product domains, thanks to data contribution by other databases like ChEMBL, Guide to Pharmacology, BindingDB, and PDBbind. In addition, through data integration with other databases such as DrugBank, HSDB, and HMDB, PubChem contains a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity search, 2-D and 3-D similarity searches, substructure and superstructure searches, and molecular formula search. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem’s data with their own in-house data on a local computing machine.
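As an illustration of the programmatic access mentioned above, the following is a minimal sketch of how a PUG-REST request URL might be assembled for an automated workflow. The helper name `compound_property_url` is hypothetical and not part of any PubChem library; the sketch only builds the URL and leaves the actual HTTP request (and any rate limiting) to the caller.

```python
from urllib.parse import quote

# Base address of the PUG-REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties, output="JSON"):
    """Build a PUG-REST URL that looks up a compound by name and
    requests the given computed properties (hypothetical helper)."""
    if not properties:
        raise ValueError("at least one property is required")
    return "/".join([
        PUG_REST_BASE,
        "compound",
        "name",
        quote(name, safe=""),   # percent-encode spaces and slashes in names
        "property",
        ",".join(properties),   # properties are comma-separated in the path
        output,
    ])


url = compound_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)
```

The resulting URL can be fetched with any HTTP client; keeping URL construction separate from the network call makes the workflow easier to test offline.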
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
Dr. Haxel Consult
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where the disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands-on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
ChEMBL and KNIME provide an ideal match of open data with open tools. This is a quick overview of how to access ChEMBL data resources and web services (ChEMBL, UniChem, Beaker, myChEMBL, SureChEMBL) via the KNIME platform.
CINF 55: SureChEMBL: An open patent chemistry resource
George Papadatos
SureChEMBL (https://www.surechembl.org) is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline leverage a number of technologies for name to structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data exchange and exporting options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
Searching for patent information in PubChem
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource, containing more than 242 million chemical substance descriptions, 94 million unique compounds, and 234 million bioactivities determined from 1.25 million assay experiments. Importantly, data contribution from multiple sources, including IBM, SureChEMBL, ScripDB, NextMove, and BindingDB, allows PubChem to provide links to patent documents that mention chemicals. Currently, PubChem offers links between about 6.7 million patent documents and more than 20 million unique chemical structures, with over 137 million compound-patent links, covering primarily U.S. patents, with some European, World Intellectual Property Organization, and Japanese patent documents. This presentation will provide an overview of the patent information in PubChem as well as best practices for using it.
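The compound-patent links described above can also be retrieved programmatically. The sketch below assumes the PUG-REST cross-reference route with the `PatentID` xref type and assumes the JSON reply nests its records under `InformationList` and `Information`; both should be checked against the current PUG-REST documentation. The function names and the sample payload (with placeholder identifiers, not real patent numbers) are hypothetical and used only for illustration.

```python
import json

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def patent_xref_url(cid):
    """Build the PUG-REST URL listing patent IDs linked to a CID
    (assumes the 'PatentID' cross-reference type)."""
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/xrefs/PatentID/JSON"


def patent_ids(payload):
    """Collect unique patent IDs from a parsed xrefs response.

    Assumed response shape:
    {"InformationList": {"Information": [{"CID": ..., "PatentID": [...]}]}}
    """
    records = payload.get("InformationList", {}).get("Information", [])
    found = []
    for record in records:
        found.extend(record.get("PatentID", []))
    # Heavily patented compounds can repeat IDs, so deduplicate and sort.
    return sorted(set(found))


# Hypothetical payload with placeholder identifiers, for offline illustration.
sample = json.loads(
    '{"InformationList": {"Information": '
    '[{"CID": 2244, "PatentID": ["EXAMPLE-B", "EXAMPLE-A", "EXAMPLE-B"]}]}}'
)

print(patent_xref_url(2244))  # CID 2244 is aspirin in PubChem
print(patent_ids(sample))
```

For commonly patented compounds the returned list can be very long, so a real workflow would likely filter or page the results rather than print them.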
With the unprecedented growth of chemical databases incorporating up to several hundred billion synthetically feasible chemicals, modelers face no shortage of chemicals to process. Importantly, such "Big Chemical Data" offers enormous opportunities for discovering novel bioactive molecules. However, the current generation of cheminformatics software tools is not capable of handling, characterizing, and processing such extremely large chemical libraries. In this presentation, we will discuss the rationale and the main challenges (theoretical and technical) for screening very large repositories of compounds in the current context of drug discovery. We will present several proof-of-concept studies regarding the screening of extremely large libraries (1+ billion compounds) using our novel GPU-accelerated cheminformatics platform to identify molecules with defined bioactivity. Overall, we will show that GPU computing represents an effective and inexpensive architecture to develop, employ, and validate a new generation of cheminformatics methods and tools ready to process billions of compounds.
Background of the project and simple use cases of using the Open PHACTS API and KNIME to extract compound, target and indication entities from millions of patent documents and infer meaningful links among them. Open PHACTS Linked Data meeting in Vienna.
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
In 2012, after the first IBM deposition, few would have predicted that PubChem compounds that included patent-extracted structures would exceed 20 million within three years (i.e. 30% of the total). The current major open patent chemistry submitters (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. This “big bang” has a range of utilities and implications. Firstly, pharmaceutical companies must now integrate their exploitation of both public and commercial patent chemistry because capture is divergent. Secondly, the academic community and small companies can now patent-mine extensively without commercial sources. Thirdly, first-filings of most lead series and clinical candidates can now be tracked. Fourthly, drug targets in ChEMBL can be intersected with Structure Activity Relationship (SAR) data sets from patents, some of which are now target-mapped in other databases (doi:10.1016/j.ddtec.2014.12.001). However, while this patent chemistry “big bang” is generally welcomed by database users, there are significant caveats. In particular, both automated and manual extraction bring in a variety of artefacts that add confounding structural “noise”. The caveats include a) permutations of mixtures and chiral exemplifications, b) virtual structures (including isotopic analogues of approved drugs), c) an emerging trend of vendor “patent picking” for non-stocked compounds, d) the absence of biological data links for 85% of public patent chemistry, and e) the fact that extractions from documents do not directly indicate IP status. These problems and some partial solutions will be discussed.
An 11-year-old presentation submitted as project work: Golden Mantra to Perform Worldwide Patent Searches
A patent provides the right to exclude others from making, using, selling, offering for sale, or importing the patented invention for the term of the patent, usually 20 years from the filing date. A patent is, in effect, a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public. Like any other property right, it may be sold, licensed, mortgaged, assigned or transferred, given away, or simply abandoned.
In order to obtain a patent, an applicant must provide a written description of his or her invention in sufficient detail for a person skilled in the art (i.e., the relevant area of technology) to make and use the invention.
PubChem for chemical information literacy training
Sunghwan Kim
Presented at the American Chemical Society Fall 2021 National Meeting (August 23, 2021; virtual).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that collects chemical information from 780+ data sources. It is visited by millions of users every month, many of whom are undergraduate or graduate students at academic institutions. While PubChem has great potential as an online resource for chemical education, it also raises important issues that are unfamiliar to students and educators, including data accuracy, data provenance, structure standardization, and terminology. In this presentation, various aspects of PubChem as a chemical education resource will be discussed, with a special emphasis on how to help students develop chemical information literacy skills.
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
ChemAxon
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline make extensive use of ChemAxon technologies for name to structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data exchange and exporting options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
PubChem: a public chemical information resource for big data chemistry
Sunghwan Kim
Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.
==== Abstract ====
The idea of “big data” has recently been drawing much attention from the scientific community as well as the general public. An example of big data in chemistry is the data contained in PubChem, which is a public database of chemical substance descriptions and their biological activities at the National Institutes of Health. PubChem is a sizeable system with 235 million depositor-provided substance descriptions, 96 million unique chemical structures, 1.1 million biological assays, and 268 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents and more. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering multi-target ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of how PubChem’s data, tools, and services can be used for bioassay data analysis and virtual screening (VS) and discusses important aspects of exploiting PubChem for drug discovery.
Presented at the Bioinformatics Seminar at the University of Arkansas, Little Rock on November 5, 2021.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical database at the National Library of Medicine, National Institutes of Health. Arguably, PubChem is one of the largest chemical information resources in the public domain, with 111 million unique chemical structures, 1.39 million biological assays, and 292 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem is a key resource for big data in chemistry and has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). It has also been used for cheminformatics education as well as chemical health and safety training. This presentation provides a high-level overview of PubChem’s data, tools, and services.
The internet has changed the way we access chemistry data and has opened access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. Because of this noise, the value of the larger databases becomes highly dependent on the specific application. One example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Quality and noise in big chemistry databases Chris Southan
Presented at Aug 2019 ACS by Antony Williams.
Presented online at KSEA - Virginia Washington Metro Regional Conference 2020 (VWMRC 2020) (May 9, 2020)
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource, visited by millions of unique users per month. It contains chemical data from more than 700 data sources and disseminates these data to the public free of charge. Arguably, it is the largest source of publicly available chemical information, containing more than 250 million depositor-provided substance descriptions, 100 million unique chemical structures, and 260 million bioactivity outcomes from one million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery.
The immense quantity of bioactivity data in PubChem can be used to develop computational models to predict bioactivities of small molecules. While these data are primarily generated from high-throughput screening (HTS), they also include a substantial amount of bioactivity information extracted from peer-reviewed journal articles. In addition, through data integration with other databases, PubChem has a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity, 2-D and 3-D similarity, substructure, superstructure, and molecular formula. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem data with their own.
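As a minimal sketch of the programmatic access mentioned above, the snippet below builds a PUG-REST request URL for compound properties. The base URL and the `compound/<namespace>/<identifier>/property/<list>/<format>` path pattern follow the public PUG-REST documentation; the helper name `property_url` and its parameters are illustrative assumptions, and no network request is made here.

```python
from urllib.parse import quote

# Public PUG-REST entry point.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(identifier, properties, namespace="name", output="JSON"):
    """Build a PUG-REST URL requesting properties for one compound.

    `property_url` is a hypothetical helper, not part of any PubChem library.
    `namespace` is the identifier type, e.g. "name" or "cid".
    """
    ident = quote(str(identifier), safe="")  # escape spaces, slashes, etc.
    props = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/{namespace}/{ident}/property/{props}/{output}"


if __name__ == "__main__":
    # Look up aspirin by name, then by its PubChem compound identifier.
    print(property_url("aspirin", ["MolecularFormula", "MolecularWeight"]))
    print(property_url(2244, ["InChIKey"], namespace="cid"))
```

The resulting URL can be fetched with any HTTP client; a real workflow would also need to respect PubChem's request-rate limits.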
Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
PubChem as a resource for chemical information training Sunghwan Kim
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (March 31, 2019). [CINF 13]
==== Abstract ====
Libraries at many large academic institutions provide chemical information training programs for students. However, these programs are based on commercial chemical information resources, which come with non-trivial subscription fees. These fees are often too expensive for small organizations, including many primarily undergraduate institutions (PUIs) and community colleges (CCs). It leads to disparity in access to chemical information as well as learning opportunities among students. This issue may be addressed at least in part by developing free online training programs based on public chemical databases, such as PubChem (https://pubchem.ncbi.nlm.nih.gov). PubChem has a great potential as an online resource for chemical education, but it also has important issues that students and teachers should keep in mind, such as data accuracy, data provenance, structure standardization, terminologies and so on. In this presentation, we will discuss various aspects of PubChem as a resource for chemical information training.
PubChem for drug discovery in the age of big data and artificial intelligence Sunghwan Kim
Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 10, 2021).
==== Abstract ====
With the emergence of the age of big data and artificial intelligence, biomedical research communities have a great interest in exploiting the massive amount of chemical and biological data available in the public domain. PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the largest sources of publicly available chemical information, with over 270 million substance descriptions, over 110 million unique compounds, and over 285 million bioactivity outcomes from more than one million biological assay experiments. PubChem provides a wide range of chemical information, including structure, pharmacology, toxicology, drug target, metabolism, chemical vendors, patents, regulations, clinical trials, and many others. These contents can be accessed interactively through web browsers as well as programmatically using computer scripts. They can also be downloaded in bulk through the PubChem File Transfer Protocol (FTP) site. PubChem data have been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of PubChem data, tools, and services useful for drug discovery.
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets Kemele M. Endris
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for the citizens. However, effective data-centric applications demand data management techniques able to process a large volume of data which may include sensitive data, e.g., financial transactions, medical procedures, or personal data. Managing sensitive data requires the enforcement of privacy and access control regulations, particularly, during the execution of queries against datasets that include sensitive and non-sensitive data. In this paper, we tackle the problem of enforcing privacy regulations during query processing, and propose BOUNCER, a privacy-aware query engine over federations of RDF datasets. BOUNCER allows for the description of RDF datasets in terms of RDF molecule templates, i.e., abstract descriptions of the properties of the entities in an RDF dataset and their privacy regulations. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over RDF datasets that not only contain the relevant entities to answer a query, but that are also regulated by policies that allow for accessing these relevant entities. We empirically evaluate the effectiveness of the BOUNCER privacy-aware techniques over state-of-the-art benchmarks of RDF datasets. The observed results suggest that BOUNCER can effectively enforce access control regulations at different granularity without impacting the performance of query processing.
Presented to David Gloriam's Group, Copenhagen, Feb 2020
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in limbo land between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and exact positions of radiolabels are also problematic. However, target-mapped citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
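For readers less familiar with the cheminformatics side of this comparison, the Tanimoto coefficient can be sketched as below. This is an illustrative sketch only: fingerprints are represented as plain sets of "on" bit positions, the bit values are invented, and returning 0.0 for two empty fingerprints is an assumed convention. It is not the fingerprinting scheme used by PubChem, ChEMBL or GtoPdb.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared features divided by total distinct features."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical fingerprints: positions of the bits that are set.
    fp_1 = {1, 2, 3, 4}
    fp_2 = {3, 4, 5}
    print(tanimoto(fp_1, fp_2))  # 2 shared bits out of 5 distinct bits
```

Because the score depends only on shared substructure features, long peptides that differ by a single residue can score as near-identical, which is one way to see why fingerprint similarity alone is a blunt tool for peptide searching.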
Vicissitudes of target validation for BACE1 and BACE2 Chris Southan
Introduction/Background & Aims
The beta-amyloid (APP) cleaving enzyme (BACE1) was implicated as a drug target for Alzheimer's Disease (AD) back in 1999. In 2011, the paralogue, BACE2, became a new proposed target for type II diabetes (T2DM) having been reported to be the TMEM27 secretase regulating pancreatic beta-cell function [1]. By 2019 the accumulated evidence, including a swathe of failed clinical trials for BACE1 inhibitors, has produced a de facto de-validation of both targets in both diseases. As a learning exercise, the series of events leading up to this is reviewed here.
Method/Summary of work
Basic information about these two targets and the lead compounds against them were sourced via the IUPHAR/BPS Guide to Pharmacology (GtoPdb) as Target ids: 2330 and 2331, for BACE1 and 2, respectively. This was consolidated by a literature and patent review as well as following them in other databases. The most recent information on clinical trials was sourced from press releases.
Results/Discussion
GtoPdb annotates 24 lead compounds against BACE1 and 12 against BACE2. The corresponding counts mapped to these targets in ChEMBL are 8741 and 1377, making BACE1 one of the most actively pursued enzyme targets ever. Notwithstanding this massive global effort, during 2018 Merck’s verubecestat and J&J’s atabecestat BACE1 inhibitors not only failed their Phase III endpoints but even appeared to worsen cognition in prodromal patients. In 2019 Amgen/Novartis stopped Phase II/III trials of umibecestat, which also showed more cognitive decline in the treatment group than in controls. BACE2 presented an anomalous situation in several ways. By 2016 both Novartis and Amgen declared their inability to reproduce the TMEM27 secretase turnover reported in 2011. Notwithstanding, Novartis and other companies have published patents on BACE2-specific inhibitors over several years and, paradoxically, verubecestat is more potent against BACE2 than BACE1 but was never tested for glucose-lowering. Equally puzzling is that one academic group is still publishing BACE2 inhibitors for T2D even post de-validation. One thing both targets have in common is the complete absence of genetic support from genome-wide disease association studies, but this warning sign went unheeded.
Conclusions
The massive waste of resources on the pursuit of BACE1 as an AD target over the last two decades is catastrophic. This tale of de-validation is compounded for this paralogous pair of enzymes by the fact that the original evidence for BACE2 as a T2D target was eventually refuted. The story of these targets highlights a range of crucial pharmacological pitfalls that must be avoided in the future.
Reference(s)
[1] Southan C, Hancock J.M. (2013) A tale of two drug targets: the evolutionary history of BACE1 and BACE2. Front Genet. 4:293.
In silico 360 Analysis for Drug Development Chris Southan
Introduction:
Consequent to a memorandum of understanding between the Karolinska Institutet and the International Union of Basic and Clinical Pharmacology (IUPHAR) in 2018, a report on academic drug development, including guidelines (ADEV), has been drafted [1]. As part of this exercise, we conceived a triage for comprehensive informatics profiling around the compound, target, disease axis. We have termed this “in silico 360” (INS360), the aim of which was to support ADEV teams since they may lack either internal expertise or external support to do this on their own. Indeed, some past SciLifeLab Drug Discovery and Development Platform projects had been halted because of overlooked competitive impingements or insufficient target validation evidence.
Methods
We assessed the current database landscape, mostly public but including commercial, for potential utility for INS360. We were guided primarily by content coverage, usability, and reputation. We also explored some open property prediction resources for assay interference and toxicological inferences.
Results:
As a first-stop-shop, we selected the IUPHAR/BPS Guide to PHARMACOLOGY with ~900 ligand-target relationships captured via expert curation of journal papers. Moving up in scale, we evaluated ChEMBL at 1.8 million compounds with 1.1 million assay descriptions and 7,000 targets. With yet another jump we could search the patent corpus with 18 million extracted compounds in SureChEMBL. We explored PubChem, which integrates these three with over 500 other sources linked to 96 million compounds, BioAssay results and connectivity into the NCBI Entrez system. The final jump in scale for document-to-chemistry navigation was represented by SciFinder with 155 million structures. On the target side, 360-exploration needs to encompass literature, structure, genetic variation, splicing, interactions, and disease pathways. From their UniProt links, both GtoPdb and ChEMBL provide these entry points. Navigating genetic association data in support of target validation was enabled by the OpenTargets portal and the GWAS Catalog. We also found servers that could produce prediction scores from chemical structures for a range of features important for de-risking development.
Conclusion:
This work scoped out initial resource choices for the INS360. We propose that not only ADEV operations but essentially any pharmacology research team has much to gain from this approach and many potential pitfalls can consequently be avoided when approaching key checkpoints, such as preparing a publication. However, support may be needed for both institutions and teams to get the best out of these complex and feature-rich databases.
[1] Southan C, (2019) Towards Academic Drug Development Guidelines, ChemRxiv pre-print no. 8869574
Will the correct BACE ORFs please stand up? Chris Southan
BACE1 and BACE2 are protease targets for Alzheimer's and diabetes, respectively, but their validation is now questioned
Phylogenetic analysis can add functional insights
This came up against two key problems
A surprising prevalence of incorrect protein sequences predicted from genomes
Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
Key phylogenetic representative genomes are languishing in an unfinished state
Some options for amelioration of these problems will be described
An update on the evolution of these enzymes will be shown
Look for new and potentially useful human 5HT2A-directed small-molecule chemistry surfaced since the last meeting. Check for compounds with 5HT2A as the primary target but also combined inhibitors. Poll round the key databases, literature and patents. Searching challenges arise from synonym soup, complex cross-reactivities (see PMID 29679900), in vitro data gaps and in vivo polypharmacology.
Looking at chemistry - protein - papers connectivity in ELIXIR Chris Southan
This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It is the summary of a blog post, https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html, that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources
Poster for World Congress of Pharmacology 2018, Kyoto
Introduction: The pharmacological literature and patents connect compound structures to their bioactivity. However, entombing these relationships for millions of compounds among millions of PDFs is acknowledged as massively problematic. The situation is ameliorated by resources that extract the entity and data relationships the authors and inventors put “in” to their PDFs back “out” into structured database records. The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) has been doing this by stringent curation of ligands and their quantitative activity against protein targets [1]. Our citations are submitted to PubChem (PC), who then link to PubMed (PM) [2]. This study presents an overview of this connectivity.
Methods: For GtoPdb entries in PC Substance we used the PC interface to count our submitted PM links. This gives the PC > PM mapping counts from which we analysed the PM links. We then performed reciprocal analyses (i.e. PM > PC) by selecting PM sets. We then compared two journals by counting structure links by year and source.
Results: From 8988 GtoPdb-submitted ligand substances in PC (release 2017.5), 7309 are linked to 8980 PM entries. Of the 7309 there are 5632 links to chemical structures in PC the rest being antibodies and larger peptides. From the 8980 PMIDs, the Journal of Medicinal Chemistry (JMC) accounted for 1003 as our most frequently cited primary source of structure-to-activity mappings. For the British Journal of Pharmacology (BJP) most of the 345 cross-references were development compounds. Further analysis showed that from 2014 to 2017 the BJP to PC links of ~ 30 structures per year are mostly from GtoPdb and the Comparative Toxicology Database. However, going back to 2010-12, this increased to 500-800 connections, mainly derived from the IBM automated chemical extraction from abstracts. A similar pattern was observed for JMC.
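The proportions implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the Results; the helper name `pct` and the dictionary keys are illustrative assumptions.

```python
# Counts quoted in the Results paragraph (PubChem release 2017.5).
counts = {
    "gtopdb_substances": 8988,      # GtoPdb-submitted ligand substances
    "substances_with_pubmed": 7309, # substances linked to PubMed entries
    "with_structures": 5632,        # of those, linked to chemical structures
    "pmids": 8980,                  # distinct PubMed entries linked
    "jmc_pmids": 1003,              # PMIDs from J. Med. Chem.
}


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(pct(counts["substances_with_pubmed"], counts["gtopdb_substances"]))
    print(pct(counts["with_structures"], counts["substances_with_pubmed"]))
    print(pct(counts["jmc_pmids"], counts["pmids"]))
```

On these figures roughly four in five submitted substances carry a PubMed link, about three quarters of those map to a chemical structure, and the Journal of Medicinal Chemistry contributes about one in nine of the cited PMIDs.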
Conclusion: Navigation between documents and databases is an essential competence in pharmacology and drug discovery, but the NCBI Entrez system is daunting. GtoPdb is a major contributor of high-quality links and provides a first stop to guide users into the PC/PM systems. However, our results indicated potentially serious specificity issues with automated chemistry-to-journal linking from non-GtoPdb sources.
References: [1] Harding et al. (2018). Nucl. Acids Res. 45 (Database Issue), doi: 10.1093/nar/gkx1121.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous, plasma streams from coronal holes and slow-speed, highly variable, streams whose source regions are under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4−0.9 µm) and novel JWST images with 14 filters spanning 0.8−5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at > 2.3 µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and 30.3−31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200 pc, stellar masses of M⋆ ∼ 10⁷−10⁸ M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN. Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is the branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
1. Pros and cons of 23 million
patent-extracted structures in PubChem
Christopher Southan, Senior Cheminformatician, IUPHAR/BPS Guide to
Pharmacology, Discovery Brain Sciences, University of Edinburgh, UK.
ACS Boston, Sunday Aug 19th 2018, Chemical Structure Searching for Patent Information
Session, 2:15 PM - 2:45 PM, Harbor Ballroom III - Westin Boston Waterfront
https://www.slideshare.net/cdsouthan
2. Abstract (will not be shown)
As of March 2018, the major automated patent chemistry extractions (in ascending size: NextMove,
SCRIPDB, IBM and SureChEMBL) cover 22.17 million CIDs from the PubChem total of 94.7 million. These
have become hugely enabling, with advantages including: a) the majority of patent-exemplified
structures of medicinal chemistry interest are now in PubChem, b) first-filings of lead series and
clinical candidates can be tracked, c) the PubChem toolbox has features difficult to match in
commercial sources, d) many structures can be associated with bioactivity data, e) connections
between papers and patents can be made via ChEMBL entries, f) BindingDB has accumulated a valuable
collection of manual SAR extraction from US patents that can be intersected with the automatically
extracted structures and g) coverage for some documents approaches that of SciFinder. However,
there is a range of disadvantages and caveats associated with automated extraction. These include:
a) coverage is compromised by dense image tables, Markush nesting and poor OCR quality of WO
documents, b) as the major pipeline in situ, SureChEMBL can have a PubChem updating lag of some
months, c) automated extraction generates structural “noise” that degrades chemistry quality,
mainly from the conversion of split IUPAC strings, d) PubChem patent document indexing is patchy,
e) nothing in the records actually indicates IP status, f) continual re-extraction of common
chemistry results in irrelevant structure-to-document associations (e.g. 126,949 patents for
aspirin), g) authentic compounds are contaminated with spurious mixtures of various types as well
as never-made virtuals (surprisingly, these include 44K deuterated drug analogues) and h) outside
the BindingDB set, linking between SAR data and targets from recent filings is still a manual
exercise, but examples will be shown of how this can be done. Searching with SureChEMBL as an entry
portal, moving from intra-document chemistry exemplifications out to PubChem, will be demonstrated,
including the advantages of structure clustering. Balancing the pros and cons indicates that the
PubChem patent extraction “big bang” over the last five years presents users with the best of both
worlds. Academics can now patent mine extensively and PubChem has become an essential adjunct to
commercial sources of patent chemistry and associated bio entities such as diseases and drug targets.
3. Introduction and outline
• I assume general awareness of patent chemistry value, database chemistry
searching and SAR mapping to targets (background refs in final slide)
• Since PubChem is free there are no serious “cons”, so these slides are better
classified as caveats and gotchas
• Note these related presentations: “Structure searching for patent information:
The need for speed” (May) 13:45 CINF 35; “Automating chemical structure and
inhibition data extraction from patents” 2:45 CINF 37, Hinton; “Searching for
patent information in PubChem” Kim et al., 3:30 CINF 38; “Beyond journal
articles – extracting bioactivity data from patents” Gaulton et al., 9:00 CINF 116
• These slides will cover: source numbers, source intersects, fragmentation,
virtuals, clustering, relative coverage of drug sources, BindingDB, common
chemistry, loose ends of pros and cons, summary and further info
4. Snapshot Aug 2018: PubChem 96.5 mill
• Major sources are Chemical Named Entity Recognition (CNER) pipelines
• Thomson Pharma (2006-2016 R.I.P.) manual extraction of 4.3 million CIDs from
patents and papers, would probably add ~ 1.0 mill patent structures
• 24% of PubChem CIDs include at least one patent extraction SID
• There are 49% single-source CIDs in PubChem
• 26% (12.6 mill) of these come from patent sources
• ~ 1.2 SID:CID ratio
• Note NextMove SIDs have had synthesis data extracted (PMID: 27028220)
6. Patent CIDs by year (cumulative)
• SureChEMBL is the only major source regularly updating
• But gotchas in exact load times (e.g. as of 04 Aug):
– In situ; WO chemistry downloadable ~ 1 week post-publication
– In UniChem, 27 July 2018 update = 19,648,678
– In PubChem, load date 23 June 2018 = 18,415,971 CIDs
• Will there be a post-2017 IBM refresh?
7. Pro: divergence, cons: this has ceased
and remains largely unexplained
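[Venn diagram: CID overlap between IBM, SCRIPDB and SureChEMBL, in millions of CIDs]
IBM = 10.7 mill
SCRIPDB = 4.0 mill a one-off from
SureChEMBL = 17.6 mill
Region values as extracted: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
Union = 21.7
3-way = 2.4
3 + 2-way = 8.1
Unique = 13.5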
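The Venn figures on this slide can be cross-checked with simple arithmetic. The sketch below is not from the slides: the assignment of each extracted number to a region is an assumption, chosen because it is the one that reproduces the stated source totals. Under that assumption the sums land within 0.1 mill of the slide values, which is consistent with rounding.

```python
# Region counts in millions of CIDs, read off the Venn diagram.
# ASSUMPTION: which number belongs to which region is inferred from the
# totals on the slide; the slide transcript does not state it.
unique = {"SureChEMBL": 10.1, "IBM": 2.9, "SCRIPDB": 0.5}
two_way = {
    frozenset({"IBM", "SureChEMBL"}): 4.7,
    frozenset({"IBM", "SCRIPDB"}): 0.6,
    frozenset({"SureChEMBL", "SCRIPDB"}): 0.4,
}
three_way = 2.4


def source_total(name):
    """Sum every region that lies inside the named source's circle."""
    shared = sum(value for pair, value in two_way.items() if name in pair)
    return unique[name] + shared + three_way


unique_total = sum(unique.values())                 # slide: Unique = 13.5
shared_total = sum(two_way.values()) + three_way    # slide: 3 + 2-way = 8.1
union_total = unique_total + shared_total           # slide: Union = 21.7

for name in unique:
    print(f"{name}: {source_total(name):.1f} mill")
print(f"Unique = {unique_total:.1f}, 3 + 2-way = {shared_total:.1f}, "
      f"Union = {union_total:.1f}")
```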
8. Con: CNER fragmentation and mixtures
[Chart legend: ChEMBL + Thomson Pharma manual extraction; Patent CNER sources]
• Low shoulder includes split IUPACs, Markush bits, synthetic schema from single
images and mixture splits
• High shoulder peptide drop-off?
10. Pro: PubChem “slice ‘n dice” features
• Some PubChem functionality may be difficult to mimic in commercial databases
• Powerful similarity “walking” between patents, papers, BioAssays, structures, vendors, etc.
11. Pro: manual SAR extraction > BindingDB > PubChem
• 151,314 structures from 2098 USPTO patents, 2013 - 2018 (via CWUs)
• 146,751 patent-only
• Subsumed by ChEMBL at release time (e.g. ChEMBL 24 has 74,050 of these)
12. Common-chemistry-to-many-documents
(futile indexing)
• PubChem aspirin (CID 2244) linked to 134,286 patent documents
• SureChEMBL aspirin structure search gives 401,341 document matches
• SureChEMBL 78,351 document links for aspirin name search
• SureChEMBL aspirin structure search, restricted to WO-only, claims
section and 2018, gives 152 documents
• SciFinder 8,985 patent references for aspirin by name or structure
• Below; corpus count (x-axis) vs compounds (y-axis) for US9181236
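A count like the PubChem figure for CID 2244 could in principle be re-derived programmatically. The sketch below assumes the PUG REST `xrefs/PatentID` operation and the JSON layout noted in the comments, both taken from the PubChem PUG REST documentation rather than from these slides. The helper names are hypothetical, and current counts will differ from the 2018 snapshot.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def patent_xref_url(cid):
    """Build the PUG REST URL listing patent cross-references for one CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/xrefs/PatentID/JSON"


def count_patent_links(cid, timeout=60):
    """Fetch the cross-references and count the patent identifiers.

    ASSUMPTION: the response has the shape
    {"InformationList": {"Information": [{"CID": ..., "PatentID": [...]}]}}.
    For heavily indexed compounds the request may be slow or time out.
    """
    with urlopen(patent_xref_url(cid), timeout=timeout) as response:
        payload = json.load(response)
    records = payload["InformationList"]["Information"]
    return sum(len(record.get("PatentID", [])) for record in records)


if __name__ == "__main__":
    # Aspirin; needs network access.
    print(count_patent_links(2244))
```

Running the same call for a less common structure gives a quick feel for how much of the aspirin figure is over-mapping rather than meaningful exemplification.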
13. BISTS (BIg Strange ThingS) from patents:
the infamous “Chessbordanes”
• Mainly a SCRIPDB legacy from CWUs
• Still there but more amusing than a serious Con
14. Con: virtuals
• “Deuterogate” example of 1000s of enumerations without reduction to practice
(i.e. no data)
• Unforeseen consequences of flow patents < PubChem
• US20080045558, 506 deuterated codeines (CID 5284371), 206 deuterated
oxycodones (CID 5284603), 3,251 SIDs, SureChEMBL, SCRIPDB, IBM (all CWUs)
• SciFinder extracted 1014 isotopic substances under “biological study”
Preparation and utility of opioid analgesics, Auspex
15. Comparative coverage (I): single patent
Pro: overlaps, Con: divergence
• US9181236B1, 2015, “2-spiro-substituted iminothiazines and their mono- and
dioxides as BACE inhibitors”
• 173 BindingDB CIDs curated from PubChem
• 405 substances SDF from SciFinder OpenBabel > 391 IK > 362 CIDs
• 1657 rows > 834 SureChEMBL IDs > 664 CIDs
• https://pubchem.ncbi.nlm.nih.gov/patent/US918123 gives 742 CIDs
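The two extraction chains on this slide lose structures at each step. As a rough illustration only, the sketch below computes how much survives from one step to the next, using the counts quoted on the slide. The function name and step labels are hypothetical.

```python
def retention(steps):
    """Return (label, count, fraction of the previous step) for each step."""
    rows = []
    previous = None
    for label, count in steps:
        fraction = None if previous is None else count / previous
        rows.append((label, count, fraction))
        previous = count
    return rows


# Counts quoted on the slide for US9181236B1.
scifinder = [("SDF substances", 405), ("InChIKeys", 391), ("CIDs", 362)]
surechembl = [("rows", 1657), ("SureChEMBL IDs", 834), ("CIDs", 664)]

for name, steps in (("SciFinder", scifinder), ("SureChEMBL", surechembl)):
    for label, count, fraction in retention(steps):
        kept = "" if fraction is None else f" ({fraction:.0%} of previous step)"
        print(f"{name}: {count} {label}{kept}")
```

The SureChEMBL chain halves at the first step, while the SciFinder chain loses only a few percent per step. That is in line with the redundancy and fragmentation caveats raised for CNER output.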
16. Comparative coverage (II): patents vs papers
• Intersect of ~0.5 mill CIDs is a Pro, but there are caveats
• ChEMBL extraction from papers is 1.3 mill with the rest confirmed BioAssays
from mostly MLSCN compounds
• Patents include extractions from PubMed abstracts by IBM
• ChEMBL includes the patent extractions of BindingDB (but only 73K)
17. Comparative coverage (III): drug source matches
• Chart is “look back” cumulative CID coverage of INN and Guide to Pharmacology
• From 9479 INNs, 87% have a patent match (n.b. 82% have a ChEMBL match)
• From 7159 in GtoPdb 79% have a patent match
• From 9767 in DrugBank 72% have a patent match
• Caveat: some matches may be from secondary patents (i.e. not first-filings)
18. Pro and con loose ends
• CNER is confounded by dense image tables and poor OCR (e.g. WO PDFs)
• CNER is brainless compared to manual extraction (e.g. CID 2791850)
• CNER pipelines are divergent
• No Markush handling
• Peptide capture is patchy
• Can only filter “in claims” via IBM SID tags
• In bioactivity and SAR terms there are probably no more than ~ 50K A61/C07
quality documents with useful data from last decade
• These cover only ~ 3.5 million bioactives (but ~2x the literature)
• So we could have an overhead of ~ 20 million non-bioactives
19. The security “con”
• Drug discovery organisations that file may prohibit the open searching of
proprietary structures via the PubChem interface outside the firewall
• Notwithstanding, there is no patent case-law precedent for composition-of-
matter claims being challenged on the basis of structures intercepted from an
open server
• Ipso facto prohibition of open searching constitutes a major nailing-of-feet-to-
the-floor
• You can do initial scoping searches from home or your phone anyway
• You can do an InChIKey inner layer search, including against UniChem at 156
mill and Google (~200 mill?) but this is skeleton exact match
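The InChIKey inner layer mentioned in the last bullet is the first 14-character block of the key, which encodes the skeleton (formula and connectivity) without stereochemistry or isotopes. A minimal sketch of extracting and comparing that block follows. The function names are hypothetical, and the example keys are the standard InChIKeys for aspirin, L-alanine and stereo-free alanine as listed in PubChem.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def skeleton_block(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_PATTERN.fullmatch(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key.split("-")[0]


def same_skeleton(key_a, key_b):
    """True when two keys share a skeleton, whatever their stereo or isotopes."""
    return skeleton_block(key_a) == skeleton_block(key_b)


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
l_alanine = "QNAYBMKLOCPYGJ-REOHZZJWSA-N"
flat_alanine = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"  # no stereo layer

print(skeleton_block(aspirin))                # block to paste into a search box
print(same_skeleton(l_alanine, flat_alanine))
print(same_skeleton(aspirin, l_alanine))
```

Because only the first block is compared, a hit says the skeleton matches and nothing more. This is the "skeleton exact match" limitation noted on the slide.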
20. Conclusions
• PubChem open patent chemistry has more Pros than Cons
• Extensive synergy with SureChEMBL as the largest maintained source
• This may be a better first-stop shop for metadata slicing
• Users need to understand CNER quirks, pitfalls
• Difficult to get hard comparative coverage stats but indication is that PubChem
has the majority of exemplified structures from patents
• The non-redundant corpus of quality Med. Chem. patents is not only surprisingly
small but also fully open for text mining
• Those without commercial sources are well enabled for open patent mining
• However, they should be circumspect about relying on it for comprehensive prior-
art and due-diligence checking
• Those with commercial sources now have to perform open searching in parallel anyway
21. Further reading and COI
https://www.ncbi.nlm.nih.gov/pubmed/29451740
https://www.researchgate.net/publication/313264567_Examples_of_SAR-Centric_Patent_Mining_Using_Open_Resources
https://sites.google.com/view/tw2informatics/home
Conflict of interest (minor): has done patent analysis consulting