1. George Papadatos presented on mining compounds, targets, and indications from the patent corpus using SureChEMBL and Open PHACTS.
2. He outlined how SureChEMBL annotates chemicals in patents and links them to biological data for Open PHACTS, and how relevance scoring helps prioritize important entities.
3. Examples were given of using the Open PHACTS API and patent data in SureChEMBL for use cases like identifying key compounds, analyzing chemical spaces, and integrating external data.
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
CINF 55: SureChEMBL: An open patent chemistry resource - George Papadatos
SureChEMBL (https://www.surechembl.org) is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource, in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline leverage a number of technologies for name to structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
Background of the project and simple use cases of using the Open PHACTS API and KNIME to extract compound, target and indication entities from millions of patent documents and infer meaningful links among them. Open PHACTS Linked Data meeting in Vienna.
ChEMBL and KNIME provide an ideal match of open data with open tools. This is a quick overview of how to access ChEMBL data resources and web services (ChEMBL, UniChem, Beaker, myChEMBL, SureChEMBL) via the KNIME platform.
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:... - ChemAxon
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource, in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline make extensive use of ChemAxon technologies for name to structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
Development of machine learning-based prediction models for chemical modulato... - Sunghwan Kim
Presented at the 2018 Research Festival at the National Institutes of Health (NIH) in Bethesda, MD (September 13, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, public-domain bioactivity data available in PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop machine learning-based prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using popular supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The general applicability of the developed models was evaluated with external data sets from ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for bioactivity of small molecules.
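The train-then-validate-externally workflow described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only one of the listed methods (k-nearest neighbors), a Tanimoto similarity on sets of fingerprint bits chosen here for illustration, and entirely made-up toy data. The names `TRAIN`, `EXTERNAL`, `knn_predict` and `accuracy` are hypothetical.

```python
from collections import Counter


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_predict(train, query, k=3):
    """Return the majority label among the k training items most similar to query."""
    ranked = sorted(train, key=lambda item: tanimoto(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, external, k=3):
    """Fraction of external-set items whose predicted label matches the known label."""
    hits = sum(knn_predict(train, fp, k) == label for fp, label in external)
    return hits / len(external)


# Toy, made-up fingerprints standing in for a screening-derived training set.
TRAIN = [
    (frozenset({1, 2, 3, 4}), "active"),
    (frozenset({1, 2, 3, 5}), "active"),
    (frozenset({2, 3, 4, 6}), "active"),
    (frozenset({7, 8, 9}), "inactive"),
    (frozenset({7, 8, 10}), "inactive"),
    (frozenset({8, 9, 11}), "inactive"),
]

# Toy external set, kept separate from training, used only for evaluation.
EXTERNAL = [
    (frozenset({1, 2, 3}), "active"),
    (frozenset({7, 8, 9, 10}), "inactive"),
]

if __name__ == "__main__":
    print(f"external accuracy: {accuracy(TRAIN, EXTERNAL):.2f}")
```

Keeping the external set out of training is the point of the design: it estimates how well a model generalizes to compounds from a different source, which is what the abstract means by general applicability.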
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou... - Graham Smith
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsourcing. The history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
An 11-year-old presentation submitted as project work: Golden Mantra to Perform Worldwide Patent Searches
A patent provides the right to exclude others from making, using, selling, offering for sale, or importing the patented invention for the term of the patent, usually 20 years from the filing date. A patent is, in effect, a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public. Like any other property right, it may be sold, licensed, mortgaged, assigned or transferred, given away, or simply abandoned.
In order to obtain a patent, an applicant must provide a written description of his or her invention in sufficient detail for a person skilled in the art (i.e., the relevant area of technology) to make and use the invention.
Using open bioactivity data for developing machine-learning prediction models... - Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 22, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, open bioactivity data available at PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using various supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The performance of the models was evaluated with an external data set containing bioactivity data submitted by ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for chemical toxicity.
PubChem for drug discovery in the age of big data and artificial intelligence - Sunghwan Kim
Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 10, 2021).
==== Abstract ====
With the emergence of the age of big data and artificial intelligence, biomedical research communities have a great interest in exploiting the massive amount of chemical and biological data available in the public domain. PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the largest sources of publicly available chemical information, with more than 270 million substance descriptions, more than 110 million unique compounds, and more than 285 million bioactivity outcomes from more than one million biological assay experiments. PubChem provides a wide range of chemical information, including structure, pharmacology, toxicology, drug target, metabolism, chemical vendors, patents, regulations, clinical trials, and many others. These contents can be accessed interactively through web browsers as well as programmatically using computer scripts. They can also be downloaded in bulk through the PubChem File Transfer Protocol (FTP) site. PubChem data have been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug repurposing or off-target side effect prediction). This presentation provides an overview of PubChem data, tools, and services useful for drug discovery.
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted, searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. The product was launched under a freemium business model and licensed by a community of users. Later, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of structure-activity relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
Searching for patent information in PubChem - Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource, containing more than 242 million chemical substance descriptions, 94 million unique compounds, and 234 million bioactivities determined from 1.25 million assay experiments. Importantly, data contributions from multiple sources, including IBM, SureChEMBL, ScripDB, NextMove, and BindingDB, allow PubChem to provide links to patent documents that mention chemicals. Currently, PubChem offers links between about 6.7 million patent documents and more than 20 million unique chemical structures, with over 137 million compound-patent links, covering primarily U.S. patents, with some European, World Intellectual Property Organization, and Japanese patent documents. This presentation will provide an overview of the patent information in PubChem as well as best practices for using it.
ChemSpider is being built as a chemical-structure-centric community for chemists. With over 16 million chemical structures as of August 2007, and with data deposition and curation mechanisms in place for text, structures and spectra, ChemSpider intends to be a meeting place and collaborative environment where chemists can work together.
How can you access PubChem programmatically? - Sunghwan Kim
Presented at the 255th American Chemical Society (ACS) National Meeting in New Orleans, LA (March 19, 2018).
Building automated workflows that exploit the vast amount of data contained in PubChem requires programmatic access to the data through application programming interfaces (APIs). PubChem provides several programmatic access routes to its data, including Entrez Utilities (E-Utilities or E-Utils), PubChem Power User Gateway (PUG), PUG-SOAP, PUG-REST, PUG-View, and a REST-ful interface to PubChemRDF. This presentation provides an overview of these programmatic access tools, including recent updates, limitations, usage policies, and best practices.
*References*
(1) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem, Nucleic Acids Research, 2015, 43(W1):W605–W611. https://doi.org/10.1093/nar/gkv396
(2) An update on PUG-REST: RESTful interface for programmatic access to PubChem, Nucleic Acids Research, 2018, 46(W1):gky294. https://doi.org/10.1093/nar/gky294
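As an illustration of the PUG-REST route mentioned in this abstract, the sketch below assembles request URLs in the documented `domain/namespace/identifier/operation/output` pattern. The helper name `pug_rest_url` is hypothetical, and the two example operations (a property lookup and a patent cross-reference lookup) are assumptions about typical usage rather than anything shown in the presentation. The code only builds URLs and does not contact PubChem.

```python
from urllib.parse import quote

# Base address of the PUG-REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(domain, namespace, identifier, operation, output="JSON"):
    """Build a PUG-REST URL: <base>/<domain>/<namespace>/<identifier>/<operation>/<output>.

    The identifier is percent-encoded so that names containing spaces are safe;
    commas are left intact so that comma-separated ID lists still work.
    """
    encoded_id = quote(str(identifier), safe=",")
    return "/".join([PUG_REST_BASE, domain, namespace, encoded_id, operation, output])


if __name__ == "__main__":
    # Molecular formula of a compound looked up by name.
    print(pug_rest_url("compound", "name", "aspirin", "property/MolecularFormula"))
    # Patent identifiers cross-referenced to a compound ID (CID).
    print(pug_rest_url("compound", "cid", 2244, "xrefs/PatentID"))
```

A URL built this way can be fetched with any HTTP client, for example `urllib.request.urlopen`, subject to the usage policies the presentation refers to.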
Tom Dexheimer, Ph.D.
Manager, MSU Assay Development and Drug Repurposing Core
Michigan State University
Drug Discovery Interdisciplinary Event May 14, 2015
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery? - Dr. Haxel Consult
Fernando Huerta (RISE Bioscience & Materials, SE)
Alexander Minidis (Collaborative Drug Discovery - CDD VAULT, Sweden)
How much information do scientists need to design new potential drugs?
A thorough overview of public scientific information sources (open access) and methods to collect, process, analyse and visualize this information will be presented. A direct application of such freely available information in conjunction with freeware will be described in relation to the efforts of the scientific community to find effective medicines for the Zika virus.
1. Mining Compounds, Targets and Indications
from the Patent Corpus with SureChEMBL
and Open PHACTS
George Papadatos, PhD
ChEMBL Group, EMBL-EBI
georgep@ebi.ac.uk
#ShefChem16
2. Outline
• Overview of chemical annotations in SureChEMBL
• Biological annotations for Open PHACTS
• Relevance scoring
• Open PHACTS patent API
• Use cases and examples
4. What do people do with ChEMBL?
• Chemical space clustering & visualisation
• (Q)SAR analysis and Virtual Screening
• Data modeling, activity cliffs, Free-Wilson, MMP analysis
• Bioisosteric replacements mining
• De novo drug design
• In silico evolutionary compound optimisation
• Target prediction (target fishing)
• Off-target prediction and ADR analysis
• Polypharmacology networks
• Druggability / Drug-likeness
• Text mining
• Third-party application integration
5. Why look at patent documents?
• Patent filing and searching
• Legal, financial and commercial incentives & interests
• Prior art, novelty, freedom to operate searches
• Competitive intelligence
• Unprecedented wealth of knowledge
• Most of the knowledge will never be disclosed anywhere else
• Compounds, scaffolds, reactions, formulations
• Biological targets, sequences, diseases, indications
• Average lag of 2-4 years between patent document and journal
publication disclosure for chemistry, 4-5 years for biological targets
Rodriguez-Esteban & Bundschus. Text Mining Patents for Biomedical Knowledge. Drug Discovery Today 2016.
6. From SureChem to SureChEMBL
• 2007: SureChem is founded
• 2010: Acquired by Macmillan
• 2012: Revamped commercial product: SureChemDirect
• Modular, cloud-based architecture
• Dec 2013: Donated to EMBL-EBI
• Sept 2014: SureChEMBL UI is launched
• Jan 2015: Data client and map file dumps
• April 2015: Bio-annotations are added
• May 2016: SureChEMBL data available via Open PHACTS API
16. SureChEMBL Data Content and Growth
• > 18M unique structures
• ~13M are in a drug-like phys-chem space
• > 15M chemically annotated patents
• ~5M are life-science relevant (based on patent codes)
• ~80,000 novel compounds extracted from ~50,000 new
patents monthly
• 1–7 days for a published patent to be chemically annotated,
indexed and searchable in SureChEMBL
• SureChEMBL provides search access to all ~120M patents, not just
the chemically annotated ones
17. ChEMBL-SureChEMBL overlap
• Connectivity match on single components (UniChem)
[Venn diagram: ChEMBL (1.5M) and SureChEMBL (17M); overlap 21.4%]
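A connectivity match of this kind can be sketched as follows. This is only a minimal illustration, assuming that "connectivity match" means agreement on the first block of the InChIKey (the part that encodes the molecular skeleton and ignores stereochemistry). The function names are hypothetical, the InChIKey strings are used purely for illustration, and none of this is UniChem code.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first InChIKey block (skeleton/connectivity hash)."""
    return inchikey.split("-")[0]


def connectivity_overlap(keys_a, keys_b):
    """Return the connectivity blocks present in both collections."""
    blocks_a = {connectivity_block(k) for k in keys_a}
    blocks_b = {connectivity_block(k) for k in keys_b}
    return blocks_a & blocks_b


# Illustrative keys only: the two HEFNNWSXXWATRW entries share a skeleton
# but differ in the second (stereo/isotope) block.
curated_keys = ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "HEFNNWSXXWATRW-UHFFFAOYSA-N"]
patent_keys = ["HEFNNWSXXWATRW-JTQLQIEISA-N", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"]

shared = connectivity_overlap(curated_keys, patent_keys)
print(shared)
```

Matching on the first block alone treats stereoisomers as the same compound, which is usually the desired behaviour when one source records stereochemistry less reliably than the other.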
22. Common sources of errors
• Small, poor quality images
• OCR errors in names (OCR done by IFI). There is an OCR correction
step, but it cannot fix all errors
Becomes: ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-
pyrazol-3- vDbenzamide’
• Reliability better for US patents due to inclusion of mol files
24. What about testing?
• Rigorous testing hampered by lack of good test sets
• Akhondi corpus* annotates compounds, genes and diseases
• → But not normalised to identifiers
• → Does not convert to compound structures
• → Identifies all entities, not relevant entities
• BioCreative competition corpus covers only abstracts (21K)
• Comparison with ‘gold standard’/’ground truth’ data sets
• SciFinder for compounds / Other sets for relevance
• GVK BIO and BindingDB for targets
• BUT not necessarily comprehensive, so can’t assess precision
* Akhondi et al (2014)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477
25. Comparison of compound coverage
• Comparison of compounds with SciFinder
• Set of 47 patents from Akhondi* corpus
• SciFinder compounds for these extracted
• 65% of SciFinder compounds were found by SureChEMBL
• ‘Missed’ compounds included some cases that were out of scope
for SureChEMBL pipeline (e.g., Markush, tables)
* Akhondi et al (2014)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477
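The coverage figure above is a recall measurement against a reference set. The sketch below shows the calculation on toy data. The identifiers and counts are invented for illustration and are not the actual 47-patent comparison.

```python
def recall(reference: set, extracted: set) -> float:
    """Fraction of reference compounds that the pipeline also found."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(reference & extracted) / len(reference)


# Toy data: 20 hypothetical reference compounds, 13 of which were extracted,
# plus 5 extracted compounds that the reference set does not contain.
reference = {f"REF-{i:02d}" for i in range(20)}
extracted = {f"REF-{i:02d}" for i in range(13)} | {f"EXTRA-{i}" for i in range(5)}

print(f"recall = {recall(reference, extracted):.2f}")

# The 5 extras cannot be counted as false positives when the reference set
# may itself be incomplete, which is why precision is hard to assess.
unjudged = extracted - reference
print(len(unjudged), "extracted compounds absent from the reference set")
```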
26. Accessing SureChEMBL data
• Web interface
• Data client feed: local compound-patent db schema, daily updates
• Compound-patent map file: quarterly updates
• UniChem: weekly updates, access to 135M structures
http://chembl.blogspot.co.uk/2015/08/accessing-surechembl-data-in-bulk.html
27. Identify key compounds in patent docs
• What are key compounds in med. chem. patents?
• ‘Best’, most promising/important compound
• Drug candidate
• Optimal property profile
• Most active
• Automated identification of key compounds in patents
• Using compound information only
• Main assumption: busy chemical space around key compounds
• Number of NNs, MCS extraction and R-group frequency
Hattori, K., Wakabayashi, H., & Tamaki, K. (2008). JCIM, 48(1), 135–142. doi:10.1021/ci7002686
Tyrchan, C., Boström, J., Giordanetto, F., Winter, J., & Muresan, S. (2012). JCIM, 52(6), 1480–1489.
doi:10.1021/ci3001293
28. Workflow
1. Read a file with chemistry extracted from the Levitra family of patents
• US6566360 [1]
2. Filter by different text-mining and chemoinformatics properties to
remove noise and enrich the genuinely novel structures
3. Visualise the chemical space using MDS and dimensionality reduction
• Identify and fix outliers in the chemistry space.
4. Find the Murcko and MCS scaffolds
5. Compare the derived MCS core with the actual Markush structure
6. Identify key compounds using structural information only
• Count number of NNs per compound
• R-Group decomposition and frequency of R-Groups
[1] Sayle, R. et al. (2012). JCIM, 52(1), 51–62. doi:10.1021/ci200463r
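Step 6 of the workflow (counting nearest neighbours per compound) can be sketched in plain Python. This is a minimal illustration and not the notebook's code. Fingerprints are represented as sets of "on" bit indices, which in practice would come from a cheminformatics toolkit. The bit sets, the compound names and the 0.7 similarity threshold are all assumptions.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_neighbours(fingerprints: dict, threshold: float = 0.7) -> dict:
    """Count, for each compound, how many others are at least `threshold` similar."""
    counts = {name: 0 for name in fingerprints}
    for (n1, f1), (n2, f2) in combinations(fingerprints.items(), 2):
        if tanimoto(f1, f2) >= threshold:
            counts[n1] += 1
            counts[n2] += 1
    return counts


# Hypothetical fingerprints: three analogues and one unrelated small molecule.
fingerprints = {
    "cmpd_A": frozenset({1, 2, 3, 4, 5, 6, 7, 8}),
    "cmpd_B": frozenset({1, 2, 3, 4, 5, 6, 7, 9}),
    "cmpd_C": frozenset({1, 2, 3, 4, 5, 6, 8, 10}),
    "solvent": frozenset({20, 21}),
}

counts = count_neighbours(fingerprints)
key_compound = max(counts, key=counts.get)
print(counts, key_compound)
```

The compound with the most neighbours sits in the busiest region of the patent's chemical space, which is the main assumption behind the key-compound heuristic.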
29. Scaffold and chemical space analysis
http://nbviewer.ipython.org/github/rdkit/UGM_2014/blob/master/Notebooks/Vardenafil.ipynb
31. SureChEMBL bio-annotation
• SciBite’s Termite text-mining engine run on 4M life-science patents
from SureChEMBL corpus
• Genes (identified by HGNC symbols, e.g. FDFT1) and diseases
(identified by MeSH IDs, e.g. D009765) annotated
• Section/frequency information annotated (e.g., in title, abstract,
claims, total frequency)
• Relevance score (0-3) to flag important chemical and biological
entities and remove noise
32. Relevance scoring
• Patents designed to obfuscate – not all entities are equal
Over 50 proteins mentioned!
SRC proto-oncogene, non-receptor tyrosine kinase ; aurora kinase A ; insulin ; microtubule-associated protein tau ; catenin (cadherin-
associated protein), beta 1, 88kDa ; interleukin 1, alpha ; c-src tyrosine kinase ; phosducin-like 2 ; colony stimulating factor 2 (granulocyte-
macrophage) ; tumor necrosis factor ; glycogen synthase kinase 3 beta ; amyloid beta (A4) precursor protein ; asparaginase ; interferon,
beta 1, fibroblast ; jun proto-oncogene ; coagulation factor II (thrombin) ; acetylcholinesterase ; FGR proto-oncogene, Src family tyrosine
kinase ; BLK proto-oncogene, Src family tyrosine kinase ; angiotensin I converting enzyme ; albumin ; axin 1 ; heat shock transcription
factor 1 ; v-myb avian myeloblastosis viral oncogene homolog ; NADH dehydrogenase, subunit 1 (complex I) ; LYN proto-oncogene, Src
family tyrosine kinase ; LCK proto-oncogene, Src family tyrosine kinase ; CCAAT/enhancer binding protein (C/EBP), alpha ; HCK proto-
oncogene, Src family tyrosine kinase ; ATP citrate lyase ; Glycogen synthase kinase 3 ; beta-catenin ; interferon ; tnf superfamily ; lactate
dehydrogenase ; neurotrophic factor ; pyruvate kinase ; fibroblast growth factor ; glycogen synthase ; tumor necrosis factor alpha ; serine
threonine kinase ; tau protein ; interleukin ; albumin ; thrombin ; eif2b ; ion channel ; cytokine ; heat shock factor ; axin ; cAMP
Response element binding protein ; amyloid beta ; Microtubule Interacting Protein ; c-Jun N-terminal kinases ; src homology ; calcium
channel ; atpase ; acetylcholinesterase
Some more interesting than others…
The method of claim 27, wherein the method comprises inhibiting Aurora-2, GSK-3, or Src activity
Slide by Lee Harland
36. Relevance scoring – genes/diseases
• Various features used:
• Term frequency
• Position (title, abstract, figure, caption, table)
• Frequency distribution
• Scores range from 0 – 3
• 3 – most important entities in the patent
• 2 – important entities in the patent
• 1 – mentioned entities in the patent
• 0 – ambiguous entity/likely annotation error
37. Relevance scoring - compounds
• Main assumptions for relevance:
1. Very frequent compounds are usually irrelevant
2. Compounds with busy chemical space around them are interesting
• Use distribution of close analogues (NNs) among compounds found in the same
patent family – FCFP_4 similarity
• Scores range from 0 – 3
• 3 – highest number of NNs: most important entities in the patent
• 2 – important entities in the patent
• 1 – few NNs: mentioned entities in the patent
• 0 – singletons or trivial entities, most likely errors or reagents, solvents,
substituents
Hattori, K., Wakabayashi, H., & Tamaki, K. (2008). JCIM, 48(1), 135–142. doi:10.1021/ci7002686
Tyrchan, C., Boström, J., Giordanetto, F., Winter, J., & Muresan, S. (2012). JCIM, 52(6), 1480–1489.
doi:10.1021/ci3001293
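One way to turn the two assumptions above into a 0-3 score is sketched below. The slides do not give the actual thresholds, so the binning rule (top count scores 3, at least half of the top count scores 2) and the corpus-frequency cut-off are purely hypothetical, as are the compound names and numbers.

```python
def compound_relevance(nn_counts: dict, corpus_freq: dict,
                       max_corpus_freq: int = 10_000) -> dict:
    """Assign a 0-3 relevance score per compound (hypothetical binning rule)."""
    top = max(nn_counts.values(), default=0)
    scores = {}
    for cid, nn in nn_counts.items():
        if nn == 0 or corpus_freq.get(cid, 0) > max_corpus_freq:
            scores[cid] = 0   # singleton, or too frequent across the corpus
        elif nn == top:
            scores[cid] = 3   # busiest chemical space in the patent family
        elif nn >= top / 2:
            scores[cid] = 2
        else:
            scores[cid] = 1   # few near neighbours
    return scores


# Hypothetical neighbour counts within one patent family, and the number of
# patents in the whole corpus that mention each compound.
nn_counts = {"lead": 12, "analogue": 7, "side_product": 2,
             "reagent": 0, "common_solvent": 5}
corpus_freq = {"lead": 3, "analogue": 2, "side_product": 1,
               "reagent": 40, "common_solvent": 250_000}

scores = compound_relevance(nn_counts, corpus_freq)
print(scores)
```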
41. Open PHACTS Patent API
[Diagram: Disease, Compound, Target and Patent nodes joined by extracted links]
https://dev.openphacts.org/docs/2.1
42. Open PHACTS Patent API
[Diagram: Disease, Compound, Target and Patent nodes joined by extracted and inferred links]
43. Use case #1: Patent to Entities
1. From a patent get compounds, genes and diseases
2. Filter to remove noise
• Frequency and relevance score
3. Process and visualise
[Diagram: starting from a Patent, the linked Compound, Target and Disease entities are the unknowns (?)]
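The filtering step in this use case can be sketched as below, assuming the annotations for one patent have already been retrieved. The record layout, the thresholds, and the frequency and relevance values are hypothetical. The entity names are taken from the relevance-scoring slide, and no real API response is reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    entity_type: str   # "compound", "target" or "disease"
    name: str
    frequency: int     # mentions in the patent
    relevance: int     # relevance score, 0-3


def filter_annotations(annotations, min_frequency=2, min_relevance=2):
    """Keep entities passing both thresholds, grouped by entity type."""
    kept = {}
    for ann in annotations:
        if ann.frequency >= min_frequency and ann.relevance >= min_relevance:
            kept.setdefault(ann.entity_type, []).append(ann.name)
    return kept


# Hypothetical annotations for a single patent.
annotations = [
    Annotation("target", "aurora kinase A", 25, 3),
    Annotation("target", "glycogen synthase kinase 3 beta", 18, 3),
    Annotation("target", "albumin", 1, 1),
    Annotation("target", "insulin", 6, 1),
    Annotation("compound", "example_compound_12", 9, 2),
    Annotation("compound", "example_solvent", 30, 0),
]

kept = filter_annotations(annotations)
print(kept)
```

Applying both thresholds together matters: a frequent but low-relevance entity (the solvent) and a rarely mentioned one (albumin) are both dropped.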
49. Does it make sense?
• Calculate MCS
US-7718693-B2
50. Use case #2: Drug targets & indications for compound
1. Search patents for a compound (approved drug)
2. Filter to remove noise
• Frequency, relevance score and classification code
3. For remaining patents, get disease and target entities
4. Filter to remove noise
• Frequency and relevance score
5. Visualise results
[Diagram: Disease, Compound, Target and Patent nodes]
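Steps 2 to 4 of this use case amount to a filter followed by an aggregation, which might look like the sketch below. The patent records, field names, thresholds and the choice of classification-code prefixes are all assumptions made for illustration, and the data are invented.

```python
from collections import Counter


def rank_targets(patents, min_compound_relevance=2, min_target_relevance=2,
                 allowed_code_prefixes=("A61", "C07")):
    """Count relevant targets across patents in which the compound is relevant."""
    counts = Counter()
    for patent in patents:
        if patent["compound_relevance"] < min_compound_relevance:
            continue  # compound is only mentioned in passing
        if not patent["classification"].startswith(allowed_code_prefixes):
            continue  # classification code outside the assumed scope
        for target, relevance in patent["targets"].items():
            if relevance >= min_target_relevance:
                counts[target] += 1
    return counts.most_common()


# Hypothetical patents mentioning the query compound.
patents = [
    {"id": "P1", "compound_relevance": 3, "classification": "C07D",
     "targets": {"PARP1": 3, "albumin": 1}},
    {"id": "P2", "compound_relevance": 2, "classification": "A61K",
     "targets": {"PARP1": 2, "PARP2": 2}},
    {"id": "P3", "compound_relevance": 1, "classification": "A61K",
     "targets": {"insulin": 3}},
    {"id": "P4", "compound_relevance": 3, "classification": "G06F",
     "targets": {"PARP1": 3}},
]

ranking = rank_targets(patents)
print(ranking)
```

The same aggregation works for diseases by swapping the `targets` field for a disease mapping.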
58. From target to scaffolds
PARP1
Niraparib, Phase 3
Olaparib, FDA-approved
Talazoparib, Phase 3
WO-2008020180-A2
US-20100035883-A1
US-20080167345-A1
59. Take home message
• It is now possible to search and interlink the key structures,
scaffolds, targets and diseases from the med. chem. patent
corpus automatically
• By high-throughput text-mining only
• Thanks to simple heuristics (relevance scores and frequency)
• Using the Open PHACTS API
• For the first time in a free resource at such scale
[Diagram: Disease, Compound, Target and Patent nodes]
60. What next: Other ideas and use cases
• Target validation / Druggability
• For a target, get me all related and relevant diseases
• Compare with DisGeNET / Open Targets, etc.
• Any known patented scaffolds for my target?
• Start a pharmacophore hypothesis for patent busting
• Find a tool compound for a new assay
• Novelty checking / Due diligence
• What do we know about this scaffold / compound?
• Add ChEMBL pharmacology and pathway information
63. BindingDB patent set
• BindingDB manually extracts binding activities from tables
in granted US med. chem. patents
• Coverage 2013-2016, currently ~1,100 patents
• On-going effort
• Includes exact and binned bioactivities
• Ki, IC50, Kd, EC50 activity types
• Compounds mapped to example numbers in patent tables
• Detailed assay description
• http://www.bindingdb.org/bind/ByPatent.jsp
64. BindingDB patent bioactivity integration
• 1,017 US patents (mapped to SureChEMBL IDs)
• 68.6K compounds
• ~70 unique cmpds per patent
• Only 5% exist in ChEMBL (exact match) → complementary
• 450 single protein targets
• ~10% not in ChEMBL, e.g. GPR4 in US-8748435-B2
• 100,329 bioactivity values
• ~100 activities per patent
• Compounds and targets curated by ChEMBL
• Available in ChEMBL22
http://www.bindingdb.org/bind/ByPatent.jsp
66. Conclusion: current patent data landscape
• ChEMBL
• Literature SAR data, drug annotations, manually curated
• Coming soon: bioactivities from patents
• SureChEMBL
• Chemical annotations automatically mined from patents
• Open PHACTS
• The above plus biological annotations automatically mined
from patents, with relevance scoring
• Public API
70. Acknowledgements
• ChEMBL and SureChEMBL
• Anna Gaulton
• Mark Davies
• Nathan Dedman
• James Siddle
• Anne Hersey
• SciBite
• Lee Harland
• Open PHACTS consortium
• Nick Lynch
• Daniela Digles
• Antonis Loizou
• surechembl-help@ebi.ac.uk