This presentation covers mining drug targets, structures, and activity data from open full-text patent sources using web tools. Patents can yield novel bioactive chemical structures and target/assay data that are not found in journals, but mining them is challenging because the documents are large and data presentation varies widely. The presentation outlines open sources and tools for patent mining, including PubChem, which contains over 6 million patent-derived structures. With these open tools and sources, targeted patent mining can complement proprietary databases and expedite drug discovery.
Biological literature mining - from information retrieval to biological disco...Lars Juhl Jensen
14th International Conference on Intelligent Systems for Molecular Biology, Tutorial, Fortaleza Conference Center, Fortaleza, Brazil, August 6-10, 2006
USUGM 2014 - Gregory Landrum (Novartis): What else can you do with the Marku...ChemAxon
In a collaboration with ChemAxon we have developed a web-based interface for searching, browsing and managing chemical information. The system was designed to capture the information that users stored in various documents in local files (such as PDFs, PowerPoint slides and images). These bits of information were not centrally available, and when people moved on, this data was lost.
ChemAxon’s JChem Cartridge, with its Markush extensions and Document to Database tool, enabled us to collect this data. It also serves as a good basis for future developments. When developing this new interface, we focused on ease of use, maintainability, and flexibility.
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
Specialty gene sets, such as virulence factors and antibiotic resistance genes, are of particular interest to infectious disease researchers. Much of the information about the function of specialty genes is described in the literature but unavailable as structured data in bioinformatics databases. The steadily increasing volume of literature makes it difficult to manually find relevant papers and extract assertion sentences about specialty genes. This presentation describes efforts to build an automatic classifier for such sentences. Experiments were conducted to assess the impact of the imbalance of positive and negative examples in source documents on classification; develop a support vector machine (SVM) classifier using a term frequency-inverse document frequency (TF-IDF) representation of text; and assess the marginal benefit of additional training examples on the quality of the classifier. Analysis of learning curves indicates that additional training examples are unlikely to improve the quality of the classifier. We discuss options for other text representation schemes to investigate in order to improve the quality of the classifier as measured by F-score.
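The TF-IDF representation and the F-score mentioned in this abstract can be illustrated with a minimal sketch. This is not the presenters' implementation: the SVM training step is left out, the weighting used here (term frequency times log(N/df)) is only one common TF-IDF variant, and the example sentences and the function names `tf_idf` and `f_score` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def f_score(true_labels, predicted_labels, beta=1.0):
    """F-beta score for binary labels (1 = assertion sentence)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical sentences, invented for illustration only.
sentences = [
    "the gene encodes a virulence factor".split(),
    "the gene confers antibiotic resistance".split(),
    "samples were stored at room temperature".split(),
]
vectors = tf_idf(sentences)
```

Terms that occur in fewer sentences (such as "virulence") receive a higher weight than terms shared across sentences (such as "gene"), which is the property a downstream classifier relies on.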
Data analysis & integration challenges in genomicsmikaelhuss
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
Short tutorials on how to use the web-based tool DAVID (Database for Annotation, Visualization and Integrated Discovery) - http://david.abcc.ncifcrf.gov/
DAVID provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.
14th International Conference on Intelligent Systems for Molecular Biology, Software demo, Fortaleza Conference Center, Fortaleza, Brazil, August 6-10, 2006
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...Dr. Haxel Consult
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands-on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
PubChem: a public chemical information resource for big data chemistrySunghwan Kim
Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.
==== Abstract ====
The idea of “big data” has recently been drawing much attention of the scientific community as well as the general public. An example of big data in Chemistry is the data contained in PubChem, which is a public database of chemical substance descriptions and their biological activities at the National Institutes of Health. PubChem is a sizeable system with 235 million depositor-provided substance descriptions, 96 million unique chemical structures, 1.1 million biological assays, and 268 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents and more. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering multi-target ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of how PubChem’s data, tools, and services can be used for bioassay data analysis and virtual screening (VS) and discusses important aspects of exploiting PubChem for drug discovery.
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted and searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. This product was launched and licensed by a user community with a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of Structure Activity Relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
Closing the gap between chemistry and biology: Joining between text tombs and...Chris Southan
Progress in the biomedical sciences is critically dependent on explicit chemical structures and bioactivity results described in text. This applies across drug discovery, pharmacology, chemical biology, and metabolomics. However, the entombing of the majority of these structures and associated data within patents, papers, abstracts and web pages has been a major barrier to progress. This presentation introduces the current public information flow from documents and its associated barriers, such as inadequate author specification of structures, journal paywalls precluding text mining and the patchiness of MeSH chemistry annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering these barriers. These include the Google merge of over 50 million InChIKeys from PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to 15 million. In addition, options such as Open Lab Books and figshare are expanding the choices for surfacing new structures. Methods will be outlined for establishing document-to-document and document-to-database links via chemical structures. These include the PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for image-to-structure conversion, Venny for set comparisons and InChIKey searching in Google [1]. Combined use of these approaches to make joins between patents, papers, abstracts, chemical database entries, SAR data and drug target protein sequences will be illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and company code numbers in the NCATS repurposing list.
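The idea of joining documents via structures and comparing the resulting sets (the role Venny plays in the abstract) can be sketched with plain Python sets keyed on InChIKeys. This is an illustrative sketch only, not the presenter's workflow: the function names are hypothetical, and the assignment of the three example compounds (aspirin, caffeine, paracetamol) to a "patent" or "paper" set is invented for the example.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def valid_keys(keys):
    """Normalize the strings and keep only those shaped like an InChIKey."""
    cleaned = (key.strip().upper() for key in keys)
    return {key for key in cleaned if INCHIKEY_PATTERN.match(key)}

def compare_sources(patent_keys, paper_keys):
    """Three-way split of two key sets, as in a two-set Venn diagram."""
    patents, papers = valid_keys(patent_keys), valid_keys(paper_keys)
    return {
        "patent_only": patents - papers,
        "shared": patents & papers,
        "paper_only": papers - patents,
    }

# Illustrative input: which set each compound sits in is an assumption.
ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
CAFFEINE = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
PARACETAMOL = "RZVAJINKPMORJF-UHFFFAOYSA-N"

result = compare_sources(
    patent_keys=[ASPIRIN, CAFFEINE, "not-a-key"],
    paper_keys=[CAFFEINE.lower(), PARACETAMOL],
)
```

Because an InChIKey is a fixed-length hash of the structure, exact string matching is enough for the join; malformed entries are dropped before the comparison rather than silently counted as unique structures.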
Exploiting PubChem for drug discovery based on natural productsSunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem is one of the largest sources of publicly available chemical information, with more than 242.3 million depositor-provided substance descriptions, 94.7 million unique chemical structures, and 234.8 million bioactivity outcomes from 1.25 million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery based on natural products.
PubChem contains a large amount of bioactivity data, most of which are generated from high-throughput screening (HTS). However, these data also include a substantial amount of bioactivity information extracted from scientific articles published in journals in the chemical biology, medicinal chemistry, and natural product domains, thanks to data contribution by other databases like ChEMBL, Guide to Pharmacology, BindingDB, and PDBbind. In addition, through data integration with other databases such as DrugBank, HSDB, and HMDB, PubChem contains a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity search, 2-D and 3-D similarity searches, substructure and superstructure searches, and molecular formula search. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem’s data into their own in-house data on a local computing machine.
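As a minimal sketch of the PUG-REST route listed above, the following builds a request URL that looks up compound properties by name. The helper name `compound_property_url` and the chosen default properties are assumptions for illustration; the block only constructs the URL and makes no network request.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def compound_property_url(name, properties=("InChIKey", "MolecularFormula"),
                          output="JSON"):
    """Build a PUG-REST URL that looks up compound properties by name."""
    return "/".join([
        PUG_REST_BASE,
        "compound", "name", quote(name, safe=""),  # percent-encode the name
        "property", ",".join(properties),
        output,
    ])

url = compound_property_url("aspirin")
spaced_url = compound_property_url("acetylsalicylic acid", properties=("InChIKey",))
```

The resulting URL could then be fetched with any HTTP client; request throttling and error handling are deliberately left out of this sketch.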
Presented to David Gloriam's Group, Copenhagen, Feb 2020
**********************************
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in limbo land between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and exact positions of radiolabels are also problematic. However, target-mapped citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
Vicissitudes of target validation for BACE1 and BACE2 Chris Southan
Introduction/Background & Aims
The beta-amyloid (APP) cleaving enzyme (BACE1) was implicated as a drug target for Alzheimer's Disease (AD) back in 1999. In 2011, the paralogue, BACE2, became a new proposed target for type II diabetes (T2DM) having been reported to be the TMEM27 secretase regulating pancreatic beta-cell function [1]. By 2019 the accumulated evidence, including a swathe of failed clinical trials for BACE1 inhibitors, has produced a de facto de-validation of both targets in both diseases. As a learning exercise, the series of events leading up to this is reviewed here.
Method/Summary of work
Basic information about these two targets and the lead compounds against them were sourced via the IUPHAR/BPS Guide to Pharmacology (GtoPdb) as Target ids: 2330 and 2331, for BACE1 and 2, respectively. This was consolidated by a literature and patent review as well as following them in other databases. The most recent information on clinical trials was sourced from press releases.
Results/Discussion
GtoPdb annotates 24 lead compounds against BACE1 and 12 against BACE2. The corresponding counts mapped to these targets in ChEMBL are 8741 and 1377, making BACE1 one of the most actively pursued enzyme targets ever. Notwithstanding the massive global effort, during 2018 Merck’s verubecestat and J&J’s atabecestat BACE1 inhibitors not only failed their Phase III endpoints but even appeared to worsen cognition in prodromal patients. In 2019 Amgen/Novartis stopped Phase II/III trials of umibecestat, which also showed more cognitive decline in the treatment group compared to controls. BACE2 presented an anomalous situation in several ways. By 2016 both Novartis and Amgen declared their inability to reproduce the TMEM27 secretase turnover reported in 2011. Notwithstanding, Novartis and other companies have published patents on BACE2-specific inhibitors over several years and, paradoxically, verubecestat is more potent against BACE2 than BACE1 but was never tested for glucose-lowering. Equally puzzling is that one academic group is still publishing BACE2 inhibitors for T2D even post de-validation. One thing both targets have in common is the complete absence of genetic support from genome-wide disease association studies, but this warning sign went unheeded.
Conclusions
The massive waste of resources on the pursuit of BACE1 as an AD target over the last two decades is catastrophic. This tale of de-validation is compounded for this paralogous pair of enzymes by the fact that the original evidence for BACE2 as a T2D target was eventually refuted. The story of these targets highlights a range of crucial pharmacological pitfalls that must be avoided in the future.
Reference(s)
[1] Southan C, Hancock J.M. (2013) A tale of two drug targets: the evolutionary history of BACE1 and BACE2. Front Genet. 4:293.
In silico 360 Analysis for Drug DevelopmentChris Southan
Introduction:
Consequent to a memorandum of understanding between the Karolinska Institutet and the International Union of Basic and Clinical Pharmacology (IUPHAR) in 2018, a report on academic drug development, including guidelines (ADEV), has been drafted [1]. As part of this exercise, we conceived a triage for comprehensive informatics profiling around the compound, target, disease axis. We have termed this “in silico 360” (INS360), the aim of which was to support ADEV teams since they may lack either internal expertise or external support to do this on their own. Indeed, some past SciLifeLab Drug Discovery and Development Platform projects had been halted because of overlooked competitive impingements or insufficient target validation evidence.
Methods
We assessed the current database landscape, mostly public but including commercial, for potential utility for INS360. We were guided primarily by content coverage, usability, and reputation. We also explored some open property prediction resources for assay interference and toxicological inferences.
Results:
As a first-stop-shop, we selected the IUPHAR/BPS Guide to PHARMACOLOGY with ~900 ligand-target relationships captured via expert curation of journal papers. Moving up in scale, we evaluated ChEMBL at 1.8 million compounds with 1.1 million assay descriptions and 7,000 targets. With yet another jump we could search the patent corpus with 18 million extracted compounds in SureChEMBL. We explored PubChem, which integrates these three with over 500 other sources linked to 96 million compounds, BioAssay results and connectivity into the NCBI Entrez system. The final jump in scale for document-to-chemistry navigation was represented by SciFinder with 155 million structures. On the target side, 360-exploration needs to encompass literature, structure, genetic variation, splicing, interactions, and disease pathways. From their UniProt links, both GtoPdb and ChEMBL provide these entry points. Navigating genetic association data in support of target validation was enabled by the OpenTargets portal and the GWAS Catalog. We also found servers that could produce prediction scores from chemical structures for a range of features important for de-risking development.
Conclusion:
This work scoped out initial resource choices for the INS360. We propose that not only ADEV operations but essentially any pharmacology research team has much to gain from this approach and many potential pitfalls can consequently be avoided when approaching key checkpoints, such as preparing a publication. However, support may be needed for both institutions and teams to get the best out of these complex and feature-rich databases.
[1] Southan C, (2019) Towards Academic Drug Development Guidelines, ChemRxiv pre-print no. 8869574
Will the correct BACE ORFs please stand up?Chris Southan
BACE1 and BACE2 are protease targets for Alzheimer's and diabetes, respectively but their validation is now questioned
Phylogenetic analysis can add functional insights
This came up against two key problems
A surprising prevalence of incorrect protein sequences predicted from genomes
Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
Key phylogenetic representative genomes are languishing in an unfinished state
Some options for amelioration of these problems will be described
An update on the evolution of these enzymes will be shown
Look for new and potentially useful human 5HT2A-directed small-molecule chemistry surfaced since the last meeting. Check for compounds against 5HT2A as the primary target but also combined inhibitors. Poll round the key databases, literature and patents. Searching challenges arise from synonym soup, complex cross-reactivities (see PMID 29679900), in vitro data gaps and in vivo polypharmacology.
Quality and noise in big chemistry databasesChris Southan
Presented at Aug 2019 ACS by Antony Williams. Abstract: The internet has changed the way we access chemistry data as well as providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of some of the noise in the larger databases, the value becomes highly dependent on the specific applications. An example includes using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem.
Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes including MeSH and large scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria
Looking at chemistry - protein - papers connectivity in ELIXIRChris Southan
This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It is the summary of a blog post, https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html, that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources.
Mining Drug Targets, Structures and Activity Data
1. Mining Drug Targets, Structures and
Activity Data Using Open Full-Text
Patent Sources and Web Tools
Christopher Southan
ChrisDS Consulting, Göteborg, Sweden,
Prepared for BioIT, Boston, April 2012,
Track 11, Open Source Solutions, Wednesday, 13:45
3. Key Relationships
Extractable from Patents and Papers
MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA
PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY
YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ
RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP
NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD
SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV
GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ
DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA
ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM
GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS
TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT
AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL
FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK
Document > Assay > Result > Compound > Target
2011 PMID 21569515
2010 doi:10.1007/978-3-642-15120-0_9
Important ”bag of targets” exceptions (e.g. bacterial/parasite whole cells)
4. The Good News: Patent Mining Utility
• Novel bioactive chemical structures related to drug discovery exceeding those in
journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
• ~ 70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~ 1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels
• IP exploitable for Neglected Tropical Disease research becoming ”open”.
5. The Bad News: Patent Mining Can be Tough
• High-specificity retrieval of relevant documents difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
– Qualitative or binned assay results
– Structure-to-data links non-obvious, patchy or absent
– Less than 50% of titles include target names
– The ”hiding the lead and core structures” game
– Blunderbuss disease and use exemplifications
– Tense ambiguity (i.e. ”could be” vs. ”was” done)
• Quality judgments difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise
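The kind-code redundancy noted above can be reduced with a simple normalisation step before any counting or triage. This is a minimal sketch, assuming publication numbers follow the common country code + serial number + kind code pattern (e.g. EP2391601A1). The function names are hypothetical, and grouping true patent families or equivalents would still need a family lookup, which this does not attempt.

```python
import re
from collections import OrderedDict

# Assumed pattern: two-letter country code, serial digits, optional kind code
# (one letter plus an optional digit, e.g. A1, B2).
_PUB_RE = re.compile(r"^([A-Z]{2})\s*(\d+)\s*([A-Z]\d?)?$")


def strip_kind_code(pub_number):
    """Return country code + serial number, or None if the string does not parse."""
    cleaned = pub_number.strip().upper().replace("-", "").replace("/", "")
    match = _PUB_RE.match(cleaned)
    if match is None:
        return None
    return match.group(1) + match.group(2)


def collapse_kind_codes(pub_numbers):
    """Group publication numbers that differ only by kind code."""
    groups = OrderedDict()
    for number in pub_numbers:
        key = strip_kind_code(number)
        if key is not None:
            groups.setdefault(key, []).append(number)
    return groups


hits = ["EP2391601A1", "EP 2391601 B1", "US8030000B2"]
grouped = collapse_kind_codes(hits)
```

Here `grouped` holds two keys, so three retrieved documents reduce to two distinct publications to read.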
6. Reasons for Rolling-your-own Patent
Chemistry and Data Extraction
• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic
enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need
comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL,
journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways
complementary to commercial databases
7. Open Sources and Tools Overview
• Searching metadata, abstracts and text
– Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
– Open full-text: FreePatentsOnline, Google patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up
(those below not included in this presentation but relevant)
• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al,
SCRIPDB, Juristica group
(n.b. Google should give urls for all these source and tool names)
9. PubChem Patent-derived Content ~6 million
• ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI
pharmaceutical patents plus some journal extractions
• ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~ 3.5 million of these are Lipinski-ROF compliant
• ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~ 70% of these are Lipinski-ROF compliant
• ~ 90% of these have assay data
• ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
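The ”Lipinski-ROF compliant” counts above rest on four property thresholds. This is a minimal sketch of the standard rule-of-five check, assuming the descriptors have already been calculated elsewhere. The class and field names are hypothetical, and the slide does not say whether one violation was tolerated, so the allowed number is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one structure (field names are illustrative)."""
    mol_weight: float
    logp: float
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_ro5_compliant(d, max_violations=0):
    """Assumption: strict compliance by default; pass 1 for the lenient variant."""
    return lipinski_violations(d) <= max_violations


small = Descriptors(mol_weight=180.2, logp=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = Descriptors(mol_weight=800.0, logp=6.5, h_bond_donors=2, h_bond_acceptors=12)
```

Applied across a downloaded structure set with its property columns, the same function gives the kind of compliant fraction quoted in the bullets.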
15. Synonym Recall
• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract ”Beta secretase” = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
Memapsin = 1383
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
Memapsin AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
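The recall gain above comes from OR-ing the synonyms before applying the AND filter. This is a minimal sketch of assembling such a query string, assuming a generic Boolean syntax with quoted phrases. Field tags and operator precedence differ between portals, so the function name and output format are illustrative only.

```python
def build_synonym_query(synonyms, required=None):
    """Join synonyms with OR, optionally AND-ing one required term."""

    def quote(term):
        # Multi-word terms are quoted so they are searched as phrases.
        return f'"{term}"' if " " in term else term

    clause = " OR ".join(quote(s) for s in synonyms)
    if required:
        # Explicit brackets avoid relying on a portal's AND/OR precedence.
        return f"({clause}) AND {quote(required)}"
    return clause


bace_terms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
query = build_synonym_query(bace_terms, required="inhibitors")
```

The bracketed form makes the intent of the slide's last query unambiguous: the synonym union (1383 hits) is narrowed by the extra term (841 hits), rather than the AND binding only to the final synonym.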
19. IUPAC-to-structure: OPSIN
• Installable application
• Also chemical dictionary conversions
Result: the Example 31 structure is a 24 nM BACE1 inhibitor
20. Image-to-structure: OSRA
• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs
22. Structure Search in PubChem
SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)
Often see stereo differences to the Derwent entry in PubChem
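The slide shows the search being run through the PubChem web interface. For repeated lookups the same SMILES can, as an assumption on my part rather than something from the talk, be sent to PubChem's PUG REST service. This is a minimal sketch that only builds the request URLs. The endpoint paths follow my reading of the PUG REST documentation and should be checked before use, and SMILES containing "/" may need to be sent by POST instead.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles, identity_type="same_connectivity"):
    """URL returning CIDs that match the query structure.

    'same_connectivity' ignores stereo, which helps when the extracted
    structure and the deposited entry differ only in stereochemistry.
    """
    encoded = quote(smiles, safe="")
    return (
        f"{PUG_REST}/compound/fastidentity/smiles/{encoded}"
        f"/cids/JSON?identity_type={identity_type}"
    )


def similarity_search_url(smiles, threshold=95):
    """URL returning CIDs at or above a 2D Tanimoto threshold (percent)."""
    encoded = quote(smiles, safe="")
    return (
        f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{encoded}"
        f"/cids/JSON?Threshold={threshold}"
    )


url = identity_search_url("CC(=O)O")
```

Keeping URL construction separate from the network call makes it easy to log exactly what was asked for each extracted structure.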
23. PubChem Similarity ”Walking”
• 2D and 3D different results
• Can do multiple steps
• Can ”read” CID history
• Possible to ”walk” between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
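The multi-step ”walking” above can be thought of as a breadth-first expansion over similarity neighbours. This is a minimal sketch under that assumption. The `neighbours` callable is hypothetical and stands in for whatever similarity search is used at each step, and the toy graph is made-up data.

```python
from collections import deque


def similarity_walk(start, neighbours, max_steps=2):
    """Return {compound_id: steps_from_start} for everything reached."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        depth = seen[current]
        if depth == max_steps:
            continue  # do not expand beyond the step limit
        for nxt in neighbours(current):
            if nxt not in seen:
                seen[nxt] = depth + 1
                queue.append(nxt)
    return seen


# Toy neighbour table in place of real similarity hits.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
reached = similarity_walk("A", lambda cid: toy_graph.get(cid, []), max_steps=2)
```

Recording the step count for each compound makes it clear which analogues were direct neighbours of the query and which were only reached through an intermediate.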
25. SureChemOpen: Patent Retrieval
• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not
bulk export)
26. SureChemOpen, WIPO, OPSIN and PubChem
Result: 1 nM (?) BACE2 inhibitor
with assay and synthesis details.
27. SureChemOpen: Structure > Patent
Direct answers to: ”which patents contain compounds similar to my query”
and ”show me all the compounds in these patents”
30. Espacenet EP2391601 > ChemAxon Chemicalize.org
• Description URL from Espacenet pasted into Chemicalize.org
• Most of 74 examples converted
• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no exact match
• Claims section was Markush description so no relevant structures converted
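The 95% Tanimoto figure above is a fingerprint-overlap score. This is a minimal sketch of the coefficient computed on sets of "on" bits. PubChem calculates it on its own substructure fingerprints, whereas the bit sets here are made-up toy values.

```python
def tanimoto(bits_a, bits_b):
    """Shared on-bits divided by the total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


query_bits = {1, 2, 3, 4}
analogue_bits = {2, 3, 4, 5}
score = tanimoto(query_bits, analogue_bits)  # 3 shared / 5 distinct = 0.6
```

A hit list filtered at 0.95 therefore contains only close analogues, which is why an example can have several such neighbours and still no exact match.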
31. EP2391601 > Chemicalize > PubChem
Chemicalize Similarity listing PubChem Tanimoto sub-cluster
• EP2391601 description text > Chemicalize SDF download > PubChem
Structure Search upload = 311 structures
• Of these 206 have PubChem exact matches
• Of these 176 have Thomson Pharma matches
• The example cluster from the Thomson/Derwent extraction is ~15
• The example cluster from Chemicalize is ~ 90
• Ipso facto Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin
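The counts above can be unpacked with simple subtraction, assuming each ”of these” refers to a subset of the line before it. This is a minimal sketch of that arithmetic, and the function and key names are hypothetical.

```python
def breakdown(extracted, pubchem_exact, thomson_matches):
    """Split the extracted set using the nested counts from the slide."""
    return {
        "not_in_pubchem": extracted - pubchem_exact,
        "in_pubchem_not_thomson": pubchem_exact - thomson_matches,
        "in_thomson": thomson_matches,
    }


counts = breakdown(extracted=311, pubchem_exact=206, thomson_matches=176)
# counts["not_in_pubchem"] is 105 and counts["in_pubchem_not_thomson"] is 30

# Cluster sizes are approximate on the slide (~90 vs ~15).
cluster_gain = 90 - 15  # 75, consistent with "at least 70" novel structures
```

The 105 structures without an exact PubChem match are candidates for novelty, although some may be name-to-structure conversion artefacts and would need checking against the original document.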
33. Tables and Recalcitrant IUPACs
PDF > Find tables > Snip image > Online OCR > WordPad > Chemicalize / OPSIN / OSRA
• Iterative fixing of OCR errors (e.g. 1 vs l)
• Cross-check Mw in the document
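The iterative OCR fixing above can be partly mechanised by generating candidate spellings to retry in a name-to-structure converter. This is a minimal sketch, assuming single-character swaps are enough for most failures. Only the 1/l pair comes from the slide, the 0/O pair is my own addition, and the function names and tolerance are hypothetical.

```python
# Character pairs that OCR commonly confuses (1/l from the slide; 0/O assumed).
CONFUSABLE = {"1": "l", "l": "1", "0": "O", "O": "0"}


def single_swap_variants(name):
    """Return every string that differs from `name` by one confusable swap."""
    variants = []
    for i, ch in enumerate(name):
        swap = CONFUSABLE.get(ch)
        if swap is not None:
            variants.append(name[:i] + swap + name[i + 1:])
    return variants


def mw_matches(calculated, reported, tolerance=0.5):
    """Cross-check a converted structure's weight against the document value."""
    return abs(calculated - reported) <= tolerance


candidates = single_swap_variants("2-methy1")
```

Each candidate that converts can then be kept or rejected by comparing its calculated molecular weight with the Mw printed in the patent table.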
34. Utopia Mark-up of Patent Introduction
Bioentity mark-up (green) via EMBL Reflect with rich call-out options
35. Tips for Joining Everything up
• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~ 0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things
(e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki
pages, blog posts and MeSH IUPACs).
• Check PubChem ”same connectivity” for tautomer forms in different CIDs.
• Check PubChem ”similar” compounds for analogues even if you cannot track
back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF
• You can ”walk” between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.
36. Conclusions
• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity is reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• Can cherry pick examples by potency or collate whole series
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space
and SAR data than papers.
• You can ”walk” between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with
more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data
from patents will expedite medicinal chemistry, tropical disease research
and chemical biology.