The document describes advanced grammars for named entity recognition using LeadMine text mining software. It discusses how LeadMine uses CaffeineFix technology to specify dictionaries and regular expressions to match entities. It provides examples of entity types recognized by LeadMine such as chemicals, proteins, diseases, and reactions. It also describes how LeadMine generates plural forms and normalizes entities. The document outlines some complex grammars used, including grammars for chemicals, numbers, dates, and more.
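As a rough illustration of dictionary-driven matching with plural generation and normalization, the sketch below uses Python's standard `re` module. It is not LeadMine's or CaffeineFix's API; the dictionary contents, the naive "add an s" plural rule, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical toy dictionary: surface form -> (entity type, normalized name).
DICTIONARY = {
    "aspirin": ("chemical", "acetylsalicylic acid"),
    "kinase": ("protein", "kinase"),
    "asthma": ("disease", "asthma"),
}

def with_plurals(dictionary):
    """Add a naive plural ('s' suffix) for every dictionary term."""
    expanded = dict(dictionary)
    for term, value in dictionary.items():
        expanded.setdefault(term + "s", value)
    return expanded

def find_entities(text, dictionary):
    """Return (start, end, matched text, type, normalized name) tuples."""
    terms = with_plurals(dictionary)
    # Longest terms first so the regex prefers the longest alternative.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)
    results = []
    for m in pattern.finditer(text):
        entity_type, normalized = terms[m.group(0).lower()]
        results.append((m.start(), m.end(), m.group(0), entity_type, normalized))
    return results

hits = find_entities("Aspirin inhibits several kinases.", DICTIONARY)
for hit in hits:
    print(hit)
```

Here the plural "kinases" is matched and mapped back to the singular dictionary entry, and "Aspirin" is normalized to a systematic-style name. A real system would need far richer morphology and grammar than this single suffix rule.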
Automatic extraction of bioactivity data from patents (NextMove Software)
Structure-Activity Relationship (SAR) analysis is important for the development of novel small molecule drugs. Such analyses rely on bioactivity data from either in-house or published sources, with data from the latter currently being extracted manually at great expense.
Here we report on an entirely automated system for extracting bioactivity data that we are developing, initially targeting US patents. The system relies on combining the results of many technologies: chemical entity recognition, chemical name to structure, table processing, chemical compound number resolution, chemical sketch interpretation, and even in some cases reconstitution of molecules from a generic core and R-group definitions. Where possible, the target and the assay description are also identified.
To assess the precision/recall of our system we compare our results with those manually extracted from US patents by BindingDB. We also compare the data we’ve extracted with the data present in ChEMBL from journal articles, to analyse whether there are significant differences between activity data in journal articles and patents e.g. differences in targets of interest.
Chemical sketches are ubiquitous in the published literature. Unlike connection table formats that precisely capture chemistry for database entry, the primary purpose of a sketch format is to produce a high quality image for conveying information to other chemists. Chemical sketches can be presented in a variety of chemistry-specific formats as well as image formats, with the latter presenting additional challenges to interpretation. Since 2001 the United States Patent Office has redrawn all chemical sketches in ChemDraw, yielding to date over 25 million freely available CDX files.
Correctly extracting chemistry from these files required tackling many areas, including disambiguation of ambiguous labels (e.g. B, D, P, V, Ac), interpretation of labels (e.g. COOH), interpretation of free text overlaid on the structure (e.g. brackets for a repeated group) and assignment of reaction role.
We report our work on extracting chemical structures and reactions from sketches and demonstrate the improvements in quality that tackling the intricacies of sketches provides over more naïve approaches. One notable improvement is the ability to better distinguish between specific compounds, fragments, generic structures and reaction schemes. We compare the chemistry extracted from sketches with the results from text-mining, and show that a large amount of chemistry is only available from one medium or the other. We also explore cases where the combination of the output from sketches and text enables extraction of data that either method in isolation could not e.g. Markush structures, reactions where the product is given as a sketch.
All US patents from 1976 onwards are freely available in computer-readable formats providing a large corpus for chemical text mining. Other patent offices are increasingly also offering their back-catalogues as XML, allowing chemical text mining to be performed in the same way as for recent US patents.
We investigate how much chemistry is found in non-US patents (compared to US patents) and, where the chemistry is present in publications from multiple patent offices, how long the delays between these publications were.
We show that non-US patents can be text mined for a large number of chemical reactions and analyse the overlap with reactions from US patents. Finally we use all the extracted chemical reactions to explore whether models for predicting reaction yield may be built from features such as the reaction type and its reaction conditions (as text mined from the patent text).
CINF 18: Wikipedia and Wiktionary as resources for chemical text mining (NextMove Software)
The projects hosted by the Wikimedia Foundation are an unprecedented resource for chemists, information professionals and natural language processing researchers in the annotation of pharmaceutically-relevant information in documents. A widely publicized example of the use of Wikipedia in artificial intelligence research is the participation of IBM's Watson in the Jeopardy! quiz show. In this presentation, we present several chemical research applications of Wikipedia-derived data sets, including named-entity dictionaries and synonym lists for linking ontologies. The global community of volunteer contributors to these projects deserves continual recognition for the invaluable resource they enable.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o... (NextMove Software)
The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereo-descriptor that is invariant to atom order and layout to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provides CIP labels, showing where the implementations disagree. Providing an IUPAC-verified, free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy of the exchange of written chemical information for all.
CINF 170: Regioselectivity: An application of expert systems and ontologies t... (NextMove Software)
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts: a technical part describing the recent algorithmic and methodological improvements to the namerxn software, including some of the more challenging of the 1000+ reactions it currently identifies, and a scientific part that investigates the regioselective preferences of some of these reactions.
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation (Muhammad Saleem)
Efficient federated query processing is of significant importance to tame the large amount of data available on the Web of Data. Previous works have focused on generating optimized query execution plans for fast result retrieval. However, devising source selection approaches beyond triple pattern-wise source selection has not received much attention. This work presents HiBISCuS, a novel hypergraph-based source selection approach to federated SPARQL querying. Our approach can be directly combined with existing SPARQL query federation engines to achieve the same recall while querying fewer data sources. We extend three well-known SPARQL query federation engines with HiBISCuS and compare our extensions with the original approaches on FedBench. Our evaluation shows that HiBISCuS can efficiently reduce the total number of sources selected without losing recall. Moreover, our approach significantly reduces the execution time of the selected engines on most of the benchmark queries.
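For context, the baseline that the abstract calls triple pattern-wise source selection can be pictured with a small sketch. This is not the HiBISCuS algorithm (no hypergraph is built here); it only shows the simpler idea of choosing, for each triple pattern independently, the endpoints that know the pattern's predicate. The endpoint names, predicates, and the `CAPABILITIES` index are hypothetical.

```python
# Hypothetical index: endpoint name -> predicates it can answer.
CAPABILITIES = {
    "drugbank": {"db:drugName", "db:target"},
    "kegg": {"kegg:pathway", "db:target"},
    "dbpedia": {"rdfs:label"},
}

def select_sources(triple_patterns, capabilities):
    """Map each (subject, predicate, object) pattern to its relevant sources."""
    selection = {}
    for pattern in triple_patterns:
        predicate = pattern[1]
        if predicate.startswith("?"):
            # Unbound predicate: every source may contribute.
            relevant = sorted(capabilities)
        else:
            relevant = sorted(
                name for name, preds in capabilities.items() if predicate in preds
            )
        selection[pattern] = relevant
    return selection

query = [("?drug", "db:drugName", "?name"), ("?drug", "db:target", "?t")]
selected = select_sources(query, CAPABILITIES)
for pattern, sources in selected.items():
    print(pattern, "->", sources)
```

Because each pattern is considered in isolation, the second pattern keeps both `drugbank` and `kegg` even if only one of them could ever join with the first pattern. Reducing that kind of over-selection without losing recall is the gap the abstract says HiBISCuS addresses.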
FedX - Optimization Techniques for Federated Query Processing on Linked Data (aschwarte)
The final slides of our talk about FedX at the 10th International Semantic Web Conference in Bonn. For details about FedX see http://www.fluidops.com/fedx/
From: Linked Data: what cataloguers need to know. A CIG event. 25 November 2013, Birmingham. #cigld
http://www.cilip.org.uk/cataloguing-and-indexing-group/events/linked-data-what-cataloguers-need-know-cig-event
Accompanying write-up from Catalogue & Index 174: http://discovery.ucl.ac.uk/1449458/
As the oldest abstracting service, Chemisches Zentralblatt generated detailed abstracts of scientific research from 1840-1969. CAS, in partnership with Iconic Translation Machines (ITM), has made this information accessible in ChemZent™, the first and only English searchable translation of Chemisches Zentralblatt.
After its introduction in 1840, Chemisches Zentralblatt quickly grew to be an invaluable resource for chemists. While the content is freely accessible via various online platforms, locating specific information in the volumes of Chemisches Zentralblatt can be challenging. To find a topic or author, the user needs to know the year and volume of interest. In addition, the content was previously only available in German.
Leveraging ITM technology and extensive CAS expertise in processing scientific literature, the two companies teamed up to develop ChemZent. This historical content further enhances the most comprehensive and authoritative source of references, substances and reactions in chemistry and related sciences accessible in SciFinder®. Three million English-translated abstracts are now searchable in SciFinder, making this rich source of content accessible to today’s researchers.
Tackling the difficult areas of chemical entity extraction: Misspelt chemical... (dan2097)
Extracting the structures of small molecules from unstructured text is now a mature field; however, there still remain areas that present considerable difficulty or have until this point remained unexplored.
One such area is identification of chemical names with misspellings or errors introduced by optical character recognition. The approach we have taken employs a formal grammar describing the syntax of a systematic name. To provide coverage over the vast majority of organic nomenclature, including carbohydrates, amino acids and natural products, we have developed a new way of representing the grammar so as to allow an order of magnitude more states than previous efforts (1) whilst simultaneously reducing memory consumption. To efficiently perform spelling correction against this grammar, we describe a heuristic spelling correction algorithm.
Another area that remains underexplored is the identification and resolution of chemical line formulae, in which we also include domain-specific line formulae such as those used to describe oligosaccharides and peptides. We describe the recognition and resolution of these often overlooked chemical entities.
We also show how one can identify entities such as journal and patent references, which can aid in the navigation of semantically enhanced documents.
(1) Sayle, R.; Xie, P. H.; Muresan, S. Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction. J. Chem. Inf. Model. 2011, 52, 51–62.
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he... (Dr. Haxel Consult)
Makarand Waikar (ACS International, India)
With the rapid increase in patent and non-patent literature, there is always a chance of missing an important patent because of the complexity with which patents are reported. This is critical in prior art, Freedom-to-Operate and infringement searches.
In this session, we will showcase case studies on how value added databases will help identify evasive information such as incompletely defined substances, chemically modified bio-sequences, prophetic substances, Markush structures, numeric properties discussed in Patents and NPL.
Case studies will be presented using STN®, an online platform with over 100 techno-scientific databases. Over 95% of the world’s patent applications are reviewed by patent offices that use STN.
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto... (Seth Grimes)
Presentation by Nathan Schneider, Assistant Professor of Linguistics and Computer Science at Georgetown University, to the Washington DC Natural Language Processing meetup, October 14, 2019 (https://www.meetup.com/DC-NLP/events/264894589/).
The Ins and Outs of Preposition Semantics: Challenges in Comprehensive Corpu... (Seth Grimes)
Presentation by Nathan Schneider, Georgetown University, to the Washington DC Natural Language Processing meetup, October 14, 2019, https://www.meetup.com/DC-NLP/events/264894589/.
In grammars we trust: LeadMine, a knowledge driven solution (NextMove Software)
We present a system employing large grammars and dictionaries to recognize a broad range of chemical entities. The system utilizes these resources to identify chemical entities without an explicit tokenization step. To allow recognition of terms slightly outside the coverage of these resources we employ spelling correction, entity extension, and merging of adjacent entities. Recall is enhanced by the use of abbreviation detection and precision is enhanced by the removal of abbreviations of non-entities. With the use of training data to produce further dictionaries of terms to recognize/ignore our system achieved 86.2% precision and 85.0% recall on an unused development set.
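The two reported figures can be combined into a single F1 score, the harmonic mean of precision and recall. The abstract itself does not state an F1 value, so the number computed below is derived here from the reported precision and recall rather than quoted from the source.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the unused development set.
f1 = f1_score(0.862, 0.850)
print(f"F1 = {f1:.3f}")  # prints F1 = 0.856
```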
In this presentation, Manoj K. talks about regular expressions. He explains how regular expressions are used and covers all of the codes and what they are used for. The goal is to teach you how to use regular expressions once and for all.
CINF 35: Structure searching for patent information: The need for speed (NextMove Software)
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases is also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
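To make the idea of a similarity search concrete, here is a minimal sketch of a Tanimoto search over bit fingerprints. The abstract does not say which similarity measure or data layout is used, so the Tanimoto coefficient, the integer-as-bitset fingerprints, and the toy database are all assumptions; the linear scan shown here is the naive approach, not the compiled techniques the talk describes.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")

def tanimoto(fp_a, fp_b):
    """Shared bits divided by total distinct bits of two fingerprints."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        return 0.0
    return popcount(fp_a & fp_b) / union

def search(query_fp, database, threshold):
    """Return (name, score) pairs at or above threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])

# Hypothetical 4-bit fingerprints; real fingerprints are far longer.
DATABASE = {"mol_a": 0b1011, "mol_b": 0b0011, "mol_c": 0b0100}
results = search(0b0011, DATABASE, threshold=0.5)
print(results)
```

A scan like this touches every record, so its cost grows linearly with database size, which is the scaling problem the abstract raises.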
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution (NextMove Software)
Presented by Roger Sayle at the 11th International Conference on Chemical Structures (ICCS) 2018.
Exponential database growth will always be a technical challenge but evolutionary strategies offer to hold off the inevitable in the short term and revolutionary strategies promise a longer-term solution.
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases (NextMove Software)
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reaction details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leaves in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
Building on Sand: Standard InChIs on non-standard molfiles (NextMove Software)
The molfile serves as a de facto standard for chemical information exchange. It is perhaps the most widely supported format, with its core syntax being easy to understand, parse, and generate. Beyond the core syntax, more advanced features such as sgroups and enhanced stereochemistry are rarely supported, often only being partially implemented and used. Additionally, several vendors, toolkits, and service providers have added extended syntax to their molfiles to solve particular corner cases or representation problems. This talk will provide a brief summary of the less widely supported features of the molfile, including sgroups and enhanced stereochemistry. It will also survey how extensions can be, and have been, implemented and which extensions exist "in the wild". Special attention will be paid to the capture of coordination bonds for extending the InChI to handle organometallics.
Advanced grammars for state-of-the-art named entity recognition (NER)
1. Advanced grammars for state-of-the-art named entity recognition (NER)
Roger Sayle and Daniel Lowe
NextMove Software, Cambridge, UK
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
2. Overview
• NextMove Software’s LeadMine text-mining engine
internally uses “CaffeineFix” (.cfx) technology for
specifying and efficiently matching important terms.
• In addition to case-sensitive and case-insensitive
term matching CaffeineFix/LeadMine also support
spelling correction (fuzzy matching).
• The most common usage is to simply compile
dictionaries into binary form for fast matching.
• Advanced users can specify “regular expressions”.
• In this presentation, we go beyond REGEXPs.
3. LeadMine v2 entity types
1. Chemicals
    1.1 Dictionary Names
        1.1.1 Abbreviations
        1.1.2 CAS RN Numbers
        1.1.3 Registry Numbers
    1.2 Systematic Names
        1.2.1 Functional Groups
        1.2.2 Elements
        1.2.3 Acids
        1.2.4 SMILES
        1.2.5 InChIs
    1.3 Generic Classes
    1.4 Polymers
    1.5 Formulae
2. Biomolecules
    2.1 Proteins
        2.1.1 Targets
        2.1.2 P450s
    2.2 Genes
    2.3 E.C. Numbers
    2.4 PDB Codes
3. Anatomy
    3.1 Cell Types
    3.2 Cytogenetic Loci
4. Cell Lines
5. Diseases
6. Symptoms
7. Mechanisms of Action
8. Species/Organisms
9. Companies
10. Named Reactions
11. Regions
12. Languages/Possessives
4. Named entity normal forms
• Chemicals: SMILES and/or InChI
• Proteins: UniProt
• Genes: Entrez GeneID/HGNC
• Targets: ChEMBL
• Species/Organism: NCBI Taxonomy ID
• Diseases/Symptoms: ICD-10
• Named Reactions: RXNO
• Mechanism of Action: ATC
• Many of these can also use NLM MeSH Terms.
5. Example entity dictionary as DAG
• Nitrogen containing heterocycles as minimal DFA:
– Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine,
Pyrimidine, Pyrazine
• CaffeineFix supports (very large) user dictionaries.
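A minimal sketch of how such a dictionary collapses into a minimal DFA, assuming a plain prefix tree whose equivalent states are merged bottom-up. The function names are illustrative and this is not CaffeineFix's own construction.

```python
def build_trie(words):
    # Each state records whether it is accepting and its outgoing transitions.
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_states(node):
    return 1 + sum(count_trie_states(child) for child in node["next"].values())


def minimise(node, registry):
    # Canonicalise the children first, then look this state up by its signature.
    # States with the same signature accept the same suffixes and are shared.
    signature = (
        node["final"],
        tuple(sorted((ch, minimise(child, registry))
                     for ch, child in node["next"].items())),
    )
    return registry.setdefault(signature, len(registry))


names = ["pyrrole", "pyrazole", "imidazole", "pyridine",
         "pyridazine", "pyrimidine", "pyrazine"]
trie = build_trie(names)
registry = {}
minimise(trie, registry)
print(count_trie_states(trie), "trie states ->", len(registry), "DFA states")
```

Shared suffixes such as "azole" and "ine" are stored once, which is why the DFA needs fewer states than the prefix tree.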
6. OBO ontologies as dictionaries
• In addition to regular TSV (tab-separated value) files
for storing dictionaries, LeadMine’s obo2dict also
supports OBO ontologies, a convenient method for
tracking synonyms and foreign language forms.
[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
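As a rough illustration of the idea, the sketch below turns an OBO stanza into (term, identifier) dictionary rows. The function `obo_to_rows` is hypothetical and handles only the `id`, `name` and `synonym` tags shown above; it is not the obo2dict implementation.

```python
def obo_to_rows(obo_text):
    rows, term_id = [], None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            term_id = None                      # start of a new stanza
        elif line.startswith("id: "):
            term_id = line[4:]
        elif line.startswith("name: ") and term_id:
            rows.append((line[6:], term_id))
        elif line.startswith("synonym: ") and term_id:
            # The synonym text is the first double-quoted string on the line.
            rows.append((line.split('"')[1], term_id))
    return rows


example = '''[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
'''

for term, identifier in obo_to_rows(example):
    print(f"{term}\t{identifier}")
```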
7. Plural form generation
• LeadMine’s pluralize automatically generates English
plural forms from singular dictionary entries.
diels-alder couplings RXNO:0000006
diels-alder cycloadditions RXNO:0000006
diels-alder reactions RXNO:0000006
acridine syntheses RXNO:0000518
acyclic beckmann rearrangements RXNO:0000564
acyloin condensations RXNO:0000085
olefin metatheses RXNO:0000280
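The plural forms above can be approximated with a few suffix rules applied to the last word of each entry. The sketch below is an assumption about the kind of rules involved, not LeadMine's pluralize, and covers only regular English patterns.

```python
VOWELS = "aeiou"


def pluralise(term):
    # Only the final word of a multi-word term is inflected.
    head, sep, last = term.rpartition(" ")
    if last.endswith("is"):                      # synthesis -> syntheses
        last = last[:-2] + "es"
    elif last.endswith(("s", "x", "z", "ch", "sh")):
        last += "es"
    elif len(last) > 1 and last.endswith("y") and last[-2] not in VOWELS:
        last = last[:-1] + "ies"
    else:
        last += "s"
    return head + sep + last


for singular in ["diels-alder reaction", "acridine synthesis", "olefin metathesis"]:
    print(pluralise(singular))
```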
8. Unusual entities
• ISBN, URL, PubMed
• Roman Numerals, Date
• ColorState, Zip codes
• Katakana
• HELM, InChI, SMILES, v2000
• Credit Card Numbers
• Region
• Person
• Disease
• Journal
• SQL statement
• Solvent Mixture
• Hearst Patterns
• Unknown acid
• Unknown antibody
• Unknown disease
• Unknown INN
• Ordinal numbers
• Cardinal numbers
• de, es, fr, it, sv
9. Grammars within grammars
• LeadMine grammars are specified constructively,
effectively producing even more entity types.
• Region = City + Continent + Country + Island + Lake +
Mountain + Ocean + River + Sea + State/Province +
OtherFeature + OtherRegion.
• City = CityAlbania + CityAndorra + CityAustralia + CityAustria +
… + CityUS + …
• CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ +
CityUS_CA + CityUS_CO + …
10. Pharma registry numbers
• CaffeineFix v2.0 supports sets of user-defined
regular expressions as dictionaries.
• One application is specifying the format of
registry numbers, such as GSK204454A
• Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | …
• Number: d{3-7}
• Suffix: (“.” d) | [“a” .. “z”]
• RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
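The registry number grammar can be sketched as an ordinary Python regular expression. This is an illustration only: the prefix list is the subset shown above, `d{3-7}` is read as three to seven digits, and the suffix letter is assumed to be case-insensitive so that GSK204454A matches.

```python
import re

PREFIXES = ["A", "AZ", "BMY", "GSK", "LY"]       # illustrative subset only
# Longest prefixes first so that "AZ" is tried before "A".
PREFIX = "|".join(sorted(map(re.escape, PREFIXES), key=len, reverse=True))

# Prefix, optional space or hyphen, 3 to 7 digits, optional suffix.
REGISTRY = re.compile(rf"\b(?:{PREFIX})[ -]?\d{{3,7}}(?:\.\d|[A-Za-z])?\b")

text = "Compounds GSK204454A and AZ-1234 were tested."
print(REGISTRY.findall(text))
```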
11. Cardinal numbers
• English
– One, ten, two thousand and forty eight, ten million
• German
– Eins, Zehn, Hundert, Million, Viermillion
– Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig
• French
– Trois cents, un mille, mille neuf cent quatre-vingts dix-huit
• Italian
– Uno, due, trenta, ottocentosessantamila settecentoottantanove
• Swedish
– en miljon trehundrasjuttiåtta tusen niohundrasjuttiett
12. CAS registry number grammar
• Two to seven digits, followed by a hyphen, two digits,
a hyphen and a final check digit
– e.g. 7732-18-5
• Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
13. CAS check digit calculation
• More generally CaffeineFix’s finite state machines
can do limited processing...
• The final check digit of a CAS number is calculated as
a weighted digit sum modulo 10.
• Working from right to left, the last digit before the check
digit is multiplied by 1, the digit before it by 2, the next
by 3, and so on; the sum is taken modulo 10.
• The CAS number for water is 7732-18-5.
• The check digit 5 is calculated as (1×8 + 2×1 + 3×2 +
4×3 + 5×7 + 6×7) mod 10 = 105 mod 10 = 5.
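The check digit arithmetic can be written out directly. The sketch below uses ordinary Python rather than a finite state machine, and the function names are illustrative.

```python
def cas_check_digit(body_digits):
    # Weight 1 for the rightmost digit, 2 for the next, and so on.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body_digits), start=1))
    return total % 10


def is_valid_cas(cas):
    first, second, check = cas.split("-")
    return cas_check_digit(first + second) == int(check)


print(cas_check_digit("773218"))    # digits of 7732-18 without the check digit
print(is_valid_cas("7732-18-5"))
```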
14. FSM for matching CAS check digits
15. FSM for matching CAS check digits
16. CAS number correction example
• 7732-18-8? Did you mean...
– 7732-18-5
– 7732-11-8
– 77328-18-8
– 7733-18-8
– 77342-18-8
– 77392-18-8
– 71732-18-8
– 76732-18-8
– 97732-18-8
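Suggestions of this kind can be reproduced by brute force: generate every string one edit away from the input and keep those that have the CAS shape and a valid check digit. This sketch is a stand-in for the fuzzy matching done by the finite state machine, not the CaffeineFix algorithm, and the shape pattern (two to seven leading digits) is an assumption.

```python
import re

CAS_SHAPE = re.compile(r"[1-9]\d{1,6}-\d\d-\d")   # assumed shape: 2 to 7 leading digits


def is_valid_cas(text):
    if not CAS_SHAPE.fullmatch(text):
        return False
    digits = text.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def suggest(text):
    alphabet = "0123456789-"
    candidates = set()
    for i in range(len(text) + 1):
        for ch in alphabet:
            candidates.add(text[:i] + ch + text[i:])           # insertion
            if i < len(text):
                candidates.add(text[:i] + ch + text[i + 1:])   # substitution
        if i < len(text):
            candidates.add(text[:i] + text[i + 1:])            # deletion
    candidates.discard(text)
    return sorted(c for c in candidates if is_valid_cas(c))


for candidate in suggest("7732-18-8"):
    print(candidate)
```

Every suggestion returned is a single character edit away from the input and passes the check digit test.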
17. Roman numerals
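One useful operator is NonEmpty, which removes the empty string
from the set of valid matches so that at least one character must
match. Each of the four parts of a Roman numeral is optional, so
without it the grammar would also accept the empty string.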
Thousands: M, MM, MMM
Hundreds: C, CC, CCC, CD, D, DC, DCC, DCCC, CM
Tens: X, XX, XXX, XL, L, LX, LXX, LXXX, XC
Units: I, II, III, IV, V, VI, VII, VIII, IX
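In a conventional regular expression, the same grammar is four optional groups in sequence, and the NonEmpty requirement can be approximated with a lookahead. This is a Python sketch, not CaffeineFix syntax.

```python
import re

ROMAN = re.compile(
    r"(?=[MDCLXVI])"            # stands in for NonEmpty: at least one numeral
    r"M{0,3}"                   # thousands: M, MM, MMM
    r"(?:CM|CD|D?C{0,3})"       # hundreds
    r"(?:XC|XL|L?X{0,3})"       # tens
    r"(?:IX|IV|V?I{0,3})"       # units
)


def is_roman(text):
    return ROMAN.fullmatch(text) is not None


print(is_roman("MMXVII"), is_roman(""))
```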
18. Unknown acid
• Another operator allows wildcards with exceptions,
effectively a not operator.
• An unknown acid is “[a-z’-]+ acid” where the first
word excludes:
– Stop words: a, the, and, any, is, in, was, etc.
– Common qualifiers: acceptable, preferred, etc.
– Adjectives: battery, free, inorganic, strong, etc.
– Known acids: acetic, nitric, amino, carboxylic, etc.
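One way to approximate a wildcard with exceptions is to match the wildcard first and then filter against an exclusion set. In the sketch below the exclusion list is only the handful of words quoted above, and the filtering step is an assumption rather than the way the CaffeineFix operator works internally.

```python
import re

EXCLUDED = {
    "a", "the", "and", "any", "is", "in", "was",        # stop words
    "acceptable", "preferred",                          # common qualifiers
    "battery", "free", "inorganic", "strong",           # adjectives
    "acetic", "nitric", "amino", "carboxylic",          # known acids
}

ACID = re.compile(r"\b([a-z'’-]+) acid\b")


def unknown_acids(text):
    return [m.group(0) for m in ACID.finditer(text.lower())
            if m.group(1) not in EXCLUDED]


sentence = "The strong acid was replaced by tranexamic acid and acetic acid."
print(unknown_acids(sentence))
```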
19. Unknown INN
• A variation on this theme allows LeadMine to
recognize novel (recently announced) kinase
inhibitors and antibodies based on the structure of
their INN names.
• An unknown kinase inhibitor is “[a-z]+inib” and an
unknown antibody is “[a-z]+mab” where the words
exclude previously known/reported INN names and
“colliding” English words.
april != captopril, KappaB != rozrolimupab, yuletide != exenatide,
triumvir != zanamivir, etc.
20. Person grammar
• The named person grammar matches:
1. [Salutation] FirstName [Initials] Surname [Suffix]
2. [Salutation] FirstName [Initials] UnknownSurname [Suffix]
3. [Salutation] UnknownFirstName [Initials] Surname [Suffix]
• where
Salutation includes Mr., Mrs., Dr., Sir, His Highness, …
FirstName includes David, John, Sarah, Tom, Angela, …
Surname includes Smith, Jones, Overington, …
UnknownFirstName excludes Big, Lake, The, Outer, etc.
UnknownSurname excludes Avenue, Bridge, Street, etc.
21. List construction operator
• Another frequently used idiom is the set of operators for
constructing comma-separated lists.
• These turn the grammar matching “X” into the
grammar matching things like “X, X, X and X”.
• More specifically:
(X [“,” “ ”? X]* (“ and ” | “ or ” | “ and/or ”))? X
• Another variation of this allows “other”, “similar”
and “related” to precede the final X if the list is non-empty.
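As a sketch, the list operator can be written as a function that wraps the regular expression for X. It reads the expression above as an optional list prefix followed by a final X, which is an assumption about where the parentheses close, and `list_of` is a hypothetical name.

```python
import re


def list_of(x):
    item = f"(?:{x})"
    # Optional prefix "X, X, ... and " followed by the final X.
    return f"(?:{item}(?:, ?{item})*(?: and/or | and | or ))?{item}"


colours = re.compile(list_of("red|green|blue"))
for text in ["red", "red or blue", "red, green and blue"]:
    print(text, bool(colours.fullmatch(text)))
```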
22. Hearst pattern grammars
• An example use of list constructions is in the
recognition of Hearst patterns.
1. X such as Y [“including”, “especially” etc.]
2. Y and other X [“and related”, “or similar” etc.]
3. such X as Y
• where X is a category or classification term
• and Y is a list of exemplified terms.
• Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings
of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
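A minimal sketch of the first pattern ("X such as Y") follows. It assumes single-word terms and a simple comma/“and”/“or” list, whereas a real grammar would use entity dictionaries for X and Y.

```python
import re

ITEM = r"[a-z-]+"                                              # one lower-case word
LIST = rf"(?:{ITEM}(?:, ?{ITEM})*(?: and | or ))?{ITEM}"       # "a, b and c"
SUCH_AS = re.compile(rf"({ITEM}) such as ({LIST})")


def hyponyms(sentence):
    match = SUCH_AS.search(sentence.lower())
    if match is None:
        return []
    category = match.group(1)
    members = re.split(r", ?| and | or ", match.group(2))
    return [(member, category) for member in members]


print(hyponyms("Solvents such as methanol, ethanol and toluene were used."))
```

Each pair states that the first term is an example of the second.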
23. Complex object builder
• An application of the list construction operator is in
our “complex object builder” construction operator.
ComplexObjectBuilder cob;
cob.insert(“red”, “lorry”, “lorries”);
cob.insert(“yellow”, “lorry”, “lorries”);
• Allows matching not only of
“red lorry”, “red lorries”, “yellow lorry” and “yellow lorries”
• But also of…
“red and yellow lorries”, “yellow and red lorries”, etc.
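The behaviour described above can be imitated in Python. The class below is a hypothetical analogue that compiles to a regular expression; only the name `ComplexObjectBuilder` and the three-argument `insert` come from the slide, and the rest is assumed.

```python
import re


class ComplexObjectBuilder:
    def __init__(self):
        self.entries = {}        # (singular, plural) -> set of modifiers

    def insert(self, modifier, singular, plural):
        self.entries.setdefault((singular, plural), set()).add(modifier)

    def pattern(self):
        alternatives = []
        for (singular, plural), modifiers in self.entries.items():
            mod = "(?:" + "|".join(map(re.escape, sorted(modifiers))) + ")"
            mod_list = f"{mod}(?:, ?{mod})*(?: and | or ){mod}"
            # One modifier with either form of the noun.
            alternatives.append(f"{mod} (?:{re.escape(singular)}|{re.escape(plural)})")
            # A list of modifiers shares a single plural noun.
            alternatives.append(f"{mod_list} {re.escape(plural)}")
        return re.compile("|".join(alternatives))


cob = ComplexObjectBuilder()
cob.insert("red", "lorry", "lorries")
cob.insert("yellow", "lorry", "lorries")
lorries = cob.pattern()
print(bool(lorries.fullmatch("red and yellow lorries")))
```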
24. Complex disease examples
• Adenomatous polyps of the colon and rectum.
• Fibroepithelial or epithelial hyperplasias.
• Inherited spinocerebellar ataxia.
• Stage II or stage III colorectal cancer.
• Inherited breast and ovarian cancers.
• Argentinian, Bolivian and Korean haemorrhagic
fevers.
• Dermatitis due to heat, cold, radiation, cosmetics,
fungi and shellfish.
25. Grammars for Safety text mining
• “May cause lung damage if swallowed”
– “may” → “can”, “could”, “may”, “might”, “will”, etc.
– “cause” → “lead to”, “result in”, “trigger”, “bring on”, …
– “lung damage” → “explosion”, “cancer”, “injury”, …
– “if” → “when”, “once”…
– “swallowed” → “heated”, “shaken”, “dried”, “ignited”…
• “Highly toxic”
– “highly” → “very”, “extremely”, “unusually”, “intensely”…
– “toxic” → “explosive”, “carcinogenic”, “poisonous”…
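The first template can be sketched as a sequence of slots, each filled from a list of alternatives. Only the alternatives quoted above are included, and composing them into one regular expression is an illustration rather than the LeadMine grammar format.

```python
import re


def alt(*options):
    return "(?:" + "|".join(re.escape(option) for option in options) + ")"


MODAL = alt("can", "could", "may", "might", "will")
CAUSE = alt("cause", "lead to", "result in", "trigger", "bring on")
EFFECT = alt("lung damage", "explosion", "cancer", "injury")
CONDITION = alt("if", "when", "once")
EVENT = alt("swallowed", "heated", "shaken", "dried", "ignited")

HAZARD = re.compile(f"{MODAL} {CAUSE} {EFFECT} {CONDITION} {EVENT}", re.IGNORECASE)

for phrase in ["May cause lung damage if swallowed",
               "Can result in explosion when heated",
               "May contain nuts"]:
    print(phrase, bool(HAZARD.search(phrase)))
```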
26. Efficient protein variant naming
• CaffeineFix technology can also be applied to naming
peptides and arbitrary protein variants/mutants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
27. DAG representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]
28. Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which contains over
192M amino acids, requires 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
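The naming scheme can be illustrated on the 11 peptides from slide 26. This sketch uses a plain prefix tree (no suffix sharing) and allows substitutions only. The function names and the `max_mutations` limit are assumptions, and the code makes no claim about CaffeineFix's actual search.

```python
PEPTIDES = {
    "CFFQNCPRG": "phenylpressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin",   "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin",     "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin",      "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin",     "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}


def build_trie(named_sequences):
    root = {}
    for sequence, name in named_sequences.items():
        node = root
        for residue in sequence:
            node = node.setdefault(residue, {})
        node["$"] = name                  # "$" marks the end of a named sequence
    return root


def name_variant(trie, query, max_mutations=2):
    best = None                           # (name, mutations) with fewest mutations

    def walk(node, pos, mutations):
        nonlocal best
        if best is not None and len(mutations) >= len(best[1]):
            return                        # cannot improve on the best match so far
        if pos == len(query):
            if "$" in node:
                best = (node["$"], mutations)
            return
        for residue, child in node.items():
            if residue == "$":
                continue
            if residue == query[pos]:
                walk(child, pos + 1, mutations)
            elif len(mutations) < max_mutations:
                # Reference residue, 1-based position, residue seen in the query.
                walk(child, pos + 1, mutations + [f"{residue}{pos + 1}{query[pos]}"])

    walk(trie, 0, [])
    if best is None:
        return None
    name, mutations = best
    return f"[{','.join(mutations)}]{name}" if mutations else name


trie = build_trie(PEPTIDES)
print(name_variant(trie, "CYIQNCPLG"))
print(name_variant(trie, "CFFANCPRG"))
```

Because the walk follows shared prefixes once and abandons branches that already need too many mutations, most of the database is never visited for a given query.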
29. Application to precision medicine
• A more realistic example: the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277 can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.
30. Summary
• LeadMine’s .cfx files can do far more than efficiently
match very large dictionaries of terms.
• Indeed, many of the grammars used at NextMove
Software potentially match an infinite number of
terms.
• Construction of domain specific grammars can be
done in collaboration with LeadMine customers.