This document discusses automatic text simplification in the biomedical domain. It covers detecting difficulties in biomedical texts, acquiring paraphrases for technical terms, and applying these techniques to simplify texts for non-specialized readers. The detection of difficulties involves identifying complex words and sentences using measures of readability, eye tracking analysis, and natural language processing. Acquiring paraphrases involves gathering simpler synonyms and definitions of technical terms from resources like Wikipedia and distributional semantic models. The overall goal is to make health information more understandable for patients and the public.
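As an illustration of the "measures of readability" mentioned above, here is a minimal sketch of one classic measure, the Flesch Reading Ease score (higher means easier to read). The summary does not say which measures the talk uses, so this choice is an assumption. The vowel-group syllable counter and the two example sentences are also hypothetical simplifications, not part of the original material.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# Hypothetical example pair: a technical sentence and a plainer paraphrase.
technical = "The patient presented with myocardial infarction."
plain = "The patient had a heart attack."

print(flesch_reading_ease(technical))
print(flesch_reading_ease(plain))
```

The plainer paraphrase scores higher because its words have fewer syllables. A formula like this only captures surface difficulty, which is presumably why the talk combines readability measures with eye tracking and other NLP signals.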
Classification of prostate cancer pathology reports using natural language pr... - Anjani Dhrangadhariya
A pathology report is written by a pathologist in technical medical language to inform the referring doctor about cancer diagnosis and staging. Its aim is to relay the results of the pathologist's inspection back to the physician or surgeon. It contains information such as the patient's name, gross description, microscopic description, diagnosis, tumor size, and margin, with the remaining information divided into three to four sections such as medical history, diagnosis, summary, and conclusion. In practice, however, the pathology reports are unstructured and lack one or more of the above sections, except for the diagnostic information. On top of that, these reports are handwritten notes on a printed template, signed and stored as digital PDF documents. Converting the PDF documents to machine-readable text files with optical character recognition tools adds further noise to these already unstructured reports. We propose an approach that uses these highly noisy, unstructured pathology reports to automatically extract information about Gleason grading.
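To make the extraction target concrete, the following is a minimal rule-based sketch for pulling Gleason scores (primary pattern + secondary pattern = total) out of report text. It is not the approach proposed in the abstract, which is not described here. The regular expression, the function name `extract_gleason`, and the sample sentence are all assumptions for illustration. Real OCR output would likely need more tolerant patterns, for example for misread digits.

```python
import re

# Assumed pattern: the word "Gleason", up to 20 non-digit characters,
# then "primary + secondary" with an optional "= total".
GLEASON_RE = re.compile(
    r"gleason\D{0,20}?([1-5])\s*\+\s*([1-5])(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)


def extract_gleason(text: str) -> list[dict]:
    """Return one record per Gleason mention found in the text."""
    records = []
    for match in GLEASON_RE.finditer(text):
        primary = int(match.group(1))
        secondary = int(match.group(2))
        stated = int(match.group(3)) if match.group(3) else None
        total = primary + secondary
        records.append(
            {
                "primary": primary,
                "secondary": secondary,
                "total": total,
                # False when the report states a total that does not
                # equal primary + secondary (a possible OCR error).
                "consistent": stated is None or stated == total,
            }
        )
    return records


# Hypothetical report fragment.
print(extract_gleason("Diagnosis: adenocarcinoma, Gleason score 3 + 4 = 7."))
```

Checking the stated total against the sum of the two patterns is a cheap way to flag mentions that noise may have corrupted.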
Towards comprehensive syntactic and semantic annotations of the clinical narr... - Jinho Choi
Objective: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components.
Methods: Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed.
Results: The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations.
Conclusions: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca... - Databricks
The speaker will review case studies from real-world projects that built AI systems using Natural Language Processing (NLP) in healthcare. These case studies cover projects that deployed automated patient risk prediction, automated diagnosis, clinical guidelines, and revenue cycle optimization.
Interest in using prognostic factors to guide personalized breast cancer treatment has increased. For clinicians, early detection of those factors can help with good management of the disease and with the choice of an effective treatment. Moreover, a huge amount of meaningful information in pathology reports, biological measurements, and clinical information along the patient journey remains unexploited. In that context, I propose to develop and apply novel machine learning techniques to predict cancer outcomes such as recurrence or survival from multi-modal breast cancer patient data (including medical notes in natural language and the results of various lab analyses). For that, I use a deep neural sequence transduction model for electronic health records called BEHRT. This model is inspired by one of the most powerful transformer-based architectures in Natural Language Processing: BERT.
Broadening the Scope of Nanopublications - Tobias Kuhn
(CC Attribution License does not apply to included third-party material on slide 3; see the paper for the references: http://www.tkuhn.ch/pub/kuhn2013eswc.pdf )
The Application of the Human Phenotype Ontology - mhaendel
Presented at the II International Summer School for Rare Disease and Orphan Drug Registries, September 15-19, 2014, organized by the National Centre for Rare Diseases, Istituto Superiore di Sanità (ISS), Rome, Italy.
Note the extensive contribution by many consortium members and partners listed in the acknowledgements slide.
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in... - David Talby
An April 2023 presentation to the AMIA working group on natural language processing. The talk focuses on three current trends in NLP and how they apply in healthcare: Large language models, No-code, and Responsible AI.
This year's 3rd Annual TCGC: The Clinical Genome Conference, held June 10-12, 2014 in San Francisco, is a three-day event that weaves together the science of sequencing and the business of implementing genomics in the clinic. It uniquely illustrates the mutual influence of those areas and the resulting need to consider the needs, challenges and opportunities of both, from next-generation sequencing and variant interpretation to insurance reimbursement and electronic health records, throughout the entire research process. Learn more at http://www.clinicalgenomeconference.com
Cephalometrics history, evolution, and land marks/orthodontic courses by indi... - Indian dental academy
Indian Dental Academy will be one of the most relevant and exciting training centers, with the best faculty and flexible training programs for dental professionals who wish to advance in their dental practice. It offers certified courses in Dental Implants, Orthodontics, Endodontics, Cosmetic Dentistry, Prosthetic Dentistry, Periodontics, and General Dentistry.
Tweeting beyond Facts – The Need for a Linguistic Perspective - Data Science Society
Text is only accepted by its intended audience when included facts are properly anchored in extra-propositional information: source, time, date, sentiment, certainty, veridicity, etc. can all be conveyed through linguistic embedding constructions, among others. Work in the CLaC Lab has a long tradition of modeling this extra-propositional material, from reported speech to speculative language, negation, modality, event temporal anchoring and tense information. I show results from our recent validation of negation and modality as contributing to a downstream task of sentiment analysis of tweets and outline how I consider this validation to extend to our continuing effort to give a modular, shallow, and compositional treatment of embedding predicates in general.
ATTENTION-BASED DEEP LEARNING SYSTEM FOR NEGATION AND ASSERTION DETECTION IN ... - ijaia
Natural language processing (NLP) has recently been used to extract clinical information from free text in Electronic Health Records (EHRs). One challenge in clinical NLP is that the meaning of clinical entities is heavily affected by assertion modifiers such as negation, uncertainty, hypothetical status, experiencer, and so on. Incorrect assertion assignment could cause an inaccurate diagnosis of a patient’s condition or negatively influence subsequent studies such as disease modelling. Thus, high-performance clinical NLP systems that can automatically detect negation and other assertion statuses of given target medical findings (e.g. disease, symptom) in clinical context are in high demand. In this work, we propose a deep-learning system based on word embeddings and Attention-based Bidirectional Long Short-Term Memory networks (AttBiLSTM) for assertion detection in clinical notes. Unlike previous state-of-the-art methods, which require knowledge input, our system is a knowledge-poor machine learning system and can be easily extended or transferred to other domains. The evaluation of our system on public benchmarking corpora demonstrates that a knowledge-poor deep-learning system can also achieve high performance in detecting negation and assertions compared to state-of-the-art systems.
This talk gives an introduction to entity linking for biomedical data. It describes the problem to be solved as a three-stage task and links to state-of-the-art approaches for these steps.
Talk held at the Hamburg Data Science Meetup, Hamburg's largest data event.
These are the slides for the tutorial "Methods and Applications of NLP in Medicine" at AIME 2020 (International Conference on Artificial Intelligence in Medicine), provided by Dr. Hua Xu, Dr. Yifan Peng, Dr. Yanshan Wang, and Dr. Rui Zhang. In this half-day tutorial, we introduced our methodological efforts in applying NLP to the clinical domain and showcased our real-world NLP applications in clinical practice and research across four institutions. We reviewed NLP techniques for solving clinical problems and facilitating clinical research and the state-of-the-art clinical NLP tools, shared our experience of collaborating with clinicians, covered publicly available EHR data and medical resources, and concluded with the vast opportunities and challenges of clinical NLP. The tutorial provides an overview of clinical backgrounds and does not presume knowledge of medicine or health care.
Speaker: Vitalii Braslavskyi, Software Engineer at Grammarly
Summary:
Today, the dominant approach to software engineering is an imperative one — the best practices have been proven over time. But the world is always evolving, and in order to evolve with it and remain as productive as possible, we need to continue searching for better tools to solve problems of increasing complexity.
In this talk, we'll discuss the tools and techniques of the .NET ecosystem that can help us concentrate on the problem itself, not just on the intermediate steps (which have likely already been solved). We'll compare imperative and declarative approaches and assess solutions to problems.
We'll also offer examples of how engineers in Grammarly's Office Add-in team use these tools to improve the efficiency of our engineering and strengthen our solutions to the problems at hand.
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri... - Grammarly
Speaker: Elena Voita, a Ph.D. student at the University of Edinburgh and the University of Amsterdam
Summary: How can you know whether a model (e.g., ELMo, BERT) has learned to encode a linguistic property? The most popular approach to measure how well pretrained representations encode a linguistic property is to use the accuracy of a probing classifier (probe). However, such probes often fail to adequately reflect differences in representations, and they can show different results depending on probe hyperparameters. As an alternative to standard probing, we propose information-theoretic probing which measures minimum description length (MDL) of labels given representations. In addition to probe quality, the description length evaluates “the amount of effort” needed to achieve this quality. We show that (i) MDL can be easily evaluated on top of standard probe-training pipelines, and (ii) compared to standard probes, the results of MDL probing are more informative, stable, and sensible.
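The summary does not say how the description length is estimated. One common way to do it, assumed here, is online (prequential) coding: the first block of labels is sent with a uniform code, and each later block is encoded with a probe trained on all earlier blocks. The sketch below only does the bookkeeping. The function names and the toy log-probabilities are hypothetical, and training the probes is left out.

```python
import math


def online_codelength(
    num_classes: int,
    first_block_size: int,
    block_log2_probs: list[list[float]],
) -> float:
    """Online codelength in bits.

    num_classes: number of label classes K.
    first_block_size: number of labels sent with a uniform code over K classes.
    block_log2_probs: for each later block, log2 of the probability the probe
        trained on all earlier blocks assigns to each true label.
    """
    uniform_bits = first_block_size * math.log2(num_classes)
    model_bits = -sum(sum(block) for block in block_log2_probs)
    return uniform_bits + model_bits


def compression(num_classes: int, num_labels: int, codelength: float) -> float:
    """Ratio of the uniform codelength to the achieved codelength."""
    return num_labels * math.log2(num_classes) / codelength


# Toy numbers (assumed): 4 classes, 2 labels sent uniformly, then two blocks.
bits = online_codelength(4, 2, [[-1.0, -1.0], [-0.5, -0.5, -0.5, -0.5]])
print(bits, compression(4, 8, bits))
```

Under this scheme, a representation that lets probes predict labels well from little data yields a shorter code, which is one way to read the "amount of effort" in the summary.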
Similar to Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical domain - Natalia Grabar
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien... - Grammarly
Speaker: Kenneth Heafield, Lecturer at the University of Edinburgh
Summary: The ParaCrawl project is mining a petabyte of the web for translations to release freely at https://paracrawl.eu/releases.html. But the web is a messy place, with a lot of data to sift through. To find translations, we translate everything into English or at least use a neural encoder. A related project makes machine translation inference more efficient by using optimizations ranging from assembly instructions to removal of bits of model architecture.
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and... - Grammarly
Speaker: Nizar Habash is an Associate Professor of Computer Science at New York University Abu Dhabi (NYUAD). Professor Habash’s research includes extensive work on machine translation, morphological analysis, and computational modeling of Arabic and its dialects. Professor Habash has been a principal investigator or co-investigator on over 20 grants. He has over 200 publications including a book titled “Introduction to Arabic Natural Language Processing.” His website is www.nizarhabash.com. He is the director of the NYUAD Computational Approaches to Modeling Language (CAMeL) Lab (www.camel-lab.com).
Summary: The Arabic language presents a number of challenges to researchers and developers of language technologies. Arabic is both morphologically rich and highly ambiguous; and it has a number of dialects that vary widely amongst themselves and with Standard Arabic. The dialects have no official spelling standards, and spelling and grammar errors are common in unedited Standard Arabic. In this talk, we present some of these challenges in detail and cover some of the ongoing efforts to address them with creative language technologies.
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che... - Grammarly
Speaker: Artem Chernodub, Chief Scientist at Clikque Technology and Associate Professor at Ukrainian Catholic University
Summary: Sequence Tagging is an important NLP problem that has several applications, including Named Entity Recognition, Part-of-Speech Tagging, and Argument Component Detection. In our talk, we will focus on a BiLSTM+CNN+CRF model — one of the most popular and efficient neural network-based models for tagging. We will discuss task decomposition for this model, explore the internal design of its components, and provide the ablation study for them on the well-known NER 2003 shared task dataset.
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa... - Grammarly
Speaker: Isabelle Augenstein, Assistant Professor, University of Copenhagen
Summary: The spread of misinformation and disinformation is growing, and it’s having a big impact on interpersonal communications, politics and even science.
Traditional methods, e.g., manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n... - Grammarly
Speaker: Marek Rei, Senior Research Associate, University of Cambridge
Summary: The number of people learning English around the world is currently estimated at 1.5 billion and is predicted to exceed 1.9 billion by 2020. The increasing need to communicate beyond borders has created a large unmet demand for qualified language teachers across the globe. Computational models for error detection and essay scoring can alleviate this issue by giving millions of people access to affordable learning resources. Successful systems for automated language teaching will need to analyse language at various levels of granularity and provide useful feedback to individual students. In this talk, we will explore some of the latest approaches to written language assessment, using neural architectures for composing the meaning of a sentence or text, and also discuss potential future directions in the field.
Grammarly Meetup: DevOps at Grammarly: Scaling 100x - Grammarly
Speaker: Dmitry Unkovsky, Software Engineer at Grammarly
Summary: We will tell the story of DevOps at Grammarly since 2013. We’ll talk about how we managed infrastructure growth while keeping up with the rapid pace of product development; what worked for us and what did not, and why; and what it’s like to make technical choices as an engineer at our company. We will share our current vision and future plans.
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv... - Grammarly
Tabular data is difficult to analyze and search through. There is a clear need for new tools and interfaces that would allow even non-tech-savvy users to gain insights from open datasets without resorting to specialized data analysis tools or even having to fully understand the dataset structure. We explore the End-To-End Memory Networks architecture (Sukhbaatar et al., 2015) in application to answering natural language questions from tabular data. This architecture was originally designed for the question-answering tasks from short natural language texts (bAbI tasks) (Weston et al., 2015), which include testing elements of inductive and deductive reasoning, co-reference resolution and time manipulation.
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo... - Grammarly
Speaker: Jordi Carrera Ventura, Artificial Intelligence technologist at Telefónica R&D
Summary: Chatbots (aka conversational agents, spoken dialogue systems) allow users to interface with computers using natural language by simply asking questions or issuing commands.
Given a query, the chatbot builds a semantic representation of the input, transforms it into a logical statement, and performs all the necessary actions to fulfill the user's intent. Sometimes this simply means calculating an exact answer or retrieving a fact from a database, whereas other times it means building a contextual model and running a full-fledged conversation flow while keeping track of anaphoras and cross-references.
Besides the direct applications of chatbots in IoT (Amazon’s Alexa, Apple's Siri) and IT (the historical field of Information Retrieval as a whole can be seen as a sub-problem of spoken dialogue systems), chatbots' main appeal for technologists is their location at the intersection of all major Natural Language Processing technologies and many of the deepest questions in Cognitive Science today: semantic parsing, entity recognition, knowledge representation, and coreference resolution.
In this talk, I will explore those questions in the context of an applied industry setting, and I will introduce a framework suitable for addressing them, together with an overview of the state-of-the-art in chatbot technology and some original techniques.
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu...Grammarly
Speaker: Tim Baldwin, Professor of Computer Science, University of Melbourne
Summary: Two forms of bias that are commonly associated with natural language processing (NLP) tasks are domain bias (implicit bias towards documents from a particular domain, with lower performance over other document types) and social bias (implicit bias towards documents authored by particular types of individuals, with lower performance over documents authored by other types of individuals). In this talk, I will discuss the importance of debiasing NLP models across these dimensions, and strategies that can be employed to achieve this. I will focus the talk on the task of language identification (i.e., identifying the language(s) a written document is authored in).
Speaker: Andriy Gryshchuk, Senior Research Engineer at Grammarly.
Summary: Paraphrase detection is a challenging NLP task since it requires both thorough syntactic and thorough semantic analysis to identify whether two phrases have the same intent. A few months ago, paraphrase identification became an objective of one of the most popular Kaggle competitions, Quora Question Pairs. In this talk, Yuriy Guts and Andriy Gryshchuk, silver medalists of the competition, will share their arsenal of statistical, linguistic, and Deep Learning approaches that helped them succeed in this challenge.
Speaker: Yuriy Guts, Machine Learning Engineer at DataRobot.
Natural Language Processing for biomedical text mining - Thierry HamonGrammarly
Speaker: Thierry Hamon, Associate Professor in Computer Science at Université Paris, Member of the LIMSI-CNRS research lab.
Summary: Among the large amounts of unstructured data generated across the world and available nowadays, textual data represent an important source of information. This is particularly true in the biomedical domain, where a constantly increasing demand to access textual content is observed: the situation is relevant for accessing and processing Electronic Health Records, online discussion forums, and scientific literature. Indeed, dealing with biomedical texts requires us to take into account a great variety of texts, languages, and users.
For several years now, a lot of NLP research has focused on mining and retrieving information (i.e., medical entities and domain-specific relations), which are relevant for biologists, physicians, terminologists, epidemiologists, and patients. We will propose an overview of the NLP methods used for tackling several such research problems through text mining applications. First, we will present the resources and rule-based approaches we designed for extracting drug-related information from clinical texts, and for acquiring domain-specific semantic relations from digital libraries. Then we will present the cross-lingual approach we are developing for building multilingual terminologies from a patient-centered Ukrainian corpus.
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical domain - Natalia Grabar
1. Context Difficulty Paraphrases Conclusion
Automatic text simplification in the biomedical domain
Natalia Grabar
STL CNRS UMR8163, France
Grammarly, Kyiv, Ukraine: 21/08/2018
1/45 Automatic text simplification in biomedical domain Natalia Grabar
Background
Lviv University: Languages, Linguistics
Master, PhD (INaLCO, Université Paris 6): NLP, Medical area, Terminology
Acquisition of lexical resources
PostDoc, AHU (Inserm, Fondation HON Geneva): Information retrieval, Quality of information; Discourse analysis, Typology; Information for non-specialized users
Researcher (CNRS): Information for non-specialized users; Semantic annotation, Information extraction
Automatic text simplification in biomedical domain
work in French
1 Context
2 Detection of difficulties
3 Acquisition of paraphrases
4 Conclusion
Context
Evolution of the biomedical domain:
specific knowledge and terms
Different kinds of users:
medical staff, pharmacists, students, patients...
various levels of specialization
Patients: quality of information, understanding
technicality and understanding of health information
⇒ Close relation with health and well-being of people
(AMA, 1999; Berland et al., 2001; McCray, 2005; Tran et al., 2009)
Readability of health documents
Health information must be: readable, understandable, usable
In different situations:
follow up on treatments
make decisions (chronic disorders)
communicate with medical doctors
make the healthcare process successful
Real difficulty:
understand the steps of the correct intake of drugs (Patel et al., 2002)
among 2,600 US patients (2 hospitals):
26% to 60% cannot understand instructions on drug intake, informed consent, health brochures (Williams et al., 1995)
Documents, health websites designed for patients:
often show high technicality (Berland et al., 2001)
Objective
Make health documents and medical terms easier for patients to understand:
detect reading difficulties
propose common paraphrases for technical terms
[Figure: processing pipeline from "Text" to "Simplified text": diagnosis of text, detection of difficult words, simplification/decoration; labels: ref. model, res., rules]
Interdisciplinary research:
linguistics, psychology, terminology, NLP...
Detection of difficulties
Detection of difficulties (documents)
Existing work
Text typology
Diagnosis of the text readability
Classical measures: Flesch (Flesch, 1948), Fog (Gunning, 1973)...
Computational measures:
classical measures and medical vocabulary (Kokkinakis & Toporowska Gronostaj, 2006)
n-grams of characters (Poprat et al., 2006)
manual weighting of words (Zheng et al., 2002)
morphology (Chmielik & Grabar, 2009)
stylistic criteria (Grabar et al., 2007)
discursive criteria (Goeuriot et al., 2007)
various combinations (Wang, 2006; Zeng-Treiler et al., 2007; Goeuriot et al., 2007; Leroy et al., 2008)
...
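As an illustration of a classical measure, here is a minimal sketch of the Flesch Reading Ease score. The formula constants are the standard published ones for English; the syllable counter is a crude vowel-group heuristic chosen for this sketch, and the function names are hypothetical. The talk works on French, where the coefficients and the syllable counting would differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups (English heuristic, an assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# Toy inputs (invented for this sketch): higher score means easier to read.
plain = "The heart is a muscle. It pumps blood."
technical = "Myocardial infarction necessitates percutaneous transluminal angioplasty."
print(flesch_reading_ease(plain), flesch_reading_ease(technical))
```

The score only looks at sentence and word length, which is why the computational measures listed above add medical vocabulary, morphology, or stylistic criteria.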
Detection of difficulties (words)
Existing work
Facilitators: hyphen (Bertram et al., 2011), space (Frisson et al., 2008), morphological closeness (Lüttmann et al., 2011), primes (Bozic et al., 2007; Beyersmann et al., 2012), pictures (Dohmes et al., 2004; Koester & Schiller, 2011), etc.
Morphological head (Jarema et al., 1999; Libben et al., 2003)
NLP: challenges (Specia et al., 2012):
for a short text and a given word, several possible substitutions that fit the context are proposed
→ sort the substitutions according to their simplicity
Descriptors:
Google n-grams, WordNet, length of words, number of syllables, mutual information, frequency...
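The ranking task above can be sketched with two of the listed descriptors, frequency and word length. This is a toy illustration, not the method of any cited system: the frequency table is invented, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def simplicity_key(word: str, freq: Dict[str, int]) -> Tuple[int, int]:
    """Sort key: more frequent words first, then shorter words."""
    return (-freq.get(word, 0), len(word))


def rank_substitutions(candidates: List[str], freq: Dict[str, int]) -> List[str]:
    """Order context-compatible substitutions from simplest to hardest."""
    return sorted(candidates, key=lambda w: simplicity_key(w, freq))


# Hypothetical corpus counts, invented for this sketch.
TOY_FREQ = {"pain": 900, "ache": 300, "arthralgia": 2}
ranked = rank_substitutions(["arthralgia", "ache", "pain"], TOY_FREQ)
print(ranked)
```

A real system would combine many more descriptors (n-gram counts, syllables, mutual information) and learn their weights rather than fix a sort order by hand.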
Detection of difficulties
Psychology: eye-tracking (Grabar et al., 2018)
Eye-tracking:
recording eye movements when reading
Several indicators:
fixations: periods during which the eyes are stable (visual information is analyzed)
saccades: rapid eye movements from one point to another
regressions: backward movements
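The three indicators can be derived from a sequence of fixations once each fixation has been mapped to a word. The sketch below assumes that mapping is already done (real eye trackers record gaze coordinates); the `Fixation` record, the field names, and the toy data are all hypothetical. It counts only movements between different words, which is a simplification of what a saccade is.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fixation:
    word_index: int   # position of the fixated word in the text (assumed given)
    duration_ms: int  # how long the eyes stayed on it


def summarize(fixations: List[Fixation]) -> Dict[str, object]:
    """Total fixation time per word, between-word saccades, and regressions."""
    time_by_word: Dict[int, int] = {}
    for f in fixations:
        time_by_word[f.word_index] = time_by_word.get(f.word_index, 0) + f.duration_ms

    saccades = 0
    regressions = 0
    for prev, cur in zip(fixations, fixations[1:]):
        if cur.word_index != prev.word_index:
            saccades += 1                      # the gaze moved to another word
            if cur.word_index < prev.word_index:
                regressions += 1               # backward movement
    return {"time_by_word": time_by_word, "saccades": saccades, "regressions": regressions}


# Toy reading trace: the reader goes back once from word 2 to word 1.
trace = [Fixation(0, 200), Fixation(1, 180), Fixation(2, 450),
         Fixation(1, 150), Fixation(2, 300), Fixation(3, 210)]
summary = summarize(trace)
print(summary)
```

Words that accumulate long fixation times and trigger regressions are candidates for being difficult, which is how such measurements can feed the detection step.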
Detection of difficulties: Eye-tracking
text1 (technical version)
EXAMEN : ECHOGRAPHIE DES MAINS ET DES PIEDS
MOTIF : Bilan d’arthralgies
Mains : On ne visualise pas de ténosynovite, ou d’arthrosynovite.
Avant-pieds : On retrouve des remaniements intéressant les premières métatarsophalangiennes en rapport avec des antécédents de chirurgie d’Hallux valgus.
Absence d’arthrosynovite au niveau des articulations métatarsophalangiennes.
text1 (simplified version)
EXAMEN : ECHOGRAPHIE DES MAINS ET DES PIEDS
MOTIF : Bilan de douleurs articulaires
Mains : On ne visualise pas d’inflammation des tendons, ni de la membrane articulaire.
Avant-pieds : On retrouve des remaniements intéressants sur les premières articulations des pieds en rapport avec les antécédents de la chirurgie de la déformation du pied.
Absence d’inflammation de la membrane au niveau des articulations du pied.
Detection of difficulties: Eye-tracking
text2 (technical version)
Cette patiente avait constitué un infarctus du myocarde antérieur en novembre 2010, pour lequel avait été réalisée une angioplastie de l’IVA moyenne avec implantation d’un stent non actif Vision de 2.75 mm x 18 mm, un complément par angioplastie au ballon seul en aval. Une endoprothèse avait également été implantée au niveau de la circonflexe proximale, avec un stent Vision 2.5 x 18 mm. La fraction d’éjection était évaluée entre 35 et 40 %.
Nous l’avions revue récemment, en insuffisance cardiaque, avec plusieurs autres problèmes :
- une anémie microcytaire inexpliquée,
- un déséquilibre important de son diabète pour lequel elle a été, entre temps, prise en charge par nos confrères diabétologues.
text2 (simplified version)
Cette patiente avait présenté une crise cardiaque en novembre 2010, pour laquelle avait été réalisée une intervention chirurgicale de l’artère cardiaque avec implantation d’un stent non actif. Un autre stent avait également été implanté au niveau d’une autre artère. La fraction d’éjection observée était basse.
Nous l’avions revue récemment, en insuffisance cardiaque, avec plusieurs autres problèmes :
- une anémie inexpliquée,
- un déséquilibre important de son diabète pour lequel elle a été, entre temps,
Detection of difficulties: Eye-tracking
Results on text1
Detection of difficulties: Eye-tracking
Results on text2
Detection of difficulties: NLP
(Grabar et al., 2014)
Medical words from Snomed International (Côté et al., 1993)
29,641 lemmatized words
Manually annotated:
by 3 independent annotators
categories:
1 I can understand
2 I am not sure
3 I cannot understand
inter-annotator agreement: Cohen’s Kappa 0.736
NLP task: supervised categorization
automatically reproduce the manual annotations: F=0.90
24 descriptors:
syntactic and morphological information, reference lexica, frequency, length, initial and final substrings, readability scores...
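For reference, Cohen's kappa between two annotators can be computed as below. This is a minimal sketch of the standard pairwise formula; the slide does not say how the value 0.736 was obtained across three annotators, and the label sequences here are invented.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same category independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1.0 - expected)


# Invented labels using the three categories from the slide:
# 1 = I can understand, 2 = I am not sure, 3 = I cannot understand.
annotator_1 = [1, 1, 2, 3, 3, 1, 2, 3]
annotator_2 = [1, 1, 2, 3, 2, 1, 3, 3]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

With three annotators, one common option is to average the pairwise values; whether that was done here is not stated in the source.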
Detection of difficulties
Typology
abbreviations (OG, VG, PAPS, j, bat, cp);
proper names (Gougerot, Sjögren, Bentall, Glasgow, Babinski, Barthel, Cockcroft);
drug names;
neoclassical compounds - disorders, procedures, treatments (pseudohémophilie, sclérodermie, hydrolase, tympanectomie, arthrodèse, synesthésie);
borrowings from Latin or English;
human anatomy (cloacal, pubovaginal, nasopharyngé, mitral, antre, inguinal, strontium, érythème, maxillo-facial, mésentère);
lab test results.
Acquisition of paraphrases
Existing work: general language
Revision of Simple Wikipedia articles (Yatskar et al., 2010):
probabilistic models and filters
between 1,079 and 2,970 pairs:
{stands for, is the same as}, {indigenous, native}
precision: 17% to 86%;
Methods from machine translation (Zhu et al., 2010; Wubben et al., 2012):
parallel and aligned corpora (Wikipedia/Simple Wikipedia)
Distributional methods (Glavas & Stajner, 2015; Kim et al., 2016):
monolingual corpora
vectors can contain equivalents that are easier to understand
filtering
Acquisition of paraphrases
Existing work: medical language
Automatic translator of medical terms to general language (McCray et al., 1999):
MEDLINEplus (brochures)
Consumer Health Vocabulary (CHV) (Zeng & Tse, 2006)
collaborative approach
Morpho-syntactic variants (Deléger & Zweigenbaum, 2008; Cartoni & Deléger, 2011):
{consommation régulière, consommer de façon régulière}
{gêne à la lecture, empêche de lire}
Social media specificities (Tapi Nzali et al., 2015):
misspellings
{cirrhose, cyrose}, {métastase, metastase}
reduced words
{oncologue, onco}, {chimiothérapie, chimio}
Definitions
Methods
Definition: structure with two elements:
definiendum (term to define) and definiens (the definition)
Myocarde est le tissu musculaire du coeur
Use of four patterns (Péry-Woodley & Rebeyrolle, 1998)
désigne (means)
est un (is a)
est appelé (is called)
peut être défini comme (can be defined as)
...with inflectional variants
Trigger: term
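The pattern-based extraction can be sketched with a regular expression. This is a simplified illustration, not the actual implementation: the regex covers only a handful of inflectional variants chosen for this sketch, it ignores the term trigger, and the names `DEFINITION` and `extract_definition` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Simplified surface forms of the four patterns, with a few inflectional variants.
MARKERS = (
    r"désigne(?:nt)?|est une?|sont des|est appelée?|sont appelée?s|"
    r"peu(?:t|vent) être défini(?:es|e|s)? comme"
)
DEFINITION = re.compile(
    r"^(?:(?:le|la|les|un|une)\s+|l['’])?(?P<left>.+?)\s+"
    r"(?P<marker>" + MARKERS + r")\s+(?P<right>.+?)\.?$",
    re.IGNORECASE,
)


def extract_definition(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (definiendum, definiens), or None when no pattern matches."""
    m = DEFINITION.match(sentence.strip())
    if m is None:
        return None
    left, right = m.group("left"), m.group("right")
    if "appel" in m.group("marker").lower():
        # "X est appelé Y": the term being defined comes after the marker.
        return right, left
    return left, right


print(extract_definition("L'hypoglycémie est un manque de sucre dans l'organisme"))
print(extract_definition("La couche extérieure du cœur est appelée péricarde."))
```

Surface patterns like "est un" also match sentences that are not definitions, so a filtering or evaluation step would still be needed on real text.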
Definitions
Results
L’hypoglycémie est un manque de sucre dans l’organisme
Une septicémie est un empoisonnement du sang du à un microbe
Le curetage est un nettoyage en profondeur d’une gencive inflammée
Pour un être humain adulte, une hypoglycémie est une glycémie inférieure à 0,8 g/L
Les signes classiques annonciateurs de l’hypoglycémie sont des sueurs, pâleur, palpitations, fringales en particulier
L’impétigo est une infection cutanée, qui provoque des pustules qui dégénèrent en croûtes jaunâtres, l’impétigo est due à...
Definitions
Results
Readability (péricarde):
+ La couche extérieure du cœur est appelée péricarde.
∼ Le péricarde est un sac à double paroi contenant le cœur et les racines des gros vaisseaux sanguins.
− Le péricarde est un organe de glissement, formé de deux feuillets limitant une cavité virtuelle, la cavité péricardique, qui permet les mouvements cardiaques.
Reformulations
Motivation
Reformulation: say differently (Le Bot et al., 2008)
Occurrence of reformulations:
indicates presence of difficult words/terms
provides triggers for the extraction
Exploit reliable data:
health fora with moderators
Wikipedia
Reformulations
Methods
concept, marker, reformulation:
vésiculaire | c’est-à-dire | venant de la vésicule biliaire
3 markers:
c’est-à-dire (that is to say)
autrement dit ; Autrement dit (in other words)
encore appelé(e)(s) (also called)
Pre-processing
POS-tagging and syntactic analysis by Cordial (Laurent et al., 2009)
Trigger: markers
Extraction of concept and of reformulation:
syntactic information
boundaries: syntagms or clauses
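The marker-triggered split can be sketched as follows. The sketch assumes the segment has already been cut to the right boundaries; in the approach described above, that step relies on the syntactic analysis, which is not reproduced here. The names `MARKER` and `split_on_marker` are hypothetical.

```python
import re
from typing import Optional, Tuple

# The three markers from the slide, with optional surrounding commas.
MARKER = re.compile(
    r"\s*,?\s*\b(c['’]est-à-dire|autrement dit|encore appelée?s?)\b\s*,?\s*",
    re.IGNORECASE,
)


def split_on_marker(segment: str) -> Optional[Tuple[str, str, str]]:
    """Return (concept, marker, reformulation), or None if no marker is found."""
    m = MARKER.search(segment)
    if m is None:
        return None
    concept = segment[: m.start()].strip()
    reformulation = segment[m.end():].strip(" .")
    if not concept or not reformulation:
        return None  # a marker with nothing on one side is not usable
    return concept, m.group(1).lower(), reformulation


print(split_on_marker("vésiculaire, c'est-à-dire, venant de la vésicule biliaire"))
print(split_on_marker("par une lithiase c'est-à-dire un caillou"))
```

Without syntactic boundaries the left part may include extra words (here "par une"), which is the kind of boundary error the evaluation of this method has to account for.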
Reformulations
form          lemma        POS      POSMT    GS     type GS  Prop
Vous          vous         PPER2P   Pp2.pn   1      S        1
ne            ne           ADV      Rpn      3—1    S        1
devez         devoir       VINDP2P  Vmip2p   3      V        1
pas           pas          ADV      Rgn      3      Q        1
employer      employer     VINF     Vmn–     5      D        2
de            de           PREP     Sp       7      D        2
savons        savon        NCMP     Ncmp     7      D        2
ou            ou           COO      Cc       7      F        2
des           de le        DETDPIG  Da-.p-i  10—7   F        2
laits         lait         NCMP     Ncmp     10—7   F        2
sophistiqués  sophistiqué  ADJMP    Afpmp    10—7   F        2
,             ,            PCTFAIB  Ypw      -      -        2
c’            ce           PDS      Pd-..-   13     N        2
est           est          ADV      Rgp      -      p        2
-à            à            PREP     Sp       16     F        2
-dire         dire         VINF     Vmn–     16     F        2
contenant     contenant    NCMS     Ncms     17     D        2
plusieurs     plusieurs    ADJIND   Dt-.p-   19     D        2
composants    composant    NCMP     Ncmp     19     D        2
Reformulations
Evaluation
Corpus      Dev.   Test
nb occ.     96     2,757
nb types    96     2,710
Match       P      R      F
exact       0.24   0.24   0.24
inexact     0.98   0.98   0.98
Difficulties:
detection of boundaries:
en c’est-à-dire au contact du sang circulant
une toxi-infection, c’est-à-dire, qu’ elle peut
semantics:
en 10 ans autrement dit sur 64 millions de personnes
un objectif c’est-à-dire une finalité
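The gap between exact and inexact scores can be illustrated with a small span-matching sketch. The slide does not define the two settings; the code assumes that "exact" means identical boundaries and "inexact" means any overlap with a reference span. The spans and function names are invented for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) token offsets


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def prf(predicted: List[Span], gold: List[Span], exact: bool) -> Tuple[float, float, float]:
    """Precision, recall and F-measure under exact or overlap (inexact) matching."""
    def match(p: Span, g: Span) -> bool:
        return p == g if exact else overlaps(p, g)

    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented spans: one perfect match, one boundary error, one miss.
gold_spans = [(0, 3), (10, 14), (20, 22)]
predicted_spans = [(0, 3), (10, 13), (30, 32)]
exact_scores = prf(predicted_spans, gold_spans, exact=True)
inexact_scores = prf(predicted_spans, gold_spans, exact=False)
print(exact_scores, inexact_scores)
```

Under these assumptions a boundary error counts as a miss in the exact setting and as a hit in the inexact one, which is consistent with boundary detection being listed as the main difficulty.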
Reformulations
Results
des canaux galactophores c’est-à-dire sécrètent le lait
erratiques c’est-à-dire qu’ils changent de d’aspect et d’endroit
par une lithiase c’est-à-dire un caillou
clivage du moi c’est-à-dire comme une opposition entre le moi et la réalité
au gré de la désintégration radioactive du 18 F c’est-à-dire avec une demi-vie d’environ
un trouble de l’identité sexuelle c’est-à-dire qu’ils s’identifient à un genre ne correspondant pas à leur sexe biologique
une enzyme protéolytique c’est-à-dire digère les protéines comme le fait le suc pancréatique
celle de troubles fonctionnels intestinaux encore appelés colopathie fonctionnelle
Morphological composition
[Figure: processing pipeline. Medical terms: POS-tagging, morphological analysis of components, translation. Corpus: POS-tagging, syntactic analysis. Then alignment and evaluation.]
Processing of terms
myocarde myocarde/Nom
[[[myo N*] [carde N*] NOM] ique ADJ]
myo=muscle, carde=coeur
Processing of corpus
Les causes de tachycardie ventriculaire sont superposables à celles des extrasystoles ventriculaires: infarctus du myocarde, insuffisance cardiaque, [hypertrophie du [muscle du cœur]] et prolapsus de la valve mitrale.
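The alignment idea can be sketched as a search for a short word sequence that contains the translations of all components of a term. This toy version replaces the POS-tagging and syntactic analysis of the pipeline with a plain word window; the function name, the `max_gap` parameter, and the window heuristic are assumptions of this sketch. The component glosses and the sentence come from the slide.

```python
import re
from typing import Dict, List

# Component translations taken from the slide (myo=muscle, carde=cœur).
COMPONENT_GLOSSES: Dict[str, str] = {"myo": "muscle", "carde": "cœur"}


def find_paraphrases(components: List[str], glosses: Dict[str, str],
                     sentence: str, max_gap: int = 2) -> List[str]:
    """Return short word sequences that start and end with a component gloss
    and contain the glosses of all components."""
    targets = [glosses[c] for c in components]
    tokens = re.findall(r"\w+", sentence.lower())
    max_len = len(targets) + max_gap  # allow a few function words in between
    found: List[str] = []
    for start, token in enumerate(tokens):
        if token not in targets:
            continue
        for end in range(start + len(targets), min(len(tokens), start + max_len) + 1):
            window = tokens[start:end]
            if window[-1] in targets and all(t in window for t in targets):
                found.append(" ".join(window))
                break
    return found


sentence = ("Les causes de tachycardie ventriculaire sont superposables à celles des "
            "extrasystoles ventriculaires: infarctus du myocarde, insuffisance cardiaque, "
            "hypertrophie du muscle du cœur et prolapsus de la valve mitrale.")
paraphrases = find_paraphrases(["myo", "carde"], COMPONENT_GLOSSES, sentence)
print(paraphrases)
```

On the slide's sentence this recovers "muscle du cœur" as a candidate paraphrase for myocarde; syntactic analysis is what lets the real pipeline decide where the syntagm actually begins and ends.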
Morphological composition
Results
Alignment syntagm/term (percentage of alignment):
E1: full term and syntagm:
{myo pathie, maladie du muscle}
E2: full term, partial syntagm:
{myo pathie, maladie du muscle cardiaque}
E3: partial term, full syntagm:
{myopathie, la maladie}
E4: partial term and syntagm:
{myopathie, l’origine de la maladie}
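One possible reading of the four alignment types, sketched in code. The criteria are an assumption inferred from the examples: "full term" is taken to mean that every component of the term is expressed in the syntagm, and "full syntagm" that the syntagm has no content word beyond those components. The glosses for myo and pathie, the stopword list and the function names are hypothetical.

```python
import re

# Hypothetical glosses for the components of "myopathie".
GLOSSES = {"myo": "muscle", "pathie": "maladie"}
# Small illustrative stopword list (articles and prepositions).
STOPWORDS = {"l", "la", "le", "les", "de", "du", "des", "d"}


def content_words(syntagm):
    """Lowercase word tokens of the syntagm, without function words."""
    return {w for w in re.findall(r"\w+", syntagm.lower()) if w not in STOPWORDS}


def alignment_type(glosses, syntagm):
    """Return 'E1'..'E4', or None when no component is found in the syntagm."""
    words = content_words(syntagm)
    gloss_words = set(glosses.values())
    matched = gloss_words & words
    if not matched:
        return None
    full_term = matched == gloss_words    # every component is expressed
    full_syntagm = words <= gloss_words   # no extra content word
    if full_term:
        return "E1" if full_syntagm else "E2"
    return "E3" if full_syntagm else "E4"


labels = [
    alignment_type(GLOSSES, s)
    for s in (
        "maladie du muscle",
        "maladie du muscle cardiaque",
        "la maladie",
        "l’origine de la maladie",
    )
]
```

Under these assumptions the four example syntagms fall into E1, E2, E3 and E4 in that order.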
Morphological composition
Evaluation
Nb of                   unigrams           bigrams            trigrams
                        b     l     s      b     l     s      b     l     s
correct paraphrases     549   785   644    378   517   461    195   290   257
poss. correct           39    32    67     22    45    75     10    19    41
processing of terms     47    60    44     28    28    46     9     10    26
incorrect paraphrases   33    146   296    64    80    380    25    39    148
Pstrict                 82    77    61     77    77    48     82    81    55
Pweak                   88    80    68     81    84    40     86    86    63
%incorrect              5     14    28     13    12    39     11    11    31
Evaluation:
strict precision 82 to 55%
weak precision 86 to 40%
error rate 5 to 39%
Resources
without: the best precision
morphology: good precision
synonymy: low precision
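The three scores can be recomputed from the counts in the table. The formulas below are an assumption inferred from the numbers, not definitions stated on the slide: strict precision counts only correct paraphrases, weak precision also counts the possibly correct ones, and the denominator is the sum of the four count rows. The function name `scores` is hypothetical.

```python
def scores(correct, possibly_correct, term_processing, incorrect):
    """Return (Pstrict, Pweak, %incorrect) as rounded percentages."""
    total = correct + possibly_correct + term_processing + incorrect
    p_strict = 100 * correct / total
    p_weak = 100 * (correct + possibly_correct) / total
    pct_incorrect = 100 * incorrect / total
    return round(p_strict), round(p_weak), round(pct_incorrect)


# First two columns of the table (unigrams, settings b and l).
unigrams_b = scores(549, 39, 47, 33)
unigrams_l = scores(785, 32, 60, 146)
```

This reading reproduces most columns to within one point of rounding. One cell does not fit it: for bigrams in setting s the same formula would give a weak precision of about 56 rather than the 40 shown, so the formulas should be taken as an approximation of how the figures were derived.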
Morphological composition
Morphological analysis
Ambiguous analysis
[post [[uro N*] [graphie N*] NOM] NOM]
[[posturo N*] [graphie N*] NOM]
Incorrect analysis
sanglot: lot and sang
exotique: externe and oreille
divin: deux and vin (deux litres de vin)
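Both kinds of problem follow from segmenting a word against a lexicon of components. A minimal sketch, assuming a toy component set and a hypothetical function `segmentations`; a real analyzer would also use categories and constraints that are ignored here:

```python
def segmentations(word, components):
    """Return every way to split `word` into a sequence of known components."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in components:
            for rest in segmentations(word[i:], components):
                results.append([prefix] + rest)
    return results


# Toy lexicon, chosen only to reproduce the cases on the slide.
COMPONENTS = {"post", "posturo", "uro", "graphie", "sang", "lot"}

ambiguous = segmentations("posturographie", COMPONENTS)
spurious = segmentations("sanglot", COMPONENTS)
```

With this lexicon, posturographie receives two competing segmentations (post + uro + graphie and posturo + graphie), and sanglot is wrongly split into sang + lot because nothing tells the procedure that the word is not a compound.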
Morphological composition
Extraction of paraphrases and their evaluation
Correct paraphrases
Raw
{podalgie, douleur du pied}
{mastite, inflammation du sein}
{cystoprostatectomie, ablation de la vessie et de la prostate}
Morphology
{desmorrhexie, rupture des ligaments} (ligament→ligaments)
{bronchite, inflammation des bronches/inflammation bronchique} (bronche→bronches, bronche→bronchique)
{dentalgie, douleurs dentaires} (dents→dentaires)
Synonymy
{aclasie, absence de fracture} (cassure→fracture)
{enterectomie, résection des intestins} (ablation→résection)
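The difference between the three settings can be illustrated by widening the set of word forms accepted for each component gloss. This is a sketch under stated assumptions: the two dictionaries contain only substitutions listed above, the glosses passed to `matches` (for example "rupture" and "ligament" for desmorrhexie) are assumed, and all names are hypothetical.

```python
# Hypothetical resources, filled with substitutions listed on the slide.
MORPHOLOGY = {"ligament": {"ligaments"}, "bronche": {"bronches", "bronchique"}}
SYNONYMS = {"cassure": {"fracture"}, "ablation": {"résection"}}


def accepted_forms(gloss, use_morphology=False, use_synonymy=False):
    """Word forms accepted for one gloss under the chosen resources."""
    forms = {gloss}
    if use_morphology:
        forms |= MORPHOLOGY.get(gloss, set())
    if use_synonymy:
        forms |= SYNONYMS.get(gloss, set())
    return forms


def matches(glosses, syntagm, **resources):
    """True when every gloss is expressed in the syntagm by an accepted form."""
    words = set(syntagm.lower().split())
    return all(accepted_forms(g, **resources) & words for g in glosses)


raw_hit = matches(["rupture", "ligament"], "rupture des ligaments")
morph_hit = matches(["rupture", "ligament"], "rupture des ligaments", use_morphology=True)
syn_hit = matches(["absence", "cassure"], "absence de fracture", use_synonymy=True)
```

Each added resource lets more syntagms match, which increases the number of paraphrases found at the cost of admitting looser matches.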
Morphological composition
Extraction of paraphrases and their evaluation
Semantic relations between components:
well managed by data from corpora
errors: coordination/subordination
hematospermie: le sang ou le sperme, instead of
→ le sang dans le sperme
Non-compositional terms:
ostéodermie: peau and os, instead of
→ une structure d’écailles, de plaques osseuses ou d’autres
compositions dans les couches dermiques de la peau, comme
chez les lézards ou dinosaures
Comparison with existing work
                                term type   nb. para      precision
(Zeng et al., 2006)             all         CHV
(Elhadad & Sutaria, 2007)       all         152           0.58
(Deléger & Zweigenbaum, 2008)   m-synt.     65, 82        0.67, 0.60
(Cartoni & Deléger, 2011)       m-synt.     109           0.66
definitions                     all         1,028         0.52, 0.68
morphology                      compounds   1,128         0.76, 0.86
abbreviations                   abbr.       42, 8,106     0.74/0.94
reformulation                   all         96, 2,710     0.24/0.98
parentheses                     all         305, 92,971   0.23/0.68
morpho-syntactic:
{consommation régulière, consommer de façon régulière}
comparable performance, better coverage
Comparison with existing work
DériF (Namer, 2003):
gloss in formal language for every analyzed word
our method: coverage depends on content of corpora
myocarde:
DériF: "(Partie de – Type particulier de) coeur en rapport avec le(s) muscle"
our method: muscle du coeur
desmorrhexie:
DériF: "rupture (du – liée au) ligament"
our method: rupture des ligaments
Conclusion
Detection of difficulties
in reading and understanding
Acquisition of resources
for explaining technical terms
Methods dedicated to different kinds of linguistic phenomena
paraphrases, reformulations...
Exploitation of general language corpora
Complementary methods
Interesting and exploitable results
Work in French
[Figure: pipeline from Text to Simplified text, with stages Diagnosis of text, Detection of difficult words, and Simplification/decoration; inputs labelled ref. model, ref. model, res., rules.]
Future work
Increase the coverage of paraphrases and reformulations:
more corpora
comparables (Cochrane, patient package inserts, Wiki/Viki)
monolingual
more suppletive resources
other methods for extracting the paraphrases
Alignment with medical terminologies
Distribution of the resource
Other languages
Lexical simplification of medical texts
ANR project CLEAR (Communication, Literacy, Education,
Accessibility, Readability)
References
AMA (1999).
Health literacy: report of the Council on Scientific Affairs. Ad Hoc Committee on
Health Literacy for the Council on Scientific Affairs, American Medical Association.
JAMA, 281(6), 552–7.
Antoine, E. & Grabar, N. (2017).
Acquisition of expert/non-expert vocabulary from reformulations.
In MIE, Stud Health Technol Inform. 235, pp. 521–525.
Berland, G., Elliott, M., Morales, L., Algazy, J., Kravitz, R.,
Broder, M., Kanouse, D., Munoz, J., Puyol, J., Lara, M. et al. (2001).
Health information on the Internet: accessibility, quality, and readability in
English and Spanish.
JAMA, 285(20), 2612–2621.
Bertram, R., Kuperman, V., Baayen, R. H. & Hyönä, J. (2011).
The hyphen as a segmentation cue in triconstituent compound processing: It’s
getting better all the time.
Scandinavian Journal of Psychology, 52(6), 530–544.
Beyersmann, E., Coltheart, M. & Castles, A. (2012).
Parallel processing of whole words and morphemes in visual word recognition.
The Quarterly Journal of Experimental Psychology, 65(9), 1798–1819.
Bozic, M., Marslen-Wilson, W. D., Stamatakis, E. A., Davis, M. H. &
Tyler, L. K. (2007).
Differentiating morphology, form, and meaning: Neural correlates of
morphological complexity.
Journal of Cognitive Neuroscience, 19(9), 1464–1475.
Cartoni, B. & Deléger, L. (2011).
Découverte de patrons paraphrastiques en corpus comparable: une approche
basée sur les n-grammes.
In Traitement Automatique des Langues Naturelles (TALN).
Chmielik, J. & Grabar, N. (2009).
Comparative study between expert and non-expert biomedical writings: their
morphology and semantics.
Stud Health Technol Inform., 150, 359–63.
Chmielik, J. & Grabar, N. (2011).
Détection de la spécialisation scientifique et technique des documents
biomédicaux grâce aux informations morphologiques.
TAL, 51(2), 151–179.
Côté, R. A., Rothwell, D. J., Palotay, J. L., Beckett, R. S. &
Brochu, L. (1993).
The Systematised Nomenclature of Human and Veterinary Medicine: SNOMED
International.
Northfield: College of American Pathologists.
Deléger, L. & Zweigenbaum, P. (2008).
Paraphrase acquisition from comparable medical corpora of specialized and lay
texts.
In Ann Symp Am Med Inform Assoc (AMIA), pp. 146–50.
Dohmes, P., Zwitserlood, P. & Bölte, J. (2004).
The impact of semantic transparency of morphologically complex words on
picture naming.
Brain and Language, 90(1-3), 203–212.
Elhadad, N. & Sutaria, K. (2007).
Mining a lexicon of technical terms and lay equivalents.
In BioNLP, pp. 49–56.
Flesch, R. (1948).
A new readability yardstick.
Journ Appl Psychol, 32, 221–233.
Frisson, S., Niswander-Klement, E. & Pollatsek, A. (2008).
The role of semantic transparency in the processing of English compound words.
Br J Psychol, 99(1), 87–107.
Glavas, G. & Stajner, S. (2015).
Simplifying lexical simplification: Do we need simplified corpora?
In ACL-COLING, pp. 63–68.
Goeuriot, L., Grabar, N. & Daille, B. (2007).
Caractérisation des discours scientifique et vulgarisé en français, japonais et
russe.
In Traitement Automatique des Langues Naturelles (TALN), pp. 93–102.
Grabar, N., Farce, E. & Sparrow, L. (2018).
Étude de la lisibilité des documents de santé avec des méthodes d’oculométrie.
In Traitement Automatique des Langues Naturelles (TALN), pp. 1–14.
Grabar, N. & Hamon, T. (2014).
Automatic extraction of layman names for technical medical terms.
In ICHI 2014, Pavia, Italy.
Grabar, N. & Hamon, T. (2016).
Exploitation de la morphologie pour l’extraction automatique de paraphrases
grand public des termes médicaux.
TAL, 57(1), 85–109.
Grabar, N., Hamon, T. & Amiot, D. (2014).
Automatic diagnosis of understanding of medical words.
In EACL PITR Workshop, pp. 11–20.
Grabar, N., Krivine, S. & Jaulent, M. (2007).
Classification of health webpages as expert and non expert with a reduced set of
cross-language features.
In Ann Symp Am Med Inform Assoc (AMIA), pp. 284–288.
Gunning, R. (1973).
The art of clear writing.
New York, NY: McGraw Hill.
Jarema, G., Busson, C., Nikolova, R., Tsapkini, K. & Libben, G. (1999).
Processing compounds: A cross-linguistic study.
Brain and Language, 68(1-2), 362–369.
Kim, Y.-S., Hullman, J., Burgess, M. & Adar, E. (2016).
SimpleScience: Lexical simplification of scientific terminology.
In EMNLP, pp. 1–6.
Koester, D. & Schiller, N. O. (2011).
The functional neuroanatomy of morphology in language production.
NeuroImage, 55(2), 732–741.
Kokkinakis, D. & Toporowska Gronostaj, M. (2006).
Comparing lay and professional language in cardiovascular disorders corpora.
In A. Pham T., James Cook University, Ed., WSEAS Transactions on
BIOLOGY and BIOMEDICINE, pp. 429–437.
Laurent, D., Nègre, S. & Séguéla, P. (2009).
L’analyseur syntaxique Cordial dans Passage.
In Traitement Automatique des Langues Naturelles (TALN).
Le Bot, M.-C., Schuwer, M. & Richard, É. (dir.) (2008).
La reformulation : Marqueurs linguistiques – Stratégies énonciatives.
Rennes: Rivages linguistiques.
Leroy, G., Helmreich, S., Cowie, J., Miller, T. & Zheng, W. (2008).
Evaluating online health information: Beyond readability formulas.
In Ann Symp Am Med Inform Assoc (AMIA), pp. 394–8.
Libben, G., Gibson, M., Yoon, Y. B. & Sandra, D. (2003).
Compound fracture: The role of semantic transparency and morphological
headedness.
Brain and Language, 84(1), 50–64.
Lüttmann, H., Zwitserlood, P. & Bölte, J. (2011).
Sharing morphemes without sharing meaning: Production and comprehension of
German verbs in the context of morphological relatives.
Canadian Journal of Experimental Psychology/Revue canadienne de psychologie
expérimentale, 65(3), 173–191.
McCray, A. (2005).
Promoting health literacy.
J of Am Med Infor Ass, 12, 152–163.
McCray, A., Loane, R., Browne, A. & Bangalore, A. (1999).
Terminology issues in user access to web-based medical information.
In Ann Symp Am Med Inform Assoc (AMIA), pp. 107–7.
Namer, F. (2003).
Automatiser l’analyse morpho-sémantique non affixale: le système DériF.
Cahiers de Grammaire, 28, 31–48.
Patel, V., Branch, T. & Arocha, J. (2002).
Errors in interpreting quantities as procedures: The case of pharmaceutical
labels.
Int Journ Med Inform, 65(3), 193–211.
Péry-Woodley, M. & Rebeyrolle, J. (1998).
Domain and genre in sublanguage text: definitional microtexts in three corpora.
In LREC, pp. 987–992.
Poprat, M., Markó, K. & Hahn, U. (2006).
A language classifier that automatically divides medical documents for experts
and health care consumers.
In Int Congress of the European Federation for Medical Informatics, pp.
503–508, Maastricht.
Quinlan, J. (1993).
C4.5 Programs for Machine Learning.
San Mateo, CA: Morgan Kaufmann.
Specia, L., Jauhar, S. & Mihalcea, R. (2012).
SemEval-2012 task 1: English lexical simplification.
In *SEM 2012, pp. 347–355.
Tapi Nzali, M., Bringay, S., Lavergne, C., Opitz, T., Azé, J. &
Mollevi, C. (2015).
Construction d’un vocabulaire patient/médecin dédié au cancer du sein à partir
des médias sociaux.
In IC 2015.
Tran, T., Chekroud, H., Thiery, P. & Julienne, A. (2009).
Internet et soins : un tiers invisible dans la relation médecine/patient ?
Ethica Clinica, 53, 34–43.
Wang, Y. (2006).
Automatic recognition of text difficulty from consumers health information.
In IEEE, Ed., Computer-Based Medical Systems, pp. 131–136.
Williams, M., Parker, R., Baker, D., Parikh, N., Pitkin, K., Coates,
W. & Nurss, J. (1995).
Inadequate functional health literacy among patients at two public hospitals.
JAMA, 274(21), 1677–1682.
Wubben, S., van den Bosch, A. & Krahmer, E. (2012).
Sentence simplification by monolingual machine translation.
In Annual Meeting of the Association for Computational Linguistics, pp.
1015–1024.
Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C. & Lee, L. (2010).
For the sake of simplicity: Unsupervised extraction of lexical simplifications from
Wikipedia.
In NAACL, pp. 365–368.
Zeng, Q. & Tse, T. (2006).
Exploring and developing consumer health vocabularies.
JAMIA, 13, 24–29.
Zeng, Q. T., Tse, T., Divita, G., Keselman, A., Crowell, J. & Browne,
A. C. (2006).
Exploring lexical forms: first-generation consumer health vocabularies.
In Ann Symp Am Med Inform Assoc (AMIA), pp. 1155–1155.
Zeng-Treitler, Q., Kim, H., Goryachev, S., Keselman, A., Slaughter, L.
& Smith, C. (2007).
Text characteristics of clinical reports and their implications for the readability of
personal health records.
In MEDINFO, pp. 1117–1121, Brisbane, Australia.
Zheng, W., Milios, E. & Watters, C. (2002).
Filtering for medical news items using a machine learning approach.
In Ann Symp Am Med Inform Assoc (AMIA), pp. 949–53.
Zhu, Z., Bernhard, D. & Gurevych, I. (2010).
A monolingual tree-based translation model for sentence simplification.
In COLING 2010, pp. 1353–1361.