SlideShare a Scribd company logo
1 of 34
Download to read offline
10th Workshop on Building and
Using Comparable Corpora
at ACL‘17
Vancouver, Canada
A parallel collection of clinical trials in Portuguese and English
Mariana Neves August 3rd, 2017
Parallel corpora of documents
● Parallel collections of documents are valuable resources for
training, tuning and evaluating machine translation (MT) tools.
● However, these are not available for some domains, e.g.,
biomedicine, and manually creating such collections is an
expensive task.
Parallel corpora for news vs. biomedical domains
(https://ufal.mff.cuni.cz/ufal_medical_corpus)
Sources for biomedical documents
● Clinical discharge summaries
– Not available due to privacy
issues
– Usually monolingual
(http://home.iprimus.com.au/callanders/Image-21.jpg)
Sources for biomedical documents
● Scientific publications
– Many are freely available, but frequently monolingual (English)
– There are some exceptions, e.g., Scielo, EDP
(http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1676-06032017000100201&lng=pt&nrm=iso&tlng=en
http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1676-06032017000100201&lng=pt&nrm=iso&tlng=pt)
Sources for biomedical documents
● Scientific publications
– Scielo corpus: the first parallel (comparable) corpus of scientific
publications
(Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine,
International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia. )
Sources for biomedical documents
● Scientific publications
– Datasets from Scielo and EDP currently being used in the WMT
Biomedical Translation Task
(http://www.statmt.org/wmt17/biomedical-translation-task.htm)
Sources for biomedical documents
● Clinical trials
– Freely (publicily) available but usually monolingual (English)
(http://clinicaltrials.gov/)
Sources for biomedical documents
● Clinical trials
– Even sometimes in countries whose native language isn‘t
English...
(http://reec.aemps.es)
Deutschen Register Klinischer Studien (DRKS)
(http://www.drks.de)
● Clinical trials
– Sometimes parallel documents are available but license doesn‘t
allow its distribution
Brazilian Clinical Trials Registry /
Registro Brasileiro de Ensaios Clínicos (ReBEC)
(http://www.ensaiosclinicos.gov.br/)
Overview of a clinical trial
(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
Overview of a clinical trial
(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
Overview of a clinical trial
(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
Overview of a clinical trial
(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
Overview of a clinical trial
(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
Corpus construction: Pipeline
● Data download
● OpenXML Trials parsing
● Sentence splitting
● Sentence alignment
● Quality checking
Similar to : Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine,
International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia.
Data download
Data download
(120 links as of January 4th)
OpenXML parsing
OpenXML parsing
OpenXML parsing
OpenXML parsing
Open XML parsing
● Fields that have been considered:
(a) trial identifier
(b) public title
(c) scientific title
(d) interventions to be carried out
(e) inclusion criteria for taking part
(f) exclusion criteria for not participating
(g) primary outcome
(h) secondary outcome
Final documents based on the concatenation of the various fields
following the order in the OpenXML Trials file
Sentence splitting
● „Sentence Detector“ models for EN and PT
(https://opennlp.apache.org/)
Sentence alignment
● Default parameters of GMA
● List of stopwords
– EN: http://www.textfixer.com/tutorials/common-english-words.txt
– PT: http://www.linguateca.pt/chave/stopwords/chave.MF300.txt
and English
(http://nlp.cs.nyu.edu/GMA/)
Quality checking
(https://github.com/cfedermann/Appraise)
(Sample of 50 trials)
Results
● Clinical trials corpus: 1188 documents
– EN: 23,843 sentences, 625,881 tokens
– PT: 23,666 sentences, 665,325 tokens
Results
● Manual validation of 50 trials
– 67% of the sentences are correctly aligned
– 28% of the sentences were not aligned
– 5% of the sentences had some overlap
● In contrast, our results for the Scielo corpus had a >80% correct
alignment
Discussion
● Many of the wrong
alignments were due to
shifted sentences
– Mainly due to fields
being placed in
different order due
to multiple
instances of the
same type
Discussion
● Some few errors were due to wrong sentence splitting
Subject must be at least 18 years of age; males and females with
a documented diagnosis of ulcerative colitis (UC) at least 4
months prior to entry into the study; subjects with moderately to
severely active UC based on Mayo score criteria; subjects must
have failed or be intolerant of at least one of the following
treatments for UC: corticosteroids (oral ou intravenous),
azathioprine or 6 mercaptopurine (6MP), anti TNF alpha therapy
(infliximab ou
adalimumab).
Discussion
● Some errors were due to splitting of (very) long sentences
Diseases which cause damage in the intestinal mucosa, diseases that
significantly increase the gastrointestinal transit as infectious enteritis,
celiac disease, inflammatory bowel disease (Chron), drug-induced
enteritis or radiation, diverticular disease of the colon; History of
surgery: heart (whatever), renal (exercises kidney or renal agenesis),
intestinal (partial or total removal of the esophagus, stomach,
duodenum, jejunum, ileum, ascending colon, transverse colon,
descending colon, sigmoid colon or rectum) , liver or pancreas;
Volunteers smoking more than five cigarettes a day; different eating
habits of the population standard, eg vegetarianism, veganism; History
of alcohol consumption or use of drugs of abuse; Made use of
antibiotics as regular medication (continuous use) within the 4 weeks
preceding the valuation date and / or the start of the breath test; This
examination Colonoscopy one month before the breath test H2
expired.
Conclusions
● A novel comparable/parallel corpus of clinical trials for EN/PT
– Reasonable size, easy to obtain and freely available
● However, further processing is necessary to improve the quality of
the corpus.
● Experiments still pending to evaluate its suitability for MT.
● Available at:
– https://github.com/biomedical-translation-corpora/wmt-task
Thank you!
Looking forward to answering your questions!
Mariana Neves
Current email: Mariana.Lara-Neves@bfr.bund.de
Current affiliation: German Federal Institute for Risk Assessment

More Related Content

Similar to Mariana Neves - 2017 - A parallel collection of clinical trials in Portuguese and English

2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf
2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf
2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdfAlain van Gool
 
Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...OECD Environment
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Manikhandan Mudaliar
 
Data dialogue - Human Genomic Data Discovery
Data dialogue - Human Genomic Data DiscoveryData dialogue - Human Genomic Data Discovery
Data dialogue - Human Genomic Data DiscoveryFiona Nielsen
 
Towards evidence-based clinical practice guidelines implementation at King Sa...
Towards evidence-based clinical practice guidelines implementation at King Sa...Towards evidence-based clinical practice guidelines implementation at King Sa...
Towards evidence-based clinical practice guidelines implementation at King Sa...Yasser Sami Abdel Dayem Amer
 
2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf
2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf
2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdfAlain van Gool
 
Variation and the VEP: Ensembl Online Webinar series
Variation and the VEP: Ensembl Online Webinar seriesVariation and the VEP: Ensembl Online Webinar series
Variation and the VEP: Ensembl Online Webinar seriesDenise Carvalho-Silva, PhD
 
Active research management and sharing
Active research management and sharingActive research management and sharing
Active research management and sharingJisc
 
Reverse Engineering of Clinical Trials to Improve Research
Reverse Engineering of Clinical Trials to Improve ResearchReverse Engineering of Clinical Trials to Improve Research
Reverse Engineering of Clinical Trials to Improve ResearchWolfgang Kuchinke
 
Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen ARDC
 
AORTIC Conference - The Continuous Update Project: Introduction to the Project
AORTIC Conference - The Continuous Update Project: Introduction to the ProjectAORTIC Conference - The Continuous Update Project: Introduction to the Project
AORTIC Conference - The Continuous Update Project: Introduction to the ProjectWorld Cancer Research Fund International
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...Juan Antonio Vizcaino
 
10-3. Rare disease registries. Samantha Parker (eng)
10-3. Rare disease registries. Samantha Parker (eng)10-3. Rare disease registries. Samantha Parker (eng)
10-3. Rare disease registries. Samantha Parker (eng)KidneyOrgRu
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformaticsphilmaweb
 
B Dent2 Reviewfeedbacksession2008
B Dent2 Reviewfeedbacksession2008B Dent2 Reviewfeedbacksession2008
B Dent2 Reviewfeedbacksession2008jeremyc
 
Grand round whsiao_may2015
Grand round whsiao_may2015Grand round whsiao_may2015
Grand round whsiao_may2015IRIDA_community
 
How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality?  - William HsiaoHow Can We Make Genomic Epidemiology a Widespread Reality?  - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality? - William HsiaoWilliam Hsiao
 

Similar to Mariana Neves - 2017 - A parallel collection of clinical trials in Portuguese and English (20)

2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf
2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf
2023-11-09 HealthRI Biobanking day_Amsterdam_Alain van Gool.pdf
 
Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 
SeniorDesignAdvisoryCouncilPoster_2016 (1)
SeniorDesignAdvisoryCouncilPoster_2016 (1)SeniorDesignAdvisoryCouncilPoster_2016 (1)
SeniorDesignAdvisoryCouncilPoster_2016 (1)
 
Data dialogue - Human Genomic Data Discovery
Data dialogue - Human Genomic Data DiscoveryData dialogue - Human Genomic Data Discovery
Data dialogue - Human Genomic Data Discovery
 
Towards evidence-based clinical practice guidelines implementation at King Sa...
Towards evidence-based clinical practice guidelines implementation at King Sa...Towards evidence-based clinical practice guidelines implementation at King Sa...
Towards evidence-based clinical practice guidelines implementation at King Sa...
 
2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf
2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf
2022-11-23 DTL Future of data-driven life sciences, Utrecht, Alain van Gool.pdf
 
Variation and the VEP: Ensembl Online Webinar series
Variation and the VEP: Ensembl Online Webinar seriesVariation and the VEP: Ensembl Online Webinar series
Variation and the VEP: Ensembl Online Webinar series
 
Active research management and sharing
Active research management and sharingActive research management and sharing
Active research management and sharing
 
Lab'InSight Toxicological Risk Assessment UCL - LTAP 24.10.2013
Lab'InSight Toxicological Risk Assessment UCL - LTAP 24.10.2013Lab'InSight Toxicological Risk Assessment UCL - LTAP 24.10.2013
Lab'InSight Toxicological Risk Assessment UCL - LTAP 24.10.2013
 
Reverse Engineering of Clinical Trials to Improve Research
Reverse Engineering of Clinical Trials to Improve ResearchReverse Engineering of Clinical Trials to Improve Research
Reverse Engineering of Clinical Trials to Improve Research
 
Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen
 
AORTIC Conference - The Continuous Update Project: Introduction to the Project
AORTIC Conference - The Continuous Update Project: Introduction to the ProjectAORTIC Conference - The Continuous Update Project: Introduction to the Project
AORTIC Conference - The Continuous Update Project: Introduction to the Project
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
 
10-3. Rare disease registries. Samantha Parker (eng)
10-3. Rare disease registries. Samantha Parker (eng)10-3. Rare disease registries. Samantha Parker (eng)
10-3. Rare disease registries. Samantha Parker (eng)
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Ensembl Browser Workshop
Ensembl Browser WorkshopEnsembl Browser Workshop
Ensembl Browser Workshop
 
B Dent2 Reviewfeedbacksession2008
B Dent2 Reviewfeedbacksession2008B Dent2 Reviewfeedbacksession2008
B Dent2 Reviewfeedbacksession2008
 
Grand round whsiao_may2015
Grand round whsiao_may2015Grand round whsiao_may2015
Grand round whsiao_may2015
 
How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality?  - William HsiaoHow Can We Make Genomic Epidemiology a Widespread Reality?  - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
 

More from Association for Computational Linguistics

Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Association for Computational Linguistics
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Association for Computational Linguistics
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Association for Computational Linguistics
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Association for Computational Linguistics
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Association for Computational Linguistics
 

More from Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Recently uploaded

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfSanaAli374401
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 

Recently uploaded (20)

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 

Mariana Neves - 2017 - A parallel collection of clinical trials in Portuguese and English

  • 1. 10th Workshop on Building and Using Comparable Corpora at ACL‘17 Vancouver, Canada A parallel collection of clinical trials in Portuguese and English Mariana Neves August 3rd, 2017
  • 2. Parallel corpora of documents ● Parallel collections of documents are valuable resources for training, tuning and evaluating machine translation (MT) tools. ● However, these are not available for some domains, e.g., biomedicine, and manually creating such collections is an expensive task.
  • 3. Parallel corpora for news vs. biomedical domains (https://ufal.mff.cuni.cz/ufal_medical_corpus)
  • 4. Sources for biomedical documents ● Clinical discharge summaries – Not available due to privacy issues – Usually monolingual (http://home.iprimus.com.au/callanders/Image-21.jpg)
  • 5. Sources for biomedical documents ● Scientific publications – Many are freely available, but frequently monolingual (English) – There are some exceptions, e.g., Scielo, EDP (http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1676-06032017000100201&lng=pt&nrm=iso&tlng=en http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1676-06032017000100201&lng=pt&nrm=iso&tlng=pt)
  • 6. Sources for biomedical documents ● Scientific publications – Scielo corpus: the first parallel (comparable) corpus of scientific publications (Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia. )
  • 7. Sources for biomedical documents ● Scientific publications – Datasets from Scielo and EDP currently being used in the WMT Biomedical Translation Task (http://www.statmt.org/wmt17/biomedical-translation-task.htm)
  • 8. Sources for biomedical documents ● Clinical trials – Freely (publicily) available but usually monolingual (English) (http://clinicaltrials.gov/)
  • 9. Sources for biomedical documents ● Clinical trials – Even sometimes in countries whose native language isn‘t English... (http://reec.aemps.es)
  • 10. Deutschen Register Klinischer Studien (DRKS) (http://www.drks.de) ● Clinical trials – Sometimes parallel documents are available but license doesn‘t allow its distribution
  • 11. Brazilian Clinical Trials Registry / Registro Brasileiro de Ensaios Clínicos (ReBEC) (http://www.ensaiosclinicos.gov.br/)
  • 12. Overview of a clinical trial (http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
  • 13. Overview of a clinical trial (http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
  • 14. Overview of a clinical trial (http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
  • 15. Overview of a clinical trial (http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
  • 16. Overview of a clinical trial (http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)
  • 17. Corpus construction: Pipeline ● Data download ● OpenXML Trials parsing ● Sentence splitting ● Sentence alignment ● Quality checking Similar to : Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia.
  • 19. Data download (120 links as of January 4th)
  • 24. Open XML parsing ● Fields that have been considered: (a) trial identifier (b) public title (c) scientific title (d) interventions to be carried out (e) inclusion criteria for taking part (f) exclusion criteria for not participating (g) primary outcome (h) secondary outcome Final documents based on the concatenation of the various fields following the order in the OpenXML Trials file
  • 25. Sentence splitting ● „Sentence Detector“ models for EN and PT (https://opennlp.apache.org/)
  • 26. Sentence alignment ● Default parameters of GMA ● List of stopwords – EN: http://www.textfixer.com/tutorials/common-english-words.txt – PT: http://www.linguateca.pt/chave/stopwords/chave.MF300.txt and English (http://nlp.cs.nyu.edu/GMA/)
  • 28. Results ● Clinical trials corpus: 1188 documents – EN: 23,843 sentences, 625,881 tokens – PT: 23,666 sentences, 665,325 tokens
  • 29. Results ● Manual validation of 50 trials – 67% of the sentences are correctly aligned – 28% of the sentences were not aligned – 5% of the sentences had some overlap ● In contrast, our results for the Scielo corpus had a >80% correct alignment
  • 30. Discussion ● Many of the wrong alignments were due to shifted sentences – Mainly due to fields being placed in different order due to multiple instances of the same type
  • 31. Discussion ● Some few errors were due to wrong sentence splitting Subject must be at least 18 years of age; males and females with a documented diagnosis of ulcerative colitis (UC) at least 4 months prior to entry into the study; subjects with moderately to severely active UC based on Mayo score criteria; subjects must have failed or be intolerant of at least one of the following treatments for UC: corticosteroids (oral ou intravenous), azathioprine or 6 mercaptopurine (6MP), anti TNF alpha therapy (infliximab ou adalimumab).
  • 32. Discussion ● Some errors were due to splitting of (very) long sentences Diseases which cause damage in the intestinal mucosa, diseases that significantly increase the gastrointestinal transit as infectious enteritis, celiac disease, inflammatory bowel disease (Chron), drug-induced enteritis or radiation, diverticular disease of the colon; History of surgery: heart (whatever), renal (exercises kidney or renal agenesis), intestinal (partial or total removal of the esophagus, stomach, duodenum, jejunum, ileum, ascending colon, transverse colon, descending colon, sigmoid colon or rectum) , liver or pancreas; Volunteers smoking more than five cigarettes a day; different eating habits of the population standard, eg vegetarianism, veganism; History of alcohol consumption or use of drugs of abuse; Made use of antibiotics as regular medication (continuous use) within the 4 weeks preceding the valuation date and / or the start of the breath test; This examination Colonoscopy one month before the breath test H2 expired.
  • 33. Conclusions ● A novel comparable/parallel corpus of clinical trials for EN/PT – Reasonable size, easy to obtain and freely available ● However, further processing is necessary to improve the quality of the corpus. ● Experiments still pending to evaluate its suitability for MT. ● Available at: – https://github.com/biomedical-translation-corpora/wmt-task
  • 34. Thank you! Looking forward to answering your questions! Mariana Neves Current email: Mariana.Lara-Neves@bfr.bund.de Current affiliation: German Federal Institute for Risk Assessment