STEREO: A Pipeline for Extracting Experiment Statistics,
Conditions, and Topics from Scientific Papers
Steffen Epp, Marcel Hoffman, Nicolas Lell, Michael Mohr, Ansgar Scherp |
University of Ulm, Germany | November 24, 2021
Seite 2 Motivation | STEREO | November 24, 2021
Motivation
▶ Reporting of statistics should follow APA style guides
▶ Eases reading, avoids misunderstanding, easy to verify the stats
▶ In practice, however, we see in scientific papers ...
▶ ... style guide deviations: “Physical demand (t(23) = −2.22, p = 0.37) and temporal demand (t(23) = 2.72, p = .012) are significantly different”
→ Statistics not at the end of the sentence and a non-standard leading 0 in the p-value.
▶ ... missing variables: “Similarly fall in hemoglobin was associated with total operative time (r = 0.49, p = 0.003)”
→ No degrees of freedom reported.
▶ ... different naming: “The value of Spearman correlation coefficient showed that the threshold effects did not exist in the adult BAL group (Spearman correlation coefficient : 0.158; P = 0.727)”
→ “Spearman correlation coefficient” instead of r(df).
Example of Extracted Statistics Record
▶ “There was no significant effect for sex, (t(38) = 1.7, p = .097)
despite women attaining higher scores than men”
▶ Extracted: {degreeOfFreedom = 38, statisticVal = 1.7, pvalue = .097, topic = personal data, conditions = {men, women}}
Pipeline
Figure: The STEREO pipeline. Step 0: preprocessing. Step 1: rule-based extraction using active wrapper induction, with evaluation. Step 2: ABAE and GBCE, with evaluation.
Preprocessing:
▶ Split the text into sentences using the regular expression (\.\s?[A-Z]).
▶ Filter out all sentences without numbers.
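The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the slide's pattern means "a period, optional whitespace, then an uppercase letter", and it uses a lookahead so that the uppercase letter stays with the following sentence. The function names are hypothetical.

```python
import re

# Assumed reading of the slide's pattern: a period, optional whitespace,
# then an uppercase letter. The lookahead keeps that letter in the next sentence.
SENTENCE_BOUNDARY = re.compile(r"\.\s?(?=[A-Z])")
HAS_DIGIT = re.compile(r"\d")


def split_sentences(text):
    """Split raw text into sentence candidates at the assumed boundary."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


def sentences_with_numbers(text):
    """Keep only the sentences that contain at least one digit."""
    return [s for s in split_sentences(text) if HAS_DIGIT.search(s)]


sample = (
    "We compared two groups. Temporal demand differed (t(23) = 2.72, p = .012). "
    "No further effects were found."
)
candidates = sentences_with_numbers(sample)
```

Decimal points such as the one in “2.72” are not treated as boundaries because no uppercase letter follows them.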
Statistic Extraction
▶ Two sets of rules: $R^+$ (patterns that are statistics) and $R^-$ (numbers that are not statistics)
▶ For each rule $r^+_i \in R^+$ there is a set of sub-rules $S_i$
▶ Example: “The applications of the protein CD45RA are significantly different (t(23) = −2.22, p = 0.37)”
→ $r^-_j \in R^-$ matches “CD45RA”
→ $r^+_i \in R^+$ matches “(t(23) = −2.22, p = 0.37)”
→ Sub-rules $s_{i1}$ and $s_{i2}$ match parts of “t(23) = −2.22”; $s_{i3}$ matches “p = 0.37”
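The interplay of the two rule sets and the sub-rules can be sketched with regular expressions. The patterns below are hypothetical stand-ins written for the slide's example sentence, not rules from the published rule set, and the field names follow the extracted record shown earlier.

```python
import re

NUM = r"[-−]?\d*\.?\d+"  # accepts ASCII and Unicode minus, leading 0 optional

# Hypothetical R+ rule for a t-test reported as (t(df) = value, p = value).
R_PLUS_T_TEST = re.compile(
    r"\(\s*t\(\d+\)\s*=\s*" + NUM + r"\s*,\s*p\s*[=<>]\s*" + NUM + r"\s*\)"
)

# Hypothetical sub-rules, applied only inside the text matched by the R+ rule.
SUB_RULES = {
    "degreeOfFreedom": re.compile(r"t\((\d+)\)"),
    "statisticVal": re.compile(r"t\(\d+\)\s*=\s*(" + NUM + r")"),
    "pvalue": re.compile(r"p\s*[=<>]\s*(" + NUM + r")"),
}

# Hypothetical R- rule: digits inside protein names such as CD45RA are not statistics.
R_MINUS = [re.compile(r"\bCD\d+[A-Z]*\b")]


def extract_t_test(sentence):
    """Return a record for the first t-test in the sentence, or None."""
    for rule in R_MINUS:
        # Blank out R- matches so their digits cannot be mistaken for statistics.
        sentence = rule.sub(lambda m: " " * len(m.group()), sentence)
    match = R_PLUS_T_TEST.search(sentence)
    if match is None:
        return None
    span = match.group().replace("−", "-")  # normalise the minus sign
    raw = {name: rule.search(span).group(1) for name, rule in SUB_RULES.items()}
    return {
        "degreeOfFreedom": int(raw["degreeOfFreedom"]),
        "statisticVal": float(raw["statisticVal"]),
        "pvalue": float(raw["pvalue"]),
    }


record = extract_t_test(
    "The applications of the protein CD45RA are significantly different "
    "(t(23) = −2.22, p = 0.37)"
)
```

Keeping the sub-rules separate from the enclosing rule means one rule can locate the whole statistic while each sub-rule only has to pick out a single value.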
New Rules
▶ Sentence contains a number without a match
▶ Ask the user to input a new $R^+$ or $R^-$ rule
→ active wrapper induction
▶ Example: in “As we showed in Sec, 4.2 there is ...”, the number 4.2 has no match
→ New $R^-$ rule: $r^-_j$ = “Sec,\s*\d+\.\d+”
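A minimal sketch of this loop, under the assumption that every rule is a compiled regular expression and that a number counts as matched when it lies inside the span of any rule match. The callback `ask_user` is hypothetical and stands in for the interactive prompt.

```python
import re

NUMBER = re.compile(r"\d*\.?\d+")


def uncovered_numbers(sentence, rules):
    """Return the numbers in a sentence that no R+ or R- rule matches."""
    covered = [m.span() for rule in rules for m in rule.finditer(sentence)]
    return [
        m.group()
        for m in NUMBER.finditer(sentence)
        if not any(start <= m.start() and m.end() <= end for start, end in covered)
    ]


def induce_rules(sentences, rules, ask_user):
    """Ask for a new rule whenever a sentence contains an unmatched number.

    ask_user is a hypothetical callback that receives the sentence and the
    unmatched numbers and returns a regular expression string.
    """
    for sentence in sentences:
        missing = uncovered_numbers(sentence, rules)
        if missing:
            rules.append(re.compile(ask_user(sentence, missing)))
    return rules


sentence = "As we showed in Sec, 4.2 there is ..."
before = uncovered_numbers(sentence, [])
rules = induce_rules([sentence], [], lambda s, nums: r"Sec,\s*\d+\.\d+")
after = uncovered_numbers(sentence, rules)
```

Here `before` holds the unmatched number and `after` is empty once the new rule has been added.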
ABAE Method
▶ Attention Based Aspect Extraction (He et al.: An Unsupervised Neural Attention Model for Aspect Extraction, https://aclanthology.org/P17-1036/)
▶ Set the number of topics K and train unsupervised
▶ Manually label each topic from its representative words
Example:
▶ Associating words with an aspect: {day, week, month, hour, wk, period, lasted, time, weekly, elapse, year, thereafter, minute, daily, 24h ... } → Topic: Time
▶ “A negative non-significant relationship between PHQ-9 total score and age 21-29; r (340) = -0.042, p = 0.441 ...” → Topic: Mental Health
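For orientation, the forward pass of ABAE as described by He et al. can be sketched with NumPy: word embeddings are weighted by attention, summed into a sentence embedding, and mapped to a distribution over K topics. The random matrices below are placeholders for trained parameters, the sizes are illustrative, and the function name is an assumption.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def abae_forward(E, M, W, b):
    """Attention and topic distribution for one sentence.

    E: (n_words, d) word embeddings, M: (d, d) attention matrix,
    W: (K, d) and b: (K,) map the sentence embedding to K topics.
    """
    y = E.mean(axis=0)          # average of the word embeddings
    a = softmax(E @ M @ y)      # one attention weight per word
    z = a @ E                   # attention-weighted sentence embedding
    p = softmax(W @ z + b)      # distribution over the K topics
    return a, z, p


rng = np.random.default_rng(0)
n_words, d, K = 6, 8, 30        # illustrative sizes, untrained parameters
E = rng.normal(size=(n_words, d))
a, z, p = abae_forward(E, rng.normal(size=(d, d)), rng.normal(size=(K, d)), np.zeros(K))
topic = int(p.argmax())         # index that would be mapped to a manual label
```

The index in `topic` carries no meaning by itself; the manual labelling step is what turns it into a name such as “Time” or “Mental Health”.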
GBCE Method
▶ Grammar Based Condition Extraction
▶ POS and grammar annotation through spaCy
▶ Rules to identify noun phrases based on common phrases and annotations
Example:
▶ “. . . increase in risk for men was bigger than for women . . . ”
▶ Rule scheme: Noun (subject) + verb + comparative adjective +
than + noun (object)
▶ Conditions: {men, women}
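The rule scheme from the example can be sketched over part-of-speech tags. To stay self-contained, the sketch takes hand-written Penn Treebank tags instead of calling spaCy, and it simplifies the scheme to "nearest noun before the verb" and "first noun after than"; both choices are assumptions, and the actual GBCE rules also use dependency annotations.

```python
def extract_conditions(tagged):
    """tagged: list of (token, fine-grained POS tag) pairs for one sentence."""
    for i in range(1, len(tagged) - 1):
        word, tag = tagged[i]
        is_comparison = tag == "JJR" and tagged[i + 1][0].lower() == "than"
        if not is_comparison or not tagged[i - 1][1].startswith("VB"):
            continue
        # Nearest noun before the verb and first noun after "than".
        left = [w for w, t in tagged[: i - 1] if t.startswith("NN")]
        right = [w for w, t in tagged[i + 2:] if t.startswith("NN")]
        if left and right:
            return {left[-1], right[0]}
    return set()


# Hand-written tags for the slide's example (assumed, not tagger output).
tagged = [
    ("increase", "NN"), ("in", "IN"), ("risk", "NN"), ("for", "IN"),
    ("men", "NNS"), ("was", "VBD"), ("bigger", "JJR"), ("than", "IN"),
    ("for", "IN"), ("women", "NNS"),
]
conditions = extract_conditions(tagged)
```

For the example sentence, `conditions` contains the two compared groups.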
Dataset
▶ CORD-19 dataset, version of 21 September 2020
▶ 108k scientific papers
▶ 16M sentences after preprocessing
▶ 55% of sentences contain at least one digit
Rules Learned by Wrapper Induction
▶ Rules were learned on 500 documents.
▶ 85 $R^+$ and 1,425 $R^-$ rules were found.
▶ On a sample of 10,000 unseen documents they covered 95% of the sentences with digits.
Rule-based Statistics Extraction: Results
Statistic | APA conform | non-APA conform
Student’s t-test | 608 | 179
Pearson Correlation | 113 | 4,962
Spearman Correlation | 1 | 528
ANOVA | 0 | 9
Mann-Whitney U | 2 | 34
Wilcoxon Signed-Rank | 0 | 0
Chi-Square | 14 | 31
not supported | not applied | 19,151
not determinable | not applicable | 87,904
Table: Number of extracted statistics per type. “Not supported” covers e.g. odds ratio and IQR. “Not determinable” covers e.g. a p-value reported alone, where the type of statistic could not be decided.
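As a worked check of the table, the totals and the share of APA-conform statistics can be computed as below. The counts are copied from the table; treating the “not supported” and “not determinable” rows as non-APA conform is an assumption based on the column they appear in.

```python
# (APA conform, non-APA conform) counts copied from the table above.
supported = {
    "Student's t-test": (608, 179),
    "Pearson Correlation": (113, 4962),
    "Spearman Correlation": (1, 528),
    "ANOVA": (0, 9),
    "Mann-Whitney U": (2, 34),
    "Wilcoxon Signed-Rank": (0, 0),
    "Chi-Square": (14, 31),
}
not_supported = 19151
not_determinable = 87904

apa_total = sum(apa for apa, _ in supported.values())
non_apa_supported = sum(non for _, non in supported.values())
# Assumption: the last two rows count as non-APA conform.
all_statistics = apa_total + non_apa_supported + not_supported + not_determinable
apa_share = apa_total / all_statistics
```

Under this assumption, 738 of 113,536 extracted statistics are APA conform, which is well below 1%.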
Assessment of Statistic Extraction Results
Statistic | APA conform | non-APA conform
Student’s t-test | 1.0 | 0.91
Pearson Correlation | 1.0 | 0.98
Spearman Correlation | 1.0 | 1.0
ANOVA | n/a | 1.0
Mann-Whitney U | 1.0 | 1.0
Wilcoxon Signed-Rank | n/a | n/a
Chi-Square | 1.0 | 0.97
other | - | 0.95
Table: Precision per statistic type, calculated on 200 samples. If fewer than 200 samples were extracted, the precision was calculated on all extracted samples.
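The sampling rule in the caption can be sketched as follows. The function names, the fixed seed, and the use of placeholder integers for extractions are assumptions made for illustration; the manual correctness judgements themselves cannot be reproduced in code.

```python
import random


def evaluation_sample(extractions, max_samples=200, seed=0):
    """Draw at most max_samples extractions for manual checking."""
    k = min(max_samples, len(extractions))
    return random.Random(seed).sample(extractions, k)


def precision(judgements):
    """Share of checked extractions judged correct; None if nothing was extracted."""
    if not judgements:
        return None  # corresponds to "n/a" in the table
    return sum(judgements) / len(judgements)


anova_sample = evaluation_sample(list(range(9)))        # only 9 extractions exist
t_test_sample = evaluation_sample(list(range(179)))     # fewer than 200 exist
pearson_sample = evaluation_sample(list(range(4962)))   # capped at 200
```

A precision based on 9 samples, as for non-APA ANOVA, is far less stable than one based on 200.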
Topic Extraction from Experiments using ABAE: Results
emb (embedding data) | train (model training data) | K | Correct APA (of 100) | Correct non-APA (of 100)
supp-sen | supp-sen | 15 | 33 | 31
supp-sen | supp-sen | 30 | 75 | 73
all-sen | supp-sen | 15 | 51 | 57
all-sen | supp-sen | 30 | 48 | 49
supp-sen: sentences with supported statistics; all-sen: all sentences.
▶ The best result was achieved with K = 30 and with both the embedding and the model trained only on sentences with our supported statistics (as opposed to sentences with any statistics, or all sentences).
▶ This model correctly classified 75/100 APA conform and 73/100 non-APA conform sentences.
▶ Example topics: {risk factors, statistics, mental health}
Grammar-based Condition Extraction: Results
GBCE | Result APA | Result non-APA
Correctly classified | 46 | 30
Reason 1: Failed grammar | 4 | 5
Reason 2: Sentence structure | 10 | 3
Reason 3: Preprocessing error | 9 | 12
Reason 4: Dependency parser | 18 | 2
Reason 5: GBCE miss | 25 | 47
Table: Number of correctly extracted experimental conditions and reasons why the extraction failed. In several samples, a combination of reasons was the cause.
Generalization
▶ The whole approach should transfer to other domains.
▶ Depending on the domain, fine-tuning of the $R^-$ rules and adding new statistic types to the $R^+$ rules may be necessary.
▶ GBCE should generalize well to other domains, since it is based on English grammar.
▶ ABAE can be transferred to similar domains; for different domains it needs to be re-trained.
Threat to Validity/Reproducibility
▶ For some statistics we extracted just a few samples (e.g. ANOVA), so these results may not be representative.
▶ It is possible that topic or condition extraction requires more than one sentence to capture the context.
▶ In ABAE, the inference of topics could lead to errors, but all terms were checked to reduce this possibility.
▶ An extended version of the paper can be found on arXiv: https://arxiv.org/abs/2103.14124
▶ For reproducibility, the code and rule set are publicly available at github.com/Foisunt/STEREO
Summary of our Results
▶ High-quality statistics extraction (100% precision on APA conform and 95% on non-APA conform sentences).
▶ The vast majority of statistics (> 99%) are not strictly APA conform.
▶ Some ABAE models found “statistics” or “result reporting” topics, which are technically correct but not useful.
▶ We found no parameter setting for ABAE (embedding, K) that clearly worked better in general.
▶ GBCE works better on APA conform sentences, because on average they have a better grammatical structure.
Thank you! Questions?

Clean In Place(CIP).pptx .
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

STEREO: A Pipeline for Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers

  • 1. STEREO: A Pipeline for Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers Steffen Epp, Marcel Hoffman, Nicolas Lell, Michael Mohr, Ansgar Scherp | University of Ulm, Germany | November 24, 2021
  • 2. Motivation
    ▶ Reporting of statistics should follow APA style guides
    ▶ This eases reading, avoids misunderstandings, and makes the statistics easy to verify
    ▶ In practice, however, scientific papers show ...
    ▶ ... style guide deviations: “Physical demand (t(23) = −2.22, p = 0.37) and temporal demand (t(23) = 2.72, p = .012) are significantly different” → statistics not at the end of the sentence, and a non-standard leading 0 in the p-value
    ▶ ... missing variables: “Similarly fall in hemoglobin was associated with total operative time (r = 0.49, p = 0.003)” → no degrees of freedom reported
    ▶ ... different naming: “The value of Spearman correlation coefficient showed that the threshold effects did not exist in the adult BAL group (Spearman correlation coefficient : 0.158; P = 0.727)” → “Spearman correlation coefficient” instead of r(df)
  • 3. Motivation: Example of Extracted Statistics Record
    ▶ “There was no significant effect for sex, (t(38) = 1.7, p = .097) despite women attaining higher scores than men”
    ▶ Extracted: {degreeOfFreedom = 38, statisticVal = 1.7, pvalue = .097, topic = personal data, conditions = {men, women}}
  • 4. Methods: Pipeline
    [Figure: The STEREO pipeline. Labels: Step 0, preprocessing; Step 1, rule-based extraction using an active wrapper; Step 2, ABAE and GBCE; evaluation.]
    Preprocessing:
    ▶ Split the text into sentences using the regular expression (\.\s?[A-Z]).
    ▶ Filter out all sentences without numbers.
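The preprocessing step can be sketched in a few lines of Python. This is only an illustration under assumptions: the boundary pattern is read as "a period, optional whitespace, then a capital letter" (the backslashes appear to be lost in the transcript), and lookarounds are used here so that the period and the capital letter stay in the sentences. The function names are hypothetical and not taken from the STEREO code.

```python
import re

# Assumed reading of the slide's pattern: ".", optional whitespace, capital letter.
# Lookbehind/lookahead keep the period and the capital letter in the output.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s?(?=[A-Z])")
HAS_DIGIT = re.compile(r"\d")


def split_sentences(text):
    """Split text at every period that is followed by a capital letter."""
    parts = SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


def sentences_with_numbers(text):
    """Keep only the sentences that contain at least one digit."""
    return [s for s in split_sentences(text) if HAS_DIGIT.search(s)]


text = "We ran a test. The effect was t(38) = 1.7, p = .097. No more."
print(sentences_with_numbers(text))
```

A pattern this simple also splits after abbreviations that are followed by a capital letter, which is one plausible source of the preprocessing errors reported later in the talk.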
  • 5. Methods: Statistic Extraction
    ▶ Two sets of rules: $R^+$ and $R^-$
    ▶ For each rule $r_i^+ \in R^+$ there is a set of sub-rules $S_i$
    $$\text{The applications of the protein } \overbrace{\text{CD45RA}}^{R^-} \text{ are significantly different } \underbrace{(t(23) = -2.22,\ p = 0.37)}_{R^+}$$
  • 6. Methods: Statistic Extraction
    ▶ Two sets of rules: $R^+$ and $R^-$
    ▶ For each rule $r_i^+ \in R^+$ there is a set of sub-rules $S_i$
    $$\text{The applications of the protein } \overbrace{\text{CD45RA}}^{r_j^-} \text{ are significantly different } \underbrace{(\,\overbrace{t(23)}^{s_{i1}} \underbrace{= -2.22}_{s_{i2}},\ \underbrace{p = 0.37}_{s_{i3}}\,)}_{r_i^+}$$
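A minimal sketch of how the two rule sets and the sub-rules could interact, assuming that rules are regular expressions, that an R− match marks digits which are not statistics, and that each sub-rule extracts one field from an R+ match. All patterns and names below are hypothetical illustrations, not the rules from the published ruleset.

```python
import re

# Hypothetical R- rule: identifiers such as "CD45RA" whose digits are not statistics.
R_MINUS = [re.compile(r"\b[A-Z]{2,}\d+[A-Z]*\b")]

# Hypothetical R+ rule for a parenthesised t-test, with one sub-rule per field.
NUMBER = r"[-−]?\d*\.?\d+"
R_PLUS = [
    (
        "Student's t-test",
        re.compile(
            r"\(\s*t\s*\(\s*\d+\s*\)\s*=\s*" + NUMBER
            + r"\s*,\s*p\s*[=<>]\s*\d*\.?\d+\s*\)"
        ),
        {
            "degreeOfFreedom": re.compile(r"t\s*\(\s*(\d+)\s*\)"),
            "statisticVal": re.compile(r"t\s*\(\s*\d+\s*\)\s*=\s*(" + NUMBER + ")"),
            "pvalue": re.compile(r"p\s*[=<>]\s*(\d*\.?\d+)"),
        },
    )
]


def extract(sentence):
    """Return one record per R+ match, after masking everything R- covers."""
    for rule in R_MINUS:
        # Blank out R- matches so their digits cannot trigger an R+ rule.
        sentence = rule.sub(lambda m: " " * len(m.group()), sentence)
    records = []
    for name, rule, sub_rules in R_PLUS:
        for match in rule.finditer(sentence):
            record = {"statistic": name}
            for field, sub_rule in sub_rules.items():
                hit = sub_rule.search(match.group())
                if hit:
                    # Normalise the typographic minus sign to ASCII.
                    record[field] = hit.group(1).replace("−", "-")
            records.append(record)
    return records


sentence = ("The applications of the protein CD45RA are significantly "
            "different (t(23) = −2.22, p = 0.37)")
print(extract(sentence))
```

Keeping the sub-rules separate from the enclosing rule means the same field extractors can be reused when a new reporting variant of the same test is added.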
  • 7. Methods: New Rules
    ▶ A sentence contains a number that no rule matches
    ▶ Ask the user to input a new $R^+$ or $R^-$ rule → active wrapper induction
    ▶ “As we showed in Sec, $\overbrace{4.2}^{\text{no match}}$ there is ... ” → new $R^-$ rule: $r_j^-$ = “Sec,\s*\d+\.\d+”
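The trigger for asking the user can be illustrated as follows. This sketch assumes that rules are compiled regular expressions and that a number counts as unmatched when no rule covers all of its characters; the helper name is hypothetical, and the R− rule is the one from the slide with the backslashes that the transcript appears to have lost restored (an assumption).

```python
import re


def unmatched_numbers(sentence, r_plus, r_minus):
    """Return the numbers in the sentence that no R+ or R- rule covers."""
    covered = [False] * len(sentence)
    for rule in list(r_plus) + list(r_minus):
        for match in rule.finditer(sentence):
            for i in range(match.start(), match.end()):
                covered[i] = True
    return [
        m.group()
        for m in re.finditer(r"\d+(?:\.\d+)?", sentence)
        if not all(covered[m.start():m.end()])
    ]


r_plus, r_minus = [], []
sentence = "As we showed in Sec, 4.2 there is ..."

# No rule exists yet, so "4.2" is reported and the user would be asked for a rule.
before = unmatched_numbers(sentence, r_plus, r_minus)

# Here the user's answer is simulated by adding the R- rule directly.
r_minus.append(re.compile(r"Sec,\s*\d+\.\d+"))
after = unmatched_numbers(sentence, r_plus, r_minus)

print(before, after)
```

Repeating this check over a document collection until no sentence reports an unmatched number is one way to read the induction loop described on the slide.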
  • 8. Methods: ABAE Method
    ▶ Attention-Based Aspect Extraction [1]
    ▶ Set the number of topics K and train unsupervised
    ▶ Manually label each topic based on its representative words
    Example:
    ▶ Associating words with an aspect: {day, week, month, hour, wk, period, lasted, time, weekly, elapse, year, thereafter, minute, daily, 24h ... } → Topic: Time
    ▶ “A negative non-significant relationship between PHQ-9 total score and age 21-29; r (340) = -0.042, p = 0.441 ...” → Topic: Mental Health
    [1] He et al.: An Unsupervised Neural Attention Model for Aspect Extraction, https://aclanthology.org/P17-1036/
  • 9. Methods: GBCE Method
    ▶ Grammar-Based Condition Extraction
    ▶ POS and grammar annotation through spaCy
    ▶ Rules identify noun phrases based on common phrases and annotations
    Example:
    ▶ “... increase in risk for men was bigger than for women ...”
    ▶ Rule scheme: noun (subject) + verb + comparative adjective + than + noun (object)
    ▶ Conditions: {men, women}
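The rule scheme can be sketched on pre-tagged tokens. This is a simplified illustration, not the GBCE implementation: the tokens carry hand-written Penn Treebank tags of the kind an English POS tagger would assign (JJR for a comparative adjective, NN/NNS for nouns, VB* for verbs), the dependency information that GBCE also uses is ignored, and the function name is hypothetical.

```python
def extract_conditions(tagged):
    """Apply: noun + verb + comparative adjective + 'than' + noun."""
    for i, (word, tag) in enumerate(tagged):
        if word.lower() != "than" or i < 2:
            continue
        # "than" must follow a comparative adjective, which must follow a verb.
        if tagged[i - 1][1] != "JJR" or not tagged[i - 2][1].startswith("VB"):
            continue
        # Nearest noun to the left of the verb and to the right of "than".
        left = next(
            (w for w, t in reversed(tagged[: i - 2]) if t.startswith("NN")), None
        )
        right = next((w for w, t in tagged[i + 1:] if t.startswith("NN")), None)
        if left and right:
            return {left, right}
    return set()


# Hand-annotated tags for the slide's example sentence (an assumption).
tagged = [
    ("increase", "NN"), ("in", "IN"), ("risk", "NN"), ("for", "IN"),
    ("men", "NNS"), ("was", "VBD"), ("bigger", "JJR"), ("than", "IN"),
    ("for", "IN"), ("women", "NNS"),
]
print(extract_conditions(tagged))
```

Taking the nearest noun on each side is a shortcut that happens to work for this sentence; a rule based on the parse tree would be needed to separate the grammatical subject from the compared groups in general.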
  • 10. Results: Dataset
    ▶ CORD-19 dataset, version of September 21, 2020
    ▶ 108k scientific papers
    ▶ 16M sentences after preprocessing
    ▶ 55% of the sentences contain at least one digit
  • 11. Results: Rules Learned by Wrapper Induction
    ▶ Rules were learned on 500 documents.
    ▶ 85 $R^+$ and 1,425 $R^-$ rules were found.
    ▶ On a sample of 10,000 unseen documents, they covered 95% of the sentences with digits.
  • 12. Results: Rule-based Statistics Extraction
    Statistic               APA conform      non-APA conform
    Student’s t-test        608              179
    Pearson Correlation     113              4,962
    Spearman Correlation    1                528
    ANOVA                   0                9
    Mann-Whitney U          2                34
    Wilcoxon Signed-Rank    0                0
    Chi-Square              14               31
    not supported           not applied      19,151
    not determinable        not applicable   87,904
    Table: Number of statistics extracted per type. Not supported are, e.g., odds ratio and IQR. Not determinable are, e.g., solely reported p-values where the type of statistic could not be decided.
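The counts in this table can be related to the share of non-APA-conform statistics quoted in the summary of the talk. The arithmetic below is a sketch under one assumption, namely that the "not supported" and "not determinable" rows are counted as non-APA conform; the variable names are illustrative.

```python
# (APA conform, non-APA conform) counts per supported statistic type.
counts = {
    "Student's t-test": (608, 179),
    "Pearson Correlation": (113, 4962),
    "Spearman Correlation": (1, 528),
    "ANOVA": (0, 9),
    "Mann-Whitney U": (2, 34),
    "Wilcoxon Signed-Rank": (0, 0),
    "Chi-Square": (14, 31),
}
not_supported = 19151
not_determinable = 87904

apa = sum(a for a, _ in counts.values())
non_apa_supported = sum(n for _, n in counts.values())
non_apa_all = non_apa_supported + not_supported + not_determinable

# Share among the supported statistic types only.
share_supported = non_apa_supported / (apa + non_apa_supported)
# Share when the last two rows are counted as non-APA conform (assumption).
share_all = non_apa_all / (apa + non_apa_all)

print(f"supported types only: {share_supported:.1%}")
print(f"all extracted statistics: {share_all:.1%}")
```

Under this assumption the share exceeds 99%, whereas restricted to the seven supported types it is lower, so the quoted figure appears to include the last two rows.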
  • 13. Results: Assessment of Statistic Extraction
    Statistic               APA conform   non-APA conform
    Student’s t-test        1.0           0.91
    Pearson Correlation     1.0           0.98
    Spearman Correlation    1.0           1.0
    ANOVA                   n/a           1.0
    Mann-Whitney U          1.0           1.0
    Wilcoxon Signed-Rank    n/a           n/a
    Chi-Square              1.0           0.97
    other                   -             0.95
    Table: Precision per statistic type, calculated on 200 samples. If fewer than 200 samples were extracted, the precision was calculated on the number of samples available.
  • 14. Results: Topic Extraction from Experiments using ABAE
    emb        train      K    Result APA   Result non-APA
    supp-sen   supp-sen   15   33           31
    supp-sen   supp-sen   30   75           73
    all-sen    supp-sen   15   51           57
    all-sen    supp-sen   30   48           49
    ▶ The best result was achieved with K = 30 and with both the embedding and the model trained only on sentences containing our supported statistics (as opposed to sentences with any statistics, or all sentences).
    ▶ This model correctly classified 75/100 APA-conform and 73/100 non-APA-conform sentences.
    ▶ Example topics: {risk factors, statistics, mental health}
  • 15. Results: Grammar-based Condition Extraction
    GBCE                            Result APA   Result non-APA
    Correctly classified            46           30
    Reason 1: Failed grammar        4            5
    Reason 2: Sentence structure    10           3
    Reason 3: Preprocessing error   9            12
    Reason 4: Dependency parser     18           2
    Reason 5: GBCE miss             25           47
    Table: Number of correctly extracted experimental conditions and reasons why the extraction failed. In several samples, a combination of reasons was the cause.
  • 16. Discussion: Generalization
    ▶ The whole approach should transfer to other domains.
    ▶ Depending on the domain, fine-tuning the $R^-$ rules and adding new statistic types to the $R^+$ rules may be necessary.
    ▶ GBCE should generalize well to other domains, since it is based on English grammar.
    ▶ ABAE can be transferred to similar domains; for different domains it needs to be re-trained.
  • 17. Discussion: Threats to Validity/Reproducibility
    ▶ For some statistics (e.g., ANOVA) we extracted only a few samples, so these results may not be representative.
    ▶ Topic or condition extraction may require more than one sentence to capture the context.
    ▶ In ABAE, the inference of topics could lead to errors, but all terms were checked to reduce this possibility.
    ▶ An extended version of the paper can be found on arXiv [2].
    ▶ For reproducibility, the code and ruleset are publicly available at github.com/Foisunt/STEREO
    [2] https://arxiv.org/abs/2103.14124
  • 18. Discussion: Summary of our Results
    ▶ High-quality statistics extraction (100% precision on APA-conform and 95% on non-APA-conform sentences)
    ▶ The vast majority of statistics (> 99%) are not strictly APA conform.
    ▶ Some ABAE models found “statistics” or “result reporting” topics, which are technically correct but not useful.
    ▶ We found no parameter setting for ABAE (embedding, K) that clearly worked better in general.
    ▶ GBCE works better on APA-conform sentences, because on average they have a better grammatical structure.
    Thank you! Questions?