• Scientific literature has grown exponentially over the past decades.
• Abstracts
• High-level key phrases
• Theories and models
• Automatic extraction of theories and models can:
  ◦ Facilitate building knowledge graphs
  ◦ Be incorporated into the faceted search feature of an academic search engine
  ◦ Support literature analysis, such as domain development and innovation composition
 We focus on extracting object names and aspects from figure captions.
• Theory entity extraction has not been extensively explored.
• No labeled data exist for extracting theory entities in social and behavioral science (SBS) domains.
  ◦ Crowdsourcing is not appropriate
  ◦ Manual annotation is time-consuming
• We propose to use distant supervision to address the data sparsity problem of theory extraction.
  ◦ Use an existing database, such as Wikipedia, to collect instances of entity mentions.
  ◦ Then use these instances to automatically generate our training data.
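The idea of distant supervision can be illustrated with a minimal sketch: seed phrases taken from an existing resource are matched against unlabeled sentences, and every sentence that mentions a seed becomes a weakly labeled training candidate. The function name, the seed phrases, and the sentences below are hypothetical and chosen only for illustration; the slides do not show the actual matching code.

```python
import re
from typing import Dict, List


def find_seed_mentions(sentences: List[str], seeds: List[str]) -> Dict[str, List[str]]:
    """Map each sentence that mentions a seed phrase to the seeds it contains."""
    # One case-insensitive, word-bounded pattern per seed phrase.
    patterns = {
        seed: re.compile(r"\b" + re.escape(seed) + r"\b", re.IGNORECASE)
        for seed in seeds
    }
    matches: Dict[str, List[str]] = {}
    for sentence in sentences:
        found = [seed for seed, pattern in patterns.items() if pattern.search(sentence)]
        if found:
            matches[sentence] = found
    return matches


# Hypothetical seeds and sentences, for illustration only.
seeds = ["prospect theory", "social learning theory"]
sentences = [
    "Our hypotheses build on Prospect Theory.",
    "Participants completed a short survey.",
    "Social learning theory and prospect theory make different predictions.",
]
weak_labels = find_seed_mentions(sentences, seeds)
```

Sentences without a seed mention are simply left out, which is why the approach needs no manual labeling but can only find mentions that the seed list already covers.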
• The pipeline is composed of the following five modules:
  ◦ Web Scraping
  ◦ Obtaining Body Text
  ◦ Sentence Segmentation
  ◦ Elasticsearch
  ◦ Automatic Annotation
• The pipeline is composed of the following five modules:
1. Web Scraping
   • Use a web scraper to obtain theory and model entities from Wikipedia webpages.
   • A heuristic filter keeps only phrases ending with head words such as “theory”, “model”, “concept”...
2. Obtaining Body Text
   • Use GROBID to convert PDF documents into XML format and keep the body text.
3. Sentence Segmentation
   • Use Stanza to segment the body text of papers into sentences.
   • This yields 870,000 sentences from the 2400 papers.
4. Elasticsearch
   • The sentences are indexed by Elasticsearch.
5. Automatic Annotation
   • The seed theory mentions obtained in Web Scraping are used to query the Elasticsearch index.
   • The retrieved sentences are represented in the BIO (Begin, Inside, Outside) schema.
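Two of the steps above, the head-word filter and the BIO labeling, can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the pipeline's actual code: the head-word set contains only the three words named on the slide, tokenization is plain whitespace splitting, and the function names and example phrases are hypothetical.

```python
from typing import List

# Only the head words named on the slide; the full list is not given there.
HEAD_WORDS = {"theory", "model", "concept"}


def keep_candidate(phrase: str) -> bool:
    """Heuristic filter: keep a phrase only if its last word is a known head word."""
    words = phrase.lower().split()
    return bool(words) and words[-1] in HEAD_WORDS


def bio_tag(tokens: List[str], seeds: List[List[str]]) -> List[str]:
    """Tag tokens with B, I, or O by matching seed phrases, longest seed first."""
    tags = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    for seed in sorted(seeds, key=len, reverse=True):
        n = len(seed)
        if n == 0:
            continue  # ignore empty seeds
        for start in range(len(tokens) - n + 1):
            # Do not overwrite tokens already claimed by a longer seed.
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if window_free and lowered[start:start + n] == seed:
                tags[start] = "B"
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return tags


# Hypothetical scraped candidates and sentence, for illustration only.
candidates = ["Prospect theory", "Technology acceptance model", "Daniel Kahneman"]
seeds = [c.lower().split() for c in candidates if keep_candidate(c)]
tokens = "We test prospect theory against the technology acceptance model .".split()
tags = bio_tag(tokens, seeds)
# tags == ["O", "O", "B", "I", "O", "O", "B", "I", "I", "O"]
```

Matching the longest seeds first is a design choice of this sketch, so that a short seed nested inside a longer one does not split the longer mention.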
Automatic Annotation
• The parent sample comes from the Defense Advanced Research Projects Agency (DARPA) program ‘Systematizing Confidence in Open Research and Evidence’ (SCORE). It contains approximately 30,000 articles published from 2009 to 2018 in 62 major SBS journals in psychology, economics, politics, management, education, etc.
• We obtain the text for labeling from a random sample of 2400 SBS papers.
• The ground truth dataset contains 4534 sentences with 550 unique theory mentions, automatically annotated by the pipeline.
• We compare four deep neural network architectures: BiLSTM, BiLSTM-CRF, Transformer, and GCN.
• BiLSTM: The BiLSTM architecture analyzes the contextual dependencies of each token in both the forward and backward directions, then assigns each token a label based on the probability scores of each tag.
• BiLSTM-CRF: BiLSTM can work together with a Conditional Random Field (CRF) layer, which labels a token based on its own features and on the features and labels of nearby tokens.
• Transformer: A transformer model predicts token labels from the features of surrounding tokens, attending to them simultaneously through a multi-head attention mechanism.
• GCN: A Graph Convolutional Network (GCN) applies convolution-style operations to graph-structured data.
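To make the difference between per-token labeling and CRF-style labeling concrete, here is a small Viterbi decoding sketch. It assumes hand-written emission scores (standing in for BiLSTM outputs) and a single hand-set transition rule that forbids an I tag directly after O or at the start of a sentence; in a trained CRF the transition scores are learned, so this shows the decoding idea only, not the models compared on the slide.

```python
from typing import Dict, List

TAGS = ["B", "I", "O"]
NEG_INF = float("-inf")


def transition_score(prev: str, curr: str) -> float:
    """Forbid O -> I, since a mention must begin with B; other moves are free."""
    return NEG_INF if (prev == "O" and curr == "I") else 0.0


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence under emission + transition scores."""
    # best[t][tag] = best score of any path that ends in `tag` at position t.
    # A sentence may not start with I.
    best = [{tag: emissions[0][tag] + (NEG_INF if tag == "I" else 0.0) for tag in TAGS}]
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        row: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transition_score(p, curr))
            row[curr] = best[-1][prev] + transition_score(prev, curr) + scores[curr]
            pointers[curr] = prev
        best.append(row)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical log-scores for the tokens "uses prospect theory".
emissions = [
    {"B": -2.0, "I": -3.0, "O": -0.1},
    {"B": -1.0, "I": -0.8, "O": -1.2},
    {"B": -2.0, "I": -0.2, "O": -1.5},
]
greedy = [max(TAGS, key=lambda tag: e[tag]) for e in emissions]  # per-token argmax
decoded = viterbi(emissions)
```

With these scores the per-token argmax yields O, I, I, which is not a valid BIO sequence, while the constrained decoding yields O, B, I.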
• Time efficiency
  ◦ Less than half an hour to go through 870,000 sentences and check whether they contain any of the 550 theory phrases
  ◦ Two hours to annotate the 4534 sentences
• Extraction results on a small dataset:
  ◦ All theory names extracted from these sentences were new.
  ◦ In particular, about 42% of them contain head words that were not in the heuristic filter.
Conclusion
• We proposed a trainable framework that extracts theory and model mentions from scientific papers using distant supervision.
• We created a new benchmark corpus of 4534 annotated sentences from papers in SBS domains. This dataset can be used to train and evaluate future theory extraction models.
• We compared several NER neural architectures and investigated their dependence on pre-trained language models. The empirical results indicated that the RoBERTa-BiLSTM-CRF architecture achieved the best performance, with an F1 score of 77.21% and a precision of 89.72%.
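The slide reports F1 and precision but not recall. Assuming the standard definition F1 = 2PR / (P + R) and that both figures refer to the same evaluation and averaging scheme, the implied recall can be recovered by solving for R. The function name is hypothetical, and the result is a derived estimate, not a number reported by the authors.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, with F1 and P given as fractions."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than twice the precision")
    return f1 * precision / denominator


# F1 and precision as reported on the slide, converted to fractions.
implied_recall = recall_from_f1(0.7721, 0.8972)
print(f"implied recall: {implied_recall:.2%}")
```

Under these assumptions the implied recall is roughly 68%, which would mean the best model misses more mentions than it mislabels; rounding in the reported figures may shift the value slightly.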
