This document provides an overview of biomedical natural language processing (BioNLP) and its applications. It discusses what BioNLP is, common data sources like biomedical literature and clinical notes, and key NLP tasks such as named entity recognition, information extraction, and question answering. Deep learning methods for tasks like document classification and information extraction are presented. Popular BioNLP tools, packages, and benchmarks are also mentioned. Finally, examples of applying NLP to analyze gene expression data from repositories like GEO are provided.
Automatic extraction of microorganisms and their habitats from free text usin...Catherine Canevet
This document summarizes an approach to automatically extract microorganisms and their habitats from free text using text mining workflows. The approach uses a named entity recognizer combining dictionaries and machine learning to identify organisms and habitats. It then employs relation mining to extract sentences expressing relationships between organisms and habitats. Evaluation shows the organism recognition achieves 84% precision but habitat recognition is lower at 68% precision, limiting the relation mining performance. Ongoing work aims to improve the habitat model and make the overall approach more robust to noise from PDF to text conversion.
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseHilmar Lapp
Jane's lab uses a freely available open source software package called PhyloDOM to create a phylogenetic database of their molecular data resolving the phylogeny of endangered frog species, allowing their results to be easily shared and integrated by other researchers through web interfaces, data aggregators, and visualization tools that take advantage of standardized metadata.
SooryaKiran Bioinformatics is a global bioinformatics solutions provider that focuses on customized bioinformatics services and products. It develops algorithms and software for biological sequence analysis, structure prediction, and other areas. Key products include tools for sequence generation, analysis, and homology identification. The company collaborates with research institutions and has provided solutions for SNP analysis, genome analysis, and mitochondrial DNA analysis to clients around the world.
Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology. It involves using computational approaches to analyze and interpret biological data such as DNA, RNA, protein sequences, and 3D protein structures. Some key areas where bioinformatics is applied include genomics, proteomics, systems biology, and phenotype analysis. It uses computational tools and algorithms like BLAST, FASTA, Biopython, and BioJava to solve biological problems. Current applications of bioinformatics in studying COVID-19 include detecting and tracking the virus, analyzing sequencing data, evaluating containment measures, studying virus evolution, and developing drug targets and therapies.
This document provides a summary of Satanjeev Banerjee's education, work experience, areas of research interest, publications, software projects, and references. It details his PhD studies in language technologies at Carnegie Mellon University, as well as his master's degrees from CMU and the University of Minnesota Duluth. His work experience includes research assistantships at CMU working on topics like speech summarization and meeting understanding. He has numerous publications in these areas and has developed software like the SmartNotes meeting recording system and the METEOR machine translation evaluation metric.
DNA Query Language DNAQL: A Novel ApproachEditor IJCATR
This document describes a proposed DNA Query Language (DNAQL) that could allow researchers to query DNA databases in a more intuitive way compared to SQL. The DNAQL is presented as a novel approach that would introduce an abstraction layer between the user interface and database, translating queries in a familiar biochemical language into SQL understood by the database. This could help alleviate challenges biochemistry researchers face in accessing and analyzing protein data from databases. The document provides background on DNA computing and challenges in current databases, reviews related work on biological database querying, and describes the proposed methodology for DNA sequencing and DNAQL.
Experiences with logic programming in bioinformaticsChris Mungall
This document discusses experiences applying logic programming techniques in bioinformatics. It describes Obol, a system that used definite clause grammars to parse biological terms, and Blipkit, a reusable bioinformatics toolkit built for SWI-Prolog. Blipkit includes domain models, I/O modules, and tools for integrating with relational databases and web services. The document discusses applications of logic programming for tasks like genome inference, phenotype matching, and consistency checking biological data. It evaluates different logic programming approaches for representing genomic data and rules.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
Automatic extraction of microorganisms and their habitats from free text usin...Catherine Canevet
This document summarizes an approach to automatically extract microorganisms and their habitats from free text using text mining workflows. The approach uses a named entity recognizer combining dictionaries and machine learning to identify organisms and habitats. It then employs relation mining to extract sentences expressing relationships between organisms and habitats. Evaluation shows the organism recognition achieves 84% precision but habitat recognition is lower at 68% precision, limiting the relation mining performance. Ongoing work aims to improve the habitat model and make the overall approach more robust to noise from PDF to text conversion.
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseHilmar Lapp
Jane's lab uses a freely available open source software package called PhyloDOM to create a phylogenetic database of their molecular data resolving the phylogeny of endangered frog species, allowing their results to be easily shared and integrated by other researchers through web interfaces, data aggregators, and visualization tools that take advantage of standardized metadata.
SooryaKiran Bioinformatics is a global bioinformatics solutions provider that focuses on customized bioinformatics services and products. It develops algorithms and software for biological sequence analysis, structure prediction, and other areas. Key products include tools for sequence generation, analysis, and homology identification. The company collaborates with research institutions and has provided solutions for SNP analysis, genome analysis, and mitochondrial DNA analysis to clients around the world.
Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology. It involves using computational approaches to analyze and interpret biological data such as DNA, RNA, protein sequences, and 3D protein structures. Some key areas where bioinformatics is applied include genomics, proteomics, systems biology, and phenotype analysis. It uses computational tools and algorithms like BLAST, FASTA, Biopython, and BioJava to solve biological problems. Current applications of bioinformatics in studying COVID-19 include detecting and tracking the virus, analyzing sequencing data, evaluating containment measures, studying virus evolution, and developing drug targets and therapies.
This document provides a summary of Satanjeev Banerjee's education, work experience, areas of research interest, publications, software projects, and references. It details his PhD studies in language technologies at Carnegie Mellon University, as well as his master's degrees from CMU and the University of Minnesota Duluth. His work experience includes research assistantships at CMU working on topics like speech summarization and meeting understanding. He has numerous publications in these areas and has developed software like the SmartNotes meeting recording system and the METEOR machine translation evaluation metric.
DNA Query Language DNAQL: A Novel ApproachEditor IJCATR
This document describes a proposed DNA Query Language (DNAQL) that could allow researchers to query DNA databases in a more intuitive way compared to SQL. The DNAQL is presented as a novel approach that would introduce an abstraction layer between the user interface and database, translating queries in a familiar biochemical language into SQL understood by the database. This could help alleviate challenges biochemistry researchers face in accessing and analyzing protein data from databases. The document provides background on DNA computing and challenges in current databases, reviews related work on biological database querying, and describes the proposed methodology for DNA sequencing and DNAQL.
Experiences with logic programming in bioinformaticsChris Mungall
This document discusses experiences applying logic programming techniques in bioinformatics. It describes Obol, a system that used definite clause grammars to parse biological terms, and Blipkit, a reusable bioinformatics toolkit built for SWI-Prolog. Blipkit includes domain models, I/O modules, and tools for integrating with relational databases and web services. The document discusses applications of logic programming for tasks like genome inference, phenotype matching, and consistency checking biological data. It evaluates different logic programming approaches for representing genomic data and rules.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
This document discusses the discovery that DNA previously thought to have no value ("junk DNA") may actually play important roles in gene expression regulation. Scientists investigated junk DNA in the model plant Arabidopsis thaliana and found short, linked patterns of DNA called pyknons. This suggests a universal genetic mechanism is at play across biology that is not yet fully understood. The discovery illustrates the connection between coding and non-coding DNA and that the term "junk DNA" may need reevaluation.
The document discusses using inductive logic programming (ILP) to perform information extraction (IE) on biomedical texts. It outlines applying ILP to learn recursive theories from annotated examples to fill slots in templates and extract entities. The learning strategy involves searching for theories for each concept in parallel before discovering dependencies between concepts. Text processing involves tokenization, POS tagging and domain dictionaries before description generation for ILP.
The document discusses using inductive logic programming (ILP) to perform information extraction (IE) on biomedical texts. It outlines applying ILP to learn recursive theories from annotated examples to fill slots in templates and extract entities. The learning strategy involves searching for theories for each concept in parallel before discovering dependencies between concepts. Text processing involves annotating documents before using features of tokens for ILP learning.
Literature Based Framework for Semantic Descriptions of e-Science resourcesHammad Afzal
Hammad Afzal gave a seminar at the National University of Sciences and Technology in Islamabad about developing a literature-based framework for semantic descriptions of e-Science resources. He discussed using natural language processing techniques to automatically generate semantic profiles of bioinformatics resources by extracting information from relevant scientific literature. His approach involved building a controlled vocabulary from literature and then mining literature to find semantic descriptions of resources.
The document describes a framework for biological relation extraction using biomedical ontologies and text mining. It discusses introducing biomedical text mining and outlines the problem, motivation, and challenges. It then presents the overall system components and architecture, including searching/browsing, Swanson's algorithm, protein-protein interactions, and gene clustering applications. The framework concept issues, design issues, sequence diagram, and database are also covered at a high level.
1) The document describes a feature extraction program developed to analyze gene expression data from the Gene Expression Omnibus (GEO) database.
2) The program was tested on human transcription factor expression data sets and was able to successfully extract gene expression information.
3) Analyzing specific gene sets from GEO files had previously been a labor-intensive task, but the object-oriented program streamlines this process using C and Perl programming languages.
Bioinformatics is the application of computational tools and techniques to analyze and interpret biological data. It involves the development of these tools and databases, as well as their application to better understand biological systems and functions at the molecular level through analysis of genetic sequences, protein structures, and more. The goal is to gain a global understanding of cellular functions by analyzing genetic data as dictated by the central dogma of biology, and relating sequence information to protein functions and cellular processes.
1) The document presents a machine learning approach to automatically disambiguate proteins, genes, and mRNA mentioned in text by assigning them class labels.
2) It uses contextual features around the terms and three machine learning algorithms (Naive Bayes, decision trees, rule learning) trained on unlabeled text to perform the disambiguation.
3) Evaluation on biomedical articles shows the approach achieves accuracy comparable to other biomedical named entity disambiguation systems without requiring extensive human annotation.
BITS: Overview of important biological databases beyond sequencesBITS
Module 4 Other relevant biological data sources beyond sequences
Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
introduction,history scope and applications of
relation to other fields , bioinformatics,biological databases,computers internet,sequence development, and
introduction to sequence development and alignment
This document discusses using natural language processing techniques to analyze scientific papers and extract structured knowledge. It describes analyzing papers to recognize named entities, parse syntactic dependencies and semantic arguments, resolve coreferences, and extract relations. This extracted information can be used to generate structured abstracts, find related papers, perform content-based search, and discover new facts. As an example, it outlines a project that aims to read research papers to assemble and reason over causal models in cancer biology.
DextMP: Text mining for finding moonlighting proteinsPurdue University
Slides presented at ISMB 2017 in Prague on "DextMP: deep dive in text for predicting moonlighting proteins" by Ishita K. Khan, Mansurul Bhuiiyan, &. Daisuke Kihara. ISMB Proceeding talk, published on Bioinformatics: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx231
ERIC is a bioinformatics resource center funded by NIAID that focuses on integrating data from five enteropathogens and related organisms. It provides genome assemblies, annotations, comparative genomics tools, and text mining of literature to extract terms and relationships relevant to bacterial pathogenesis. Text mining allows quick summarization of publications and search across extracted data, facilitating faster conclusions than reading individual papers. All of ERIC's data and tools are freely available to the scientific community.
CDAO presentation.
The idea of the comparative analysis ontoloty has been presented worldwide, including: NESCent (USA), IGBMC (France), UFRJ (Brazil). Providing a semantic framework for evolutionary analysis in a high-throughtput way after the next and third generation sequencing is the way to approach evolutionary-based studies into genome-wide analysis. The darwinian core of reasoning also allows CDAO to be used with other entities.
This document discusses using inductive logic programming (ILP) for information extraction from biomedical texts to populate a database of mitochondrial genome variability data. It describes formulating the information extraction task, preprocessing texts using natural language processing tools, and using the ILP system ATRE to learn rules for extracting entities and their relations from texts to fill template slots in the database. The goal is to handle the complexity of biomedical language and scale information extraction to the increasing volume of literature.
Introduction to Ontologies for Environmental BiologyBarry Smith
1. The document introduces ontologies for environmental biology and discusses several disciplines that could benefit from their use, including GIS, ecology, environmental biology, and various "-omics" fields.
2. It describes what an ontology is and compares ontologies to legends for maps or diagrams, which allow integration and help humans and computers make sense of complex data. Ontologies provide standardized terminology and annotations.
3. The document outlines the Open Biomedical Ontologies (OBO) Foundry, a collection of interoperable reference ontologies for annotating biomedical data. Foundry ontologies include the Gene Ontology and other ontologies for molecules, cells, anatomical structures, and more. They are developed through consensus and share
A bio text mining workbench combines natural language processing tools and machine learning methods to extract biological information from texts. It includes tools for named entity recognition of genes, proteins and other entities. It also extracts biological events involving interactions between entities. An active learning approach is used to improve the named entity recognition by selecting the most informative texts for human annotation. This helps enhance the model with less human effort. The workbench provides interfaces for corpus management, annotation and evaluation of the named entity and event extraction systems.
Formal languages to map Genotype to Phenotype in Natural Genomesmadalladam
The document discusses using formal language theory to model genotype to phenotype (G2P) mappings. It proposes that G2P mappings are non-linear networks rather than linear pathways, and that formal languages could be used to formally represent these networks. Specifically, it suggests using concepts from computational linguistics like context-free grammars, attribute grammars, and semantic actions to parse genetic sequences and compute their phenotypic outcomes. As an example, it presents a context-free grammar for designing genetic constructs and computing their chemical dynamics using an attribute grammar. In summary, formal languages may provide a way to rigorously define the complex non-linear relationships between genotypes and resulting phenotypes.
The document discusses various methodologies for extracting information from biological literature, including entity recognition to identify genes/proteins mentioned in text, relationship extraction using co-occurrence and natural language processing techniques, and text categorization to identify specific relationship types. It provides examples of applying these methods to extract entity and relationship information from a sample sentence.
The simplified electron and muon model, Oscillating Spacetime: The Foundation...RitikBhardwaj56
Discover the Simplified Electron and Muon Model: A New Wave-Based Approach to Understanding Particles delves into a groundbreaking theory that presents electrons and muons as rotating soliton waves within oscillating spacetime. Geared towards students, researchers, and science buffs, this book breaks down complex ideas into simple explanations. It covers topics such as electron waves, temporal dynamics, and the implications of this model on particle physics. With clear illustrations and easy-to-follow explanations, readers will gain a new outlook on the universe's fundamental nature.
This document discusses the discovery that DNA previously thought to have no value ("junk DNA") may actually play important roles in gene expression regulation. Scientists investigated junk DNA in the model plant Arabidopsis thaliana and found short, linked patterns of DNA called pyknons. This suggests a universal genetic mechanism is at play across biology that is not yet fully understood. The discovery illustrates the connection between coding and non-coding DNA and that the term "junk DNA" may need reevaluation.
The document discusses using inductive logic programming (ILP) to perform information extraction (IE) on biomedical texts. It outlines applying ILP to learn recursive theories from annotated examples to fill slots in templates and extract entities. The learning strategy involves searching for theories for each concept in parallel before discovering dependencies between concepts. Text processing involves tokenization, POS tagging and domain dictionaries before description generation for ILP.
The document discusses using inductive logic programming (ILP) to perform information extraction (IE) on biomedical texts. It outlines applying ILP to learn recursive theories from annotated examples to fill slots in templates and extract entities. The learning strategy involves searching for theories for each concept in parallel before discovering dependencies between concepts. Text processing involves annotating documents before using features of tokens for ILP learning.
Literature Based Framework for Semantic Descriptions of e-Science resourcesHammad Afzal
Hammad Afzal gave a seminar at the National University of Sciences and Technology in Islamabad about developing a literature-based framework for semantic descriptions of e-Science resources. He discussed using natural language processing techniques to automatically generate semantic profiles of bioinformatics resources by extracting information from relevant scientific literature. His approach involved building a controlled vocabulary from literature and then mining literature to find semantic descriptions of resources.
The document describes a framework for biological relation extraction using biomedical ontologies and text mining. It discusses introducing biomedical text mining and outlines the problem, motivation, and challenges. It then presents the overall system components and architecture, including searching/browsing, Swanson's algorithm, protein-protein interactions, and gene clustering applications. The framework concept issues, design issues, sequence diagram, and database are also covered at a high level.
1) The document describes a feature extraction program developed to analyze gene expression data from the Gene Expression Omnibus (GEO) database.
2) The program was tested on human transcription factor expression data sets and was able to successfully extract gene expression information.
3) Analyzing specific gene sets from GEO files had previously been a labor-intensive task, but the object-oriented program streamlines this process using C and Perl programming languages.
Bioinformatics is the application of computational tools and techniques to analyze and interpret biological data. It involves the development of these tools and databases, as well as their application to better understand biological systems and functions at the molecular level through analysis of genetic sequences, protein structures, and more. The goal is to gain a global understanding of cellular functions by analyzing genetic data as dictated by the central dogma of biology, and relating sequence information to protein functions and cellular processes.
1) The document presents a machine learning approach to automatically disambiguate proteins, genes, and mRNA mentioned in text by assigning them class labels.
2) It uses contextual features around the terms and three machine learning algorithms (Naive Bayes, decision trees, rule learning) trained on unlabeled text to perform the disambiguation.
3) Evaluation on biomedical articles shows the approach achieves accuracy comparable to other biomedical named entity disambiguation systems without requiring extensive human annotation.
BITS: Overview of important biological databases beyond sequencesBITS
Module 4 Other relevant biological data sources beyond sequences
Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
introduction,history scope and applications of
relation to other fields , bioinformatics,biological databases,computers internet,sequence development, and
introduction to sequence development and alignment
This document discusses using natural language processing techniques to analyze scientific papers and extract structured knowledge. It describes analyzing papers to recognize named entities, parse syntactic dependencies and semantic arguments, resolve coreferences, and extract relations. This extracted information can be used to generate structured abstracts, find related papers, perform content-based search, and discover new facts. As an example, it outlines a project that aims to read research papers to assemble and reason over causal models in cancer biology.
DextMP: Text mining for finding moonlighting proteinsPurdue University
Slides presented at ISMB 2017 in Prague on "DextMP: deep dive in text for predicting moonlighting proteins" by Ishita K. Khan, Mansurul Bhuiiyan, &. Daisuke Kihara. ISMB Proceeding talk, published on Bioinformatics: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx231
ERIC is a bioinformatics resource center funded by NIAID that focuses on integrating data from five enteropathogens and related organisms. It provides genome assemblies, annotations, comparative genomics tools, and text mining of literature to extract terms and relationships relevant to bacterial pathogenesis. Text mining allows quick summarization of publications and search across extracted data, facilitating faster conclusions than reading individual papers. All of ERIC's data and tools are freely available to the scientific community.
CDAO presentation.
The idea of the comparative analysis ontoloty has been presented worldwide, including: NESCent (USA), IGBMC (France), UFRJ (Brazil). Providing a semantic framework for evolutionary analysis in a high-throughtput way after the next and third generation sequencing is the way to approach evolutionary-based studies into genome-wide analysis. The darwinian core of reasoning also allows CDAO to be used with other entities.
This document discusses using inductive logic programming (ILP) for information extraction from biomedical texts to populate a database of mitochondrial genome variability data. It describes formulating the information extraction task, preprocessing texts using natural language processing tools, and using the ILP system ATRE to learn rules for extracting entities and their relations from texts to fill template slots in the database. The goal is to handle the complexity of biomedical language and scale information extraction to the increasing volume of literature.
Introduction to Ontologies for Environmental BiologyBarry Smith
1. The document introduces ontologies for environmental biology and discusses several disciplines that could benefit from their use, including GIS, ecology, environmental biology, and various "-omics" fields.
2. It describes what an ontology is and compares ontologies to legends for maps or diagrams, which allow integration and help humans and computers make sense of complex data. Ontologies provide standardized terminology and annotations.
3. The document outlines the Open Biomedical Ontologies (OBO) Foundry, a collection of interoperable reference ontologies for annotating biomedical data. Foundry ontologies include the Gene Ontology and other ontologies for molecules, cells, anatomical structures, and more. They are developed through consensus and share
A bio text mining workbench combines natural language processing tools and machine learning methods to extract biological information from texts. It includes tools for named entity recognition of genes, proteins and other entities. It also extracts biological events involving interactions between entities. An active learning approach is used to improve the named entity recognition by selecting the most informative texts for human annotation. This helps enhance the model with less human effort. The workbench provides interfaces for corpus management, annotation and evaluation of the named entity and event extraction systems.
Formal languages to map Genotype to Phenotype in Natural Genomesmadalladam
The document discusses using formal language theory to model genotype to phenotype (G2P) mappings. It proposes that G2P mappings are non-linear networks rather than linear pathways, and that formal languages could be used to formally represent these networks. Specifically, it suggests using concepts from computational linguistics like context-free grammars, attribute grammars, and semantic actions to parse genetic sequences and compute their phenotypic outcomes. As an example, it presents a context-free grammar for designing genetic constructs and computing their chemical dynamics using an attribute grammar. In summary, formal languages may provide a way to rigorously define the complex non-linear relationships between genotypes and resulting phenotypes.
The document discusses various methodologies for extracting information from biological literature, including entity recognition to identify genes/proteins mentioned in text, relationship extraction using co-occurrence and natural language processing techniques, and text categorization to identify specific relationship types. It provides examples of applying these methods to extract entity and relationship information from a sample sentence.
Similar to Introduction to BioNLP and its applications (20)
The simplified electron and muon model, Oscillating Spacetime: The Foundation...RitikBhardwaj56
Discover the Simplified Electron and Muon Model: A New Wave-Based Approach to Understanding Particles delves into a groundbreaking theory that presents electrons and muons as rotating soliton waves within oscillating spacetime. Geared towards students, researchers, and science buffs, this book breaks down complex ideas into simple explanations. It covers topics such as electron waves, temporal dynamics, and the implications of this model on particle physics. With clear illustrations and easy-to-follow explanations, readers will gain a new outlook on the universe's fundamental nature.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
Main Java[All of the Base Concepts}.docxadhitya5119
This is part 1 of my Java Learning Journey. This Contains Custom methods, classes, constructors, packages, multithreading , try- catch block, finally block and more.
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
2. What is BioNLP
Biomedical Natural Language Processing,
Biomedical Text Mining
Knowledge extraction from text in
biomedical/clinical field
Objective: Facilitate understanding and
decision making in computational way
Subfield of bioinformatics
Common: sequence data
Difference: alphabet
3. Data
Data is generated by human
• Biomedical Literature
• Clinical Notes
• Textual Description in Knowledge
Database
6. NER Methods
Dictionary-based
ML-based
Dictionary-based
1. dictionary construction
2. text string-based matching
Deep Learning Model (PhenoTagger[1])
[1] Luo, L., Yan, S., Lai, P. T., Veltri, D., Oler, A., Xirasagar, S., ... & Lu, Z.
(2021). PhenoTagger: a hybrid method for phenotype concept recognition using
human phenotype ontology. Bioinformatics, 37(13), 1884-1890.
7. Example of Document Classification
Cancer Hallmark Annotation[2]
… Genes that were overexpressed in OM3 included
oncogenes, cell cycle regulators , and those involved in
signal transduction, whereas genes for DNA repair
enzymes and inhibitors of transformation and metastasis
were suppressed. In arsenic-treated cells, multiple DNA
repair proteins were overexpressed. … [3]
[2] Yan, Shankai, and Ka-Chun Wong. "Elucidating high-dimensional cancer
hallmark annotation via enriched ontology." Journal of biomedical informatics 73
(2017): 84-94.
[3] Bae, Dong-Soon, et al. "Characterization of gene expression changes
associated with MNNG, arsenic, or metal mixture treatment in human
keratinocytes: application of cDNA microarray technology." Environmental Health
Perspectives 110.suppl 6 (2002): 931-941.
Activating invasion and metastasis (IM)
Genomic instability and mutation (GI)
Sustaining proliferative signaling (PS)
Evading growth suppressors (GS)
9. Example of IE
Bacteria Gene Interaction Extraction[4]
These studies indicate that spo0H is transcribed by
the major vegetative RNA polymerase, E sigma A. [5]
Gene
Polymerase
Complex
Polymerase
Complex
Transcription
ActionTarget
TranscriptionBy
[4] Yan, Shankai, and Ka-Chun Wong. "Context awareness and embedding for
biomedical event extraction." Bioinformatics 36.2 (2020): 637-643.
[5] Weir, J; Predich, M; Dubnau, E; Nair, G; Smith, I. (1991). Regulation of
spo0H, a gene coding for the Bacillus subtilis sigma H factor. J. Bacteriol. vol. 173
(2) p. 521-529
10. • Extract lexical
features from text
• Construct word
matrix, edge matrix
and label matrices.
Preprocessing
• Multiclass
• Trigger
Classification
• Edge Classification
Classifiers Training
• Annotate the triggers
• Induce relation records
Post-processing
TRIGGER-BASED APPROACH
12. W0 W1 … Wa-1 Wa Wa+1 … Wb-1 Wb Wb+1 … Wn-1Wn
W0 … Wa X0
Wn … Wa X1
Word
Embedding
Trained on
PubMed
Literature
LSTM
LSTM
Concate MLP Argument
Argument
Embedding
Pretrained
Argument
Detection
Model
W0 … Wa X0
Wn … Wa X1
Weighted Binary
Cross-entropy
Labels Loss Function
Deep Learning Approach [4]
W0 … Wn Input word stream
X0
Input variable
Neural Network
Layer
[4] Yan, Shankai, and Ka-Chun Wong. "Context awareness and embedding for
biomedical event extraction." Bioinformatics 36.2 (2020): 637-643.
13. Difference from
Conventional
NLP Tasks
Contains Biomedical Symbols,
Punctuations, Terms
The writers are from the
biomedical/clinical expertise group
Different Corpus /Context
14. Leaderboard
GLUE[6]
BLUE[7]
[6] Wang, Alex, et al. "GLUE: A multi-task
benchmark and analysis platform for natural
language understanding." arXiv preprint
arXiv:1804.07461 (2018).
[7] Peng, Yifan, Shankai Yan, and Zhiyong Lu.
"Transfer learning in biomedical natural language
processing: an evaluation of BERT and ELMo on ten
benchmarking datasets." arXiv preprint
arXiv:1906.05474 (2019).
16. Traditional Machine Learning Methods
Handcrafted Features
Tokens
Part-of-speech
Entity type
Grammatical function tag
Distance in the parse tree
Classifical ML models
Support Vector Machine (SVM)
Bayesian Classifier
Multi-layer Perception (MLP)
Ensemble Classifiers (Random
Forest, Extra Trees, etc.)
17. SOTA Deep Learning Methods
Word Embedding / Language Model (Pre-trained)
Sequence to Vector
Encoder:
– Bag of Embedding
– RNN/LSTM/GRU
– CNN
Classifier:
– Linear
– MLP
18. Packages
NLTK: simple text processing
SpaCy: Pipeline
Stanza: Pipeline
Gensim: Word/Doc Embedding
Hugging Face (Transformers): SOTA Model
20. DATA DESCRIPTION IN GEO
Text
MetaTable
Text
Raw Data
DataTable
Text
Raw Data
Curated Collection of Samples Expression Measurements of DataSet
2,393,216 95,138