A solution for finding an interesting set of publications, based on enrichment analysis of the relationship between genes and MeSH terms using chi-square statistics. Built on publicly available datasets from NCBI using Spring Boot, AWS Batch, Docker, ECR, ECS, Spark, React, and Postgres.
4. Orthologous Genes
● Orthologs are sequences in different species that share a common ancestor
● Non-human species are used for genetic research, often before research is done on humans; effects are possibly comparable to those in humans
● Key starting point for the literature review
[Figure label: Ancestral Gene]
5. Typical Research Process
01 Gene Query: Researcher enters gene symbol on NCBI Gene Database
02 Identify Gene Record: Filter for species of interest
03 Identify Publications: Link out to PubMed and read publications for gene
04 Identify MeSH Terms & Co-occurring Genes: Compile a list of MeSH terms and genes that are represented in publications
05 Extrapolate: Use MeSH annotations to identify diseases; use other genes to expand analysis
6. Research Scale
● 30,000,000+ total publications in PubMed
● 27,044,830 total gene IDs on NCBI Gene
● 29,351 total MeSH terms
● BRCA1
○ 322 gene IDs; 16,892 publications
● GRM5
○ 352 gene IDs; 1,035 publications
● MeSH terms link genes to diseases
○ ~10-15 per publication: possibly thousands of disease associations per gene
7. Winnow Connects the Dots
Connect Genes to Diseases
Genes: Genetic information for various species is stored, including gene ID, species, symbol, description, etc.
MeSH Terms ~ Diseases: Medical Subject Headings (MeSH terms) are standardized annotations assigned to PubMed articles; they give insight into diseases
Publications: Publications are annotated with MeSH terms and linked to genes via the ‘gene2pubmed’ dataset
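The linkage on this slide amounts to a join between gene-publication links and publication-MeSH annotations. A minimal Python sketch, assuming gene2pubmed rows of the form (tax_id, gene_id, pubmed_id) and a hypothetical dictionary from PubMed ID to MeSH terms; the function name and the toy data are illustrative and are not taken from Winnow's code:

```python
from collections import Counter, defaultdict


def mesh_counts_by_gene(gene2pubmed_rows, mesh_by_pmid):
    """Count, per gene, how many linked publications carry each MeSH term.

    gene2pubmed_rows: iterable of (tax_id, gene_id, pubmed_id) tuples
        (assumed layout of the gene2pubmed dataset).
    mesh_by_pmid: dict mapping pubmed_id -> set of MeSH terms
        (hypothetical input shape).
    """
    counts = defaultdict(Counter)
    seen = set()
    for _tax_id, gene_id, pmid in gene2pubmed_rows:
        if (gene_id, pmid) in seen:  # skip duplicate gene-publication links
            continue
        seen.add((gene_id, pmid))
        # Each MeSH term on the publication counts once for this gene.
        counts[gene_id].update(mesh_by_pmid.get(pmid, ()))
    return counts


# Toy data for illustration only; PubMed IDs and annotations are made up.
rows = [(9606, 672, 1001), (9606, 672, 1002), (10090, 12189, 1002)]
mesh = {1001: {"Breast Neoplasms"}, 1002: {"Breast Neoplasms", "DNA Repair"}}
result = mesh_counts_by_gene(rows, mesh)
```

The per-gene counts produced this way are the kind of input an enrichment test would consume; publications without MeSH annotations simply contribute nothing.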
10. Winnow
A web application to
● streamline the literature review process
● winnow down the set of useful relationships
Goal
● Automate all aspects: raw data → enrichment analysis of NCBI Gene and PubMed datasets
17. Development Process
● Initial customer meeting, and once more with each milestone
● User requirements as user stories in Trello
● Planning poker for estimation and re-estimation
● Agile, one-week sprints, ~25 points/week
● Twice-weekly standing meetings
○ Sprint retrospective and planning on Saturdays
○ TA meeting on Wednesdays
● Stand-up via Slack 4x weekly
● Pair programming Zoom meetings as needed
19. Notable Challenges
Data Ingestion Transformations
Custom marshalling and conversion of attributes to ingest relationship and varied data formats
Data Computation
Imprecision of computing large factorials led to the use of Spark and Pearson’s chi-squared test
MeSH Term Tree Recursion
Initial race conditions, difficulty triggering page refresh
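For the Data Computation challenge, Pearson's chi-squared statistic for one gene/MeSH-term pair can be sketched on a 2x2 contingency table. This is a minimal single-machine Python sketch, not the project's Spark job; the function name, the table layout, and the example counts are assumptions for illustration:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson's chi-squared statistic for a 2x2 contingency table.

    Assumed layout (publication counts):
                      has MeSH term   lacks MeSH term
      linked to gene        a                b
      other genes           c                d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    # Closed form for the 2x2 case: n * (ad - bc)^2 / (product of marginals).
    # Only products and one division are needed, so no factorials appear.
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts, for illustration only.
statistic = chi_square_2x2(10, 20, 30, 40)
```

A larger statistic suggests the MeSH term co-occurs with the gene more (or less) often than independence would predict; turning it into a p-value requires the chi-squared distribution with one degree of freedom.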
20. Tools
Aspect                 Choice
Environment Parity     Docker, Test Data, Gradle
Version Control        GitHub
Sprint Management      Trello
Online Collaboration   Slack
Static Code Test       SpotBugs
Code Coverage          JaCoCo
UI Testing             Jest/Enzyme
29. Winnow
● Links genes to diseases in a curated subset of approximately 900,000 PubMed articles related to the human genome and its orthologs
● Identifies genes co-occurring in publications
● Helps researchers expedite their literature reviews
● Helps researchers develop new hypotheses more quickly
● Allows users to export the results of their searches
30. THANK YOU
Questions?