This presentation describes research on extracting and analysing in silico biological methods from the scientific literature. The work produced bioNerDS, a tool that automatically extracts mentions of databases and software from papers. bioNerDS was run on about 230,000 open-access full-text articles and extracted about 1.8 million mentions, which were analysed for patterns of resource usage over time and between journals. Challenges include ambiguous and variable resource names, and the fact that the order in which resources are mentioned does not necessarily imply the order in which they were used. The goal is to provide a way to extract "best practice" methods for any resource-based domain by mining its literature.
University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature
1. Extraction and Representation of in silico Biological Methods from the Literature
Geraint Duck
Supervisors: Robert Stevens, Goran Nenadic and David Robertson
Advisor: Joshua Knowles
School of Computer Science, University of Manchester
2. Importance of Method in Science
• Understanding
  – Key part of research, central to science
  – Reproducibility and replication
  – What? Why? Where? How? When?
  – Extension
• Advise/evaluate
  – “Current Approach”
  – “Best Practice”
3. Background
• In silico: performed on a computer, or through computer simulation
• Bioinformatics is a resource-focused domain
  – Numerous resources appearing
  – Literature is growing rapidly
• Resource availability and usage is central to biological research
• Current attempts often manually curated and/or incomplete
4. The Method to Obtain a Method
1. Extraction
  – Automatically extract resource and task mentions from the bioinformatics literature
    • This presentation focuses on this step
2. Representation and Analysis
  – Evaluate the extracted mentions for patterns of representation
3. Exploration
  – Provide a means of exploring the methods extracted to aid other research/researchers
5. Key Hypothesis: Resource ordering implies method
• An analogy – baking a cake:
  – Ingredients: butter, eggs, flour, sugar, etc…
  – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30mins…
6. Key Hypothesis: Resource ordering implies method
• An analogy – baking a cake:
  – Ingredients: butter, eggs, flour, sugar, etc…
  – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30mins…
Key: Resource; Task
7. Example: Lagerström et al. (2006)
… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package …
[excerpt removed]
… branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
8. Example: Lagerström et al. (2006)
… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package …
[excerpt removed]
… branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
Key: Resource; Task; Potential Challenge
10. Example: Lagerström et al. (2006)
Key: Resource, task
• GenBank
• BLAT, aligned
• BLAST, searched
• ClustalW, aligned
• SEQBOOT, bootstrapped (Phylip)
• TreePuzzle, estimated
• ClustalW, aligned
• infoalign, scored (EMBOSS)
• MiniTab, statistics
• MS Excel, graphs plotted
• MiniTab, graphs plotted
Stages: Sequence Alignment; Tree Construction; Sequence and Tree Analysis; Result Visualisation
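The resource and task pairs listed on this slide can be held as an ordered list, which is one simple way to represent an extracted method. The sketch below is an illustration only: the `Step` class and `resource_order` helper are hypothetical names, not part of bioNerDS, and the steps are not assigned to stages because that mapping is not recoverable from the slide text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One resource mention with its associated task (hypothetical structure)."""
    resource: str
    task: Optional[str] = None
    package: Optional[str] = None


# Mentions in the order they are listed on the slide.
METHOD: List[Step] = [
    Step("GenBank"),
    Step("BLAT", "aligned"),
    Step("BLAST", "searched"),
    Step("ClustalW", "aligned"),
    Step("SEQBOOT", "bootstrapped", package="Phylip"),
    Step("TreePuzzle", "estimated"),
    Step("ClustalW", "aligned"),
    Step("infoalign", "scored", package="EMBOSS"),
    Step("MiniTab", "statistics"),
    Step("MS Excel", "graphs plotted"),
    Step("MiniTab", "graphs plotted"),
]


def resource_order(steps: List[Step]) -> List[str]:
    """Return resources in first-mention order, dropping later repeats."""
    seen: List[str] = []
    for step in steps:
        if step.resource not in seen:
            seen.append(step.resource)
    return seen


if __name__ == "__main__":
    print(resource_order(METHOD))
```

Keeping the full list (with repeats) preserves the ordering that the key hypothesis relies on, while `resource_order` gives a compact view of which resources the method uses.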
11. Example…
• Multiple methods
  – Usage counts
  – Recentness of use
  – “best-practice”
12. Challenges - Ambiguity
• leg
• white
• cab
• HIV
  – Human immunodeficiency virus
  – Human immunovirus
• analysis
• Network
• graph
• DIP
  – distal interphalangeal
  – Database of Interacting Proteins
13. Challenges - Variability
• Orthographics
  – Swiss Prot
  – SWISS-PROT
  – SwissProt
• Misspellings and typos
  – One paper, same resource, spelt 3 different ways
• Abbreviations
  – Different authors can use different acronyms for the same thing
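Orthographic variants like the three SwissProt spellings above can be collapsed with a simple normalisation key. This is a minimal sketch, assuming that case, spaces, hyphens and underscores are the only differences; `normalise_name` is a hypothetical helper, not the bioNerDS approach, and it does nothing for misspellings or differing acronyms.

```python
import re


def normalise_name(name: str) -> str:
    """Lower-case a resource name and drop spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", name).lower()


if __name__ == "__main__":
    variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
    # All three variants map to the same key.
    print({v: normalise_name(v) for v in variants})
```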
14. Name Composition
• Majority are single nouns
  – includes acronyms
• 6% lowercase common nouns
  – affy, bioconductor
• A few contained numbers
  – S4, t2prhd
• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…
15. Name Composition
• Longest Names (most tokens)
  – Corpus: 5 – Gene Expression Profile Analysis Suite
  – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences
• Evaluated token frequencies within our dictionary
  – Long-tail curve
  – 87% used only once
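The token counts quoted above follow from whitespace tokenisation, and a token-frequency check of the kind described can be sketched in a few lines. This assumes whitespace splitting and case-insensitive counting, and it reads "used only once" as the share of distinct tokens that occur exactly once; both helpers and the toy name list are hypothetical.

```python
from collections import Counter
from typing import Iterable


def token_count(name: str) -> int:
    """Number of whitespace-separated tokens in a resource name."""
    return len(name.split())


def fraction_used_once(names: Iterable[str]) -> float:
    """Share of distinct (lower-cased) tokens that occur exactly once."""
    counts = Counter(tok.lower() for name in names for tok in name.split())
    if not counts:
        return 0.0
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


if __name__ == "__main__":
    print(token_count("Gene Expression Profile Analysis Suite"))
    # Toy dictionary, for illustration only.
    toy = ["Gene Expression Omnibus", "Gene Ontology", "Protein Data Bank"]
    print(fraction_used_once(toy))
```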
17. Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)
18. bioNerDS
• Automatically matches database and software names in the literature
  – Uses dictionary, rules and clues
• F-scores between 63 and 91%
  – Mixed results depending on corpus
  – Issues of multiple mentions of a single resource in one paper
  – Ambiguity and variability…
http://bionerds.sourceforge.net/
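To make the dictionary part of this concrete, the sketch below does exact, case-sensitive dictionary lookup on whole words. It is a toy illustration under stated assumptions, not the bioNerDS implementation: the four-entry `DICTIONARY` and the `find_mentions` function are hypothetical, and the rules and clues mentioned on the slide are not modelled.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical, tiny dictionary of resource names for illustration.
DICTIONARY = {"BLAST", "ClustalW", "GenBank", "EMBOSS"}


def find_mentions(text: str, dictionary: Iterable[str] = DICTIONARY) -> List[Tuple[int, str]]:
    """Return (start offset, name) for each whole-word dictionary hit in text."""
    # Longest names first, so a longer entry is tried before a shorter one it starts with.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "Sequences from GenBank were aligned using ClustalW after BLAST searches."
    print(find_mentions(sentence))
```

A lookup like this misses verb-like uses such as "BLASTed" and cannot tell an ambiguous name such as "DIP" from its non-resource sense, which is presumably where additional rules and contextual clues are needed.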
20. Preliminary Analysis of Resource Usage
• Used bioNerDS to extract name mentions from two journals:
  – Genome Biology
  – BMC Bioinformatics
• Analysed differences
21. bioNerDS: Results
• Over 36,000 mentions in BMC Bioinformatics
• Over 15,000 mentions in Genome Biology.
• 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention.
• The top 5 mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB.
• The general trend across both journals has most major resources declining in usage
23. bioNerDS: Full PMC Set
• Run on full open-access PMC set
  – ~230,000 full-text articles
  – ~1000 different journals
  – Extracted ~1.8M mentions
• Method?
• Method fingerprints
• Trying to extract (data-mine):
  – Ordering
  – Patterns
  – Co-occurrence
  – Relationships
  – Association rules
  – Frequent subsets
  – “Networks”
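Of the items listed above, co-occurrence is the simplest to sketch: count how many papers mention each pair of resources together. The code below is a minimal illustration under that assumption; the `cooccurrence` function and the three-paper input are hypothetical and do not reflect how the PMC analysis was actually carried out.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence(papers: Iterable[List[str]]) -> Counter:
    """Count, for each unordered resource pair, the papers mentioning both."""
    counts: Counter = Counter()
    for mentions in papers:
        # De-duplicate within a paper, then sort so each pair has one canonical form.
        for pair in combinations(sorted(set(mentions)), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical per-paper mention lists.
    papers = [
        ["BLAST", "GenBank", "BLAST"],
        ["BLAST", "ClustalW"],
        ["GenBank", "BLAST"],
    ]
    print(cooccurrence(papers).most_common())
```

Because pairs are unordered, this captures association but not ordering; an ordering analysis would need to keep the mention sequence within each paper.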
24. Method Analysis and Exploration
• Mining “best-practice”: Metrics
  – Most common
  – Newest
  – Who uses it
  – What resources is it comprised of
• Challenges
  – Scientific discourse – provenance information
  – Mention order does not imply order of use
• Clustering and associations
• Fingerprints
25. Conclusion
• Literature mining bioinformatics in silico methods
• Developed bioNerDS: automated resource name extraction
• Extracting and analysing patterns of resource usage
  – Full PMC corpus
• Provided a way to extract method for any resource based domain
  – Applied this to bioinformatics
27. Resource Mentions per Journal
Journal                                                        Total Articles   Total Mentions   Ratio
Nucleic Acids Research                                         7,192            200,339          27.8558
PLoS One                                                       15,791           168,624          10.6785
BMC Bioinformatics                                             3,982            149,668          37.5861
BMC Genomics                                                   3,203            90,396           28.2223
Genome Biology                                                 2,321            48,976           21.1012
Acta Crystallographica. Section E, Structure Reports Online    11,834           41,383           3.497
BMC Evolutionary Biology                                       1,570            31,222           19.8866
PLoS Computational Biology                                     1,613            30,185           18.7136
PLoS Genetics                                                  1,876            29,734           15.8497
PLoS Pathogens                                                 1,691            20,661           12.2182
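The Ratio column is total mentions divided by total articles, that is, the mean number of resource mentions per article. The short sketch below recomputes it for three rows of the table; the `mentions_per_article` helper is a hypothetical name used only for illustration.

```python
# (journal, total articles, total mentions), copied from the table above.
ROWS = [
    ("Nucleic Acids Research", 7192, 200339),
    ("PLoS One", 15791, 168624),
    ("BMC Bioinformatics", 3982, 149668),
]


def mentions_per_article(articles: int, mentions: int) -> float:
    """Mean number of resource mentions per article."""
    return mentions / articles


if __name__ == "__main__":
    for journal, articles, mentions in ROWS:
        print(f"{journal}: {mentions_per_article(articles, mentions):.4f}")
```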
28. Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Evaluating NER
  – True positives, false positives, false negatives
  – Precision: tp / (tp + fp)
  – Recall: tp / (tp + fn)
  – F-score: 2 × P × R / (P + R)
29. Named Entity Recognition (NER)
• Evaluating NER
  – True positives, false positives, false negatives
    • tp: Correct
    • fp: Returned incorrect
    • fn: Missed
  – Precision: tp / (tp + fp)
    • How accurate are the results we obtained
  – Recall: tp / (tp + fn)
    • How many of the total correct results did we obtain
  – F-score: 2 × P × R / (P + R)
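The three measures above translate directly into code. This is a minimal sketch of the formulas as stated on the slide; returning 0.0 when a denominator is zero is an assumption made here for convenience, and the counts in the usage example are invented.

```python
def precision(tp: int, fp: int) -> float:
    """tp / (tp + fp): how accurate the returned results are."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """tp / (tp + fn): how many of the correct results were returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """2 x P x R / (P + R): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Invented counts: 8 correct, 2 returned incorrectly, 8 missed.
    p = precision(tp=8, fp=2)
    r = recall(tp=8, fn=8)
    print(p, r, f_score(p, r))
```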
30. Named Entity Recognition (NER)
• Evaluating NER
  – True positives, false positives, false negatives
  – Precision: tp / (tp + fp)
  – Recall: tp / (tp + fn)
  – F-score: 2 × P × R / (P + R)
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)