R PACKAGES UNPACKED
R-PACKAGE USE, DEVELOPMENT, AND A
FORAY INTO BIOCONDUCTOR PACKAGES
SHANA WHITE
PHD CANDIDATE: BIOSTATISTICS + BIG DATA TRACK
PRE-DOCTORAL FELLOW: MECEH
BIOSTATISTICS SEMINAR: SEPTEMBER 1, 2017
OVERVIEW
• Motivating example: KEGGlincs
• Background info (graphs, biological networks)
• Challenge – edge-focused pathway annotation
• R Packages
• What [are they exactly]?
• How [do you navigate them]?
• Where [do you find or publish them]?
NETWORK GRAPHS & DATA: QUICK PRIMER
• Network/Network Graph =
• Nodes (entities)
• People, cities, genes
• Edges (relationships between nodes)
• Similar heritage (people), distance (cities),
interaction (genes)
• Size/shape/color of nodes & edges
formatted to summarize important
characteristics, for example…
Directionality
From → To, Source → Target
Strength of Relationship
Edge weight represents edge-related
variable
Ex) Number of relatives in common, distance
Nature of Relationship
Edge color represents edge-
related variable
Ex) activation or repression (genes)
Node Attribute
Node color represents node-related-variable
Ex) male or female (people), high expression or low
expression (genes)
Node Attribute
Node size represents node-related-variable
Ex) number of residents (cities), graph attribute
(connectedness)
NETWORK GRAPH PRIMER CNTD.: DATA
Adjacency Matrix (columns = Source, rows = Target)
     A  B  C  D  F  G
A    0  1  1  1  0  0
B    1  0  0  0  0  0
C    1  0  0  1  1  1
D    1  0  1  0  0  0
F    0  0  1  0  0  0
G    0  0  1  0  0  0

Edge Information (EDGE DATA)
source  target  weight  color
A       B       8       red
A       C       6       blue
A       D       3       red
C       F       5       blue
C       D       9       red
C       G       1       blue
D       E       4       red
D       G       9       blue
B       H       2       red
B       I       3       blue
B       J       5       red
I       J       7       blue
A       H       7       red
H       E       1       blue
E       K       8       red

Node Information (NODE DATA)
Node  weight  color
A     5       orange
B     2       green
C     7       orange
D     3       green
F     2       green
G     2       orange

• Graph image visually summarizes data
• Data objects [as matrices] that ‘encode’ the topology can be used in analyses
• Ex: Find subnetworks, measure differences in topology between networks, highlight important nodes
• Visualization & analysis of graph objects
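As a rough illustration of how an edge table 'encodes' topology, the sketch below rebuilds the slide's adjacency matrix from the first six rows of the edge table (A-B, A-C, A-D, C-F, C-D, C-G) using base R only. The choice to mirror each edge (an undirected graph) is an assumption made here because the slide's matrix is symmetric.

```r
# Edge list: the first six rows of the slide's edge table (base R only).
edges <- data.frame(
  source = c("A", "A", "A", "C", "C", "C"),
  target = c("B", "C", "D", "F", "D", "G"),
  weight = c(8, 6, 3, 5, 9, 1),
  stringsAsFactors = FALSE
)

# Node set in a fixed order, used for both rows and columns.
nodes <- sort(unique(c(edges$source, edges$target)))

# Start from an all-zero matrix and switch on one cell per edge.
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$source, edges$target)] <- 1L
# Assumption: edges are treated as undirected, so each one is mirrored.
adj[cbind(edges$target, edges$source)] <- 1L

# A simple graph attribute: number of neighbours per node.
degree <- rowSums(adj)
print(adj)
print(degree)
```

The same matrix could then feed an analysis step (for example, ranking nodes by `degree` to highlight well-connected nodes).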
BIOLOGICAL NETWORKS AND BIOINFORMATICS
MOLECULAR NETWORKS IN THE AGE OF ‘OMICS’
The ‘big [omics] data’
sets essential to
bioinformatics
research are typically
generated/analyzed at
the molecular level
Awesome image from: http://www.bioregulatory-systems-medicine.com/en/brsm-model/autoregulation-of-biological-networks
Signaling pathway image: http://openi.nlm.nih.gov/detailedresult.php?img=2993235_mplantssq046f07_4c&req=4
Molecules + How they interact → Nodes + Edges → Network
Biological [Molecular] Network
Nodes = Genes, proteins, chemical compounds, drugs
Edges = Relationships/Interactions between molecular
entities
Signaling Pathway: Directed Edges
Molecular networks are constructs to filter the ‘signal from
the noise’
Any given network is really a subnetwork of the entire
system
(i.e. one cellular process among many happening
simultaneously)
Goal of biological network analysis - “Distill the signal from
the noise” by combining ‘omics’ data and pathway topology to
EX) Gene Signaling
Pathway
KEGG: POPULAR REPOSITORY FOR BIOLOGICAL
PATHWAYS
• Over 300 [signaling]
pathways
• Graph image summarizes
data from multiple
sources
• Data objects [as matrices]
that ‘encode’ the topology
can be used in analyses
• *After parsing from
“KGML” (KEGG mark-up
language) file
• Overlaying data on
pathway nodes for
analysis and visualization
has been addressed both
by KEGG and other
Nodes: Genes/Proteins
Edges: Relationship between
genes/proteins
MOTIVATING EXAMPLE: KEGGLINCS
• Primary Challenge [handed down to me from Dr. Medvedovic]:
• Map gene-gene relationship data generated from the LINCS L1000 data
set to the edges of KEGG pathways
LINCS L1000 KNOCK-OUT DATA
• Cancer Cell Lines (ex: MCF7, PC3, HA1E) x 9,000+ Gene Perturbations
• LINCS = Library of Integrated Network-Based Cellular Signatures
• http://www.lincsproject.org/LINCS/tools
L1000 DATA COLLECTION (BRIEFLY!)
• Unperturbed samples vs. functional knock-out (KO) samples
• Gene perturbation via shRNA: disrupt conversion of mRNA into functional protein
• Generate gene-expression data for 1000 ‘Landmark’ genes
• Changes in cellular regulation in absence of functional gene
• Gene signature: top 100 UP/DOWN-regulated genes
LINCS L1000 KO DATA = one signature s_{i,j} per (cell line c_i, perturbation p_j) pair:
Cell Line   Perturbation   Signature
c_1         p_1            s_{1,1}
c_2         p_1            s_{2,1}
⋮           p_1            ⋮
c_{n-1}     p_1            s_{n-1,1}
c_n         p_1            s_{n,1}
c_1         p_2            s_{1,2}
c_2         p_2            s_{2,2}
⋮           p_2            ⋮
c_{n-1}     p_2            s_{n-1,2}
c_n         p_2            s_{n,2}
⋮           ⋮              ⋮
c_1         p_{m-1}        s_{1,m-1}
c_2         p_{m-1}        s_{2,m-1}
⋮           p_{m-1}        ⋮
c_{n-1}     p_{m-1}        s_{n-1,m-1}
c_n         p_{m-1}        s_{n,m-1}
c_1         p_m            s_{1,m}
c_2         p_m            s_{2,m}
⋮           p_m            ⋮
c_{n-1}     p_m            s_{n-1,m}
c_n         p_m            s_{n,m}
LINCS L1000 OVERLAP DATA
• Within a given cell line, how concordant are changes in
regulation when different genes are knocked out?
KO_1   KO_2   UP_in_Common   DOWN_in_Common   DOWN_UP   UP_DOWN
MTOR   AKT1   25             17               2         3
Confusion Matrix
25 (UP_in_Common, concordant)    2 (DOWN_UP, discordant)
3 (UP_DOWN, discordant)          17 (DOWN_in_Common, concordant)
• Use modified Fisher’s test to calculate summary score and p-value
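To make the scoring step concrete, here is a minimal sketch that runs a test on the MTOR/AKT1 confusion matrix from the slide. It uses the standard `fisher.test()` from base R's stats package as a stand-in; the 'modified' Fisher's test mentioned on the slide is not specified here, so the numbers this produces should not be read as the package's actual scores.

```r
# Confusion matrix from the slide (MTOR vs. AKT1 knock-outs):
# diagonal cells are concordant (UP/UP = 25, DOWN/DOWN = 17),
# off-diagonal cells are discordant (DOWN_UP = 2, UP_DOWN = 3).
overlap <- matrix(c(25,  2,
                     3, 17),
                  nrow = 2, byrow = TRUE)

# Standard (unmodified) Fisher's exact test; a stand-in for the
# modified test used to build the summary score.
ft <- fisher.test(overlap)

# Plain sample odds ratio: (concordant product) / (discordant product).
sample_or <- (overlap[1, 1] * overlap[2, 2]) / (overlap[1, 2] * overlap[2, 1])

print(ft$estimate)  # conditional maximum-likelihood odds ratio
print(ft$p.value)
print(sample_or)
```

An odds ratio well above 1 indicates that the two knock-outs mostly move genes in the same direction, which is the sense of 'concordance' used on the slide.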
EXAMPLE OF EDGE DATA TO BE MAPPED TO KEGG
gene 1   gene 2    score          p-value
DEPTOR   MTOR      -0.285755049   0.398477502
IL1RAP   MTOR       0.964732247   0.577905732
MTOR     IGBP1      2.123934037   0.000045456
MTOR     RPS6KB1    0.161554125   0.625360024
MTOR     RPS6KB1   -0.192375372   0.289930272
MTOR     RPS6KB2    0.732379001   0.448532973
PIK3R2   MTOR       0.701104025   0.543121475
RRAGC    MTOR       0.109984056   0.488447135
MTOR     SLC7A5    -0.814937768   0.974542498
MOTIVATING EXAMPLE: KEGGLINCS
• Primary Challenge [handed down from my research mentor]:
• Map gene-gene relationship data generated from the LINCS L1000 data
set to the edges of KEGG pathways
• Generate data tables for analysis and conditional formatting of pathway
edges
• Not so straightforward…
• Images from KEGG website are abbreviated…
In the KGML, each node has a unique entry ID
(numeric) but may be associated with more
than one gene
Ex) Node 12 (labeled “Akt”) –> AKT1, AKT2,
AKT3
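The node-versus-gene mismatch can be sketched in a few lines of base R. The example below expands one node-level edge into gene-level edges; the entry ID 20 for an MTOR node and the single Akt → MTOR edge are hypothetical values chosen for illustration, and this is not the code KEGGlincs itself uses.

```r
# Node table: one row per (entry ID, gene). Node 12 ("Akt") maps to
# three genes, as on the slide; entry ID 20 for MTOR is hypothetical.
node_genes <- data.frame(
  entry_id = c(12, 12, 12, 20),
  gene     = c("AKT1", "AKT2", "AKT3", "MTOR"),
  stringsAsFactors = FALSE
)

# Edge table at the node level (hypothetical single edge 12 -> 20).
node_edges <- data.frame(from_id = 12, to_id = 20)

# Expand: join genes onto the source side, then onto the target side.
with_source <- merge(node_edges, node_genes,
                     by.x = "from_id", by.y = "entry_id")
expanded <- merge(with_source, node_genes,
                  by.x = "to_id", by.y = "entry_id",
                  suffixes = c("_source", "_target"))

print(expanded[, c("gene_source", "gene_target")])
```

One node-level edge becomes three gene-level edges here, which is why gene-gene data cannot be mapped directly onto the abbreviated pathway image.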
• …and digital representation may not be complete
GRAPH OF EDGES ON KEGG WEBSITE
GRAPH OF EDGES DOCUMENTED IN KGML
Edges ‘disappeared’ from KGML [and hence pathway] after pathway was updated
MOTIVATING EXAMPLE (CNTD)
• Preliminary Goal: develop a software platform that streamlines
edge-focused KEGG pathway visualization and analysis
• “Help filter signal from noise of data set with large n”
Streamlined KEGG pathway visualization:
• Add in edges with documentation in other KEGG pathways
• Expand edges
• Visualize L1000 data overlaid on edges
• Generate data structures to reflect altered topology for use in analysis
• Secondary goal: make platform easy to share with any
researcher interested in L1000 data and/or KEGG
pathway analysis
• Ultimate solution: Produce an R [Bioconductor] package –
“KEGGlincs”
SCHEMATIC OF HOW MY PACKAGE WORKS
(WHICH “NODE” DO YOU THINK IS THE MOST
ACTIVE?)
WHY AN R PACKAGE?
• “R is an integrated suite of software facilities for data
manipulation, calculation and graphical display”1
• “It allows statisticians to do very intricate and complicated analyses
without knowing the blood and guts of computing systems.”
• R is extremely relevant in our field
• Accomplish secondary goal: Publish package in bioinformatics-
focused software repository to easily share platform with other
researchers
• (Personally) Leap over that last hurdle…
• …while R is intended for those without a computer science background, it
certainly does not seem like it at times!
1) https://www.r-project.org/about.html
Hurdles & Milestones When Working With R
• Using functions
• Troubleshooting error messages
• Managing variables in your working directory
• Writing your own functions
• Building your package library
• Working with R in a project environment
• Producing publishable results with R
• Making an interactive app with R Shiny
• Developing an R package
FIRST THINGS FIRST…WHAT IS R?
• “R is an open source programming language and software
environment for statistical computing and graphics
supported by the R Foundation for Statistical Computing.”
• R is OPEN SOURCE (FREE!)
• Encourages sharing of ideas, learning from others
• R has become software tool of choice for data scientists
• Especially relevant to the field of bioinformatics
• “Out of the box” R has the capabilities of a [very nice] graphing
calculator
• “Running a command” refers to the processing of lines of code
EXAMPLE OF R WORKING ENVIRONMENT
• “x”, “y” and
“model” are
data-associated
variables
• “plot”, “lm” and
“summary” are
functions
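A session like the one pictured might look like the sketch below. The simulated data (a line with slope 2 plus noise) is an assumption made up for this example; only the names `x`, `y`, `model`, `plot`, `lm` and `summary` come from the slide.

```r
# Simulated data (assumption for illustration): y is roughly 2 * x.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)

# "model" is a data-associated variable holding the fitted regression.
model <- lm(y ~ x)

# "summary" and "plot" are functions applied to those variables.
print(summary(model))
plot(x, y)
abline(model)  # overlay the fitted line on the scatter plot
```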
EXAMPLE OF R WORKING ENVIRONMENT
KEGG_lincs("hsa04115")
KEGG_lincs("hsa04115", "PC3")
With the right functions, R
can communicate with many
other working
environments/accommodate
other software languages!
Example of KEGGlincs “master
function”
R FUNCTIONS
• R functions are chunks of code (written in R coding language)
• Allow for reproducible data manipulation, analysis and
visualization.
• The R functions “plot”, “lm” and “summary” come from the R
packages graphics, stats, and base (all
considered ‘base packages’)
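One way to check where a function lives is to ask for the name of its enclosing environment. The helper `where_from()` below is a hypothetical name introduced for this sketch. Note that the answer for `plot` depends on the R version: the generic lived in graphics when these slides were written and moved to base in R 4.0.0.

```r
# Hypothetical helper: report the package namespace a function comes from.
# Works for ordinary (closure) functions such as lm, summary and plot.
where_from <- function(f) environmentName(environment(f))

print(where_from(lm))       # stats
print(where_from(summary))  # base
print(where_from(plot))     # graphics in older R, base since R 4.0.0
```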
WHAT IS AN R PACKAGE?....
• “Packages are the fundamental units of reproducible R code.”
• “They include reusable R functions, the documentation that describes how to use them,
and sample data.”
• Types of R packages:
• “Base/system packages” – contain ‘essential’ features and are pre-installed in R’s
library
• Ex) “datasets”, ”graphics”, “utils”
• “Add-on” packages can:
• Exist locally – early stages of development, for proprietary use
• Be installed from an open-source package repository
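The pre-installed base packages can be listed from within R itself. A minimal sketch, using `installed.packages()` from the utils package:

```r
# List the packages that ship with R at "base" priority.
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# The three examples from the slide should all be in that list.
print(c("datasets", "graphics", "utils") %in% base_pkgs)
```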
THE R-PACKAGE-REPO ‘SPECIFICITY’ HIERARCHY
• GitHub: version-control platform for packages written in many different software languages
• R-Forge: “A central platform for the development of R packages, R-related software and further projects”
• CRAN (The Comprehensive R Archive Network): main repository for R packages
• Bioconductor: repository for bioinformatics-based R packages
• The Comprehensive R Archive Network: a package must make it past the
review process to be “published” in the repository
Main benefit of publishing a package == streamline installation for
other users
What are the basic documents contained
in a published/publishable R package?
WHAT IS AN R PACKAGE?....
Package Folder → R Folder → .R Files → Functions [and their documentation*] written in R code
Package Folder → man Folder → .Rd File → Code used to generate standard manual
Package Folder → NAMESPACE
Package Folder → DESCRIPTION
*Packages such as roxygen2 and devtools facilitate package documentation and management
EX) Running devtools::document() automatically generates .Rd files as specified by the documentation for
each function in the R file
Package Folder → data folder → .rda file
• “data” folder/files needed if:
• Case 1: A specific data set is intrinsic to the function’s utility (data set called within function)
• Case 2: Examples [in documentation] require specific data objects to run successfully
• If the data file needed by a package is very large (GB vs MB), a separate data…
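To see the folder layout on disk, base R can generate a bare-bones package from an existing function. The sketch below uses `utils::package.skeleton()` as a stand-in for the devtools/roxygen2 workflow mentioned on the slides; the function `count_edges()` and the package name `edgedemo` are hypothetical names made up for this example.

```r
# A hypothetical function we want to wrap in a package.
count_edges <- function(edge_table) nrow(edge_table)

# Build the skeleton in a temporary directory so nothing is left behind.
pkg_parent <- tempfile("pkgdemo")
dir.create(pkg_parent)
utils::package.skeleton(
  name = "edgedemo",           # hypothetical package name
  list = "count_edges",        # objects to include
  environment = environment(), # where to look for those objects
  path = pkg_parent
)

# Inspect what was created: R/ and man/ folders plus the
# DESCRIPTION and NAMESPACE files described above.
created <- list.files(file.path(pkg_parent, "edgedemo"))
print(created)
print(list.files(file.path(pkg_parent, "edgedemo", "man")))
```

The generated .Rd files are only stubs; with roxygen2 they would instead be produced from comments written next to each function.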
WHAT IS A BIOCONDUCTOR PACKAGE?
•Bioconductor packages must meet requirements for R
packages and be aligned with goals set by Bioconductor
• “Provide widespread access to a broad range of powerful statistical and
graphical methods for the analysis of genomic data.
• Facilitate the inclusion of biological metadata in the analysis of genomic data
• Provide a common software platform that enables the rapid development
and deployment of plug-able, scalable, and interoperable software.
• Train researchers on computational and statistical methods for the analysis of
genomic data.
• Further scientific understanding by producing high-quality **documentation and
reproducible research**.”
[TECHNICAL] ADDITIONAL REQUIREMENTS
FOR BIOCONDUCTOR PACKAGES
• Package files must meet standards set by Bioconductor (pass
BiocCheck)
• Ex) line lengths, indentation sizes, etc.
• Must include a vignette that includes detailed documentation of
example usage
Package Folder → vignettes Folder → .Rmd file → Code used to
generate vignette
MANUAL VS. PDF
Vignette:
• “Provides a task-oriented description of package functionality”
• Bioconductor packages must have at least one vignette
Manual:
• Contains entries for each function contained in the package
• Automatically…
PACKAGE → PUBLISHED BIOCONDUCTOR PACKAGE
• Preliminary steps:
• Manage package via GitHub (details in supplemental slides)
• If your package requires a large data set (> 5MB), create a separate data
package
• Check that package works on 3 main operating systems (Windows,
macOS, Linux)
• Submit to Bioconductor
• Check release schedule – new Bioconductor version every 6 months
(October & April)
• Package review process takes approx. 2-5 weeks
• After package is accepted:
• Package maintenance via GitHub Bioconductor-mirror
• Check build/check report after introducing changes
KEGGLINCS: LOCATIONS AND
DOCUMENTATIONS
BIOCONDUCTOR LANDING PAGE
REFERENCE MANUAL
As an installed package,
information from the manual
can be retrieved by prefacing a
package/function with
“?” and pressing enter
**I strongly recommend RStudio … especially for those new to R**
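For readers who have not used the help system, a minimal sketch of looking up manual entries follows. It uses `lm` from the stats package as the example topic because KEGGlincs may not be installed; the same calls would apply to any installed package.

```r
# At the console you would normally type:  ?lm
# The function form below does the same lookup and can be stored.
h <- help("lm", package = "stats")
print(length(h) >= 1)  # TRUE when at least one help page was found

# List everything a package exports, then check a function is among them.
stats_exports <- ls("package:stats")
print("lm" %in% stats_exports)

# help(package = "stats") would open the package's index of manual entries.
```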
GITHUB WEB PAGE/REPOSITORY
VIGNETTE OF *DETAILED* WORKFLOWS
POSTER: PRESENTED AT CONFERENCES AND
PUBLISHED VIA F1000 RESEARCH (NON-PEER
REVIEWED)
The pathway maps provided by KEGG
summarize information across genes to
single nodes or even groups of nodes. While
this may be convenient for basic
visualization, it is limiting in terms of
understanding the exact relationships that
are documented in the databases used by
KEGG. The primary functions of KEGGlincs
download up-to-date information from
KEGG, parse/reconfigure it, and then re-map
the updated information via the interactive
software Cytoscape. Although more
complex, the updated map features
represent the exact relationships mapped in
KEGG, have more explicit definitions of edge
subtypes, and show clear mark-up of
grouped nodes (see Figures 1a/1b).
TOP: Original pathway available from the
KEGG website
BOTTOM: Pathway rendered in Cytoscape
with expanded edges and formatting using
single command within R: KEGG_lincs("hsa04068")
Edge Color Key
Red: Activation or Expression *
Orange: Activating PTM **
Green: PTM (no activation/inhibition activity defined)
Blue: Inhibition
Purple: Inhibiting PTM
Black(solid): Binding/Association
Black(dashed): Indirect effect (no activation/inhibition activity defined)
*Any dashed colored line indicates that the effect is indirect
**PTM = post-translational modification or, as KEGG defines them,
‘molecular events’. The specific types of PTMs (indicated by edge label)
include:
+p: phosphorylation
-p: dephosphorylation
+g: glycosylation
+u: ubiquitination
+m: methylation
KEGGlincs Design and Application:
An R Package for Exploring Relationships in Biological Pathways
Shana White1, Mario Medvedovic1
1Laboratory for Statistical Genomics, Department of Environmental Health, Division of Biostatistics and Bioinformatics, University of Cincinnati College of Medicine, 3223 Eden Ave. ML 56, Cincinnati OH 45267-0056, USA
Overview of Data Integration & Visualization
a. KEGG information in the form of KGML
(modified EXtensible Markup Language or
XML format) files.
b. Optional feature: Add edge data; the
companion data package KOdata contains
both baseline expression levels for individual
genes and consensus genomic signatures
(CGS’s) for gene-knock-out’s corresponding to
cell-line-specific LINCS L1000 data.
c. KGML files are parsed, edges are expanded
and user data is incorporated as mapping
features using the Bioconductor package
KEGGlincs with R software.
d. Data is seamlessly transported from the R
environment to graphing software via
functionality developed by the RESTful API
cyREST.
e. Maps are ultimately visualized with the
interactive graphing software Cytoscape.
BD2K-LINCS Data Coordination and Integration Center
The LINCS L1000 dataset maintained by the BROAD institute is a collection of gene expression
results/consensus genomic signatures (CGS’s) particular to a given [typically cancer] cell type that arise
when single genes are knocked out or cells are perturbed with a single drug/chemical in controlled
experiments. For this example, gene-gene and gene-drug relationships in the ErbB Signaling pathway are
compared across the MCF7 (breast cancer) and A375 (melanoma) cell-lines.
Abstract
The Library of Integrated Network-Based Cellular Signatures (LINCS) project is a data generation venture that is a
quintessential example of current efforts concerning ‘big data’ in the biomedical research environment.
One element of this project is the production of gene expression profiles corresponding to individual gene
knock-outs within specific cancer cell lines. The R package ‘KEGGlincs’ and the companion data package
‘KOdata’, both recently published/updated with the latest version of Bioconductor (3.4/3.5), were
developed to promote synergy between existing pathway structures from the Kyoto Encyclopedia of Genes
and Genomes (KEGG) and LINCS data in order to reveal mechanisms of biochemical signaling processes that
display heterogeneity across different types of cells.
KEGG pathways are manually-curated biological pathways represented as networks of nodes (genes) and
directed edges whereby experimental evidence determines the nature and direction of an edge
(relationship) between genes. The network structure for pathways that KEGG provides is a promising tool
for bioinformatics research, and indeed there are existing methods for quantifying the level of pathway
perturbation [within an experiment] that make use of KEGG pathways. However, the existing approaches
consider changes of gene expression only of the genes in a particular pathway and not changes in
expression of downstream targets. This restricts the definition of perturbation to mean change in gene
expression rather than a much broader, and perhaps more meaningful, change in gene function.
The LINCS data available from the KOdata resource along with the functionality offered by KEGGlincs allow
for the investigation of relationships between genes in a given pathway in a cell-type-specific manner via
analysis of overlapping de-regulated genes corresponding to pairs of experimental knockouts. This
approach to pathway analysis yields quantitative measures and a novel method for annotating
relationships (edges) between genes programmatically created in R and automatically visualized in an
interactive session via Cytoscape software.
Example Application:
Visualizing Cancer Cell-Line-Specific Relationships in the
ErbB Signaling Pathway with LINCS L1000 data
Introduction :‘Edge-Focused’ Interactive Pathway Visualization
References
Kanehisa M and Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 2000; 28: 27-30 .
Ono K, et al. CyREST: Turbocharging Cytoscape Access for External Tools via a RESTful API. F1000Research 2015; 4 :478.
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 2003 Nov; 13(11):2498-504.
White S (2016). KEGGlincs: Visualize all edges within a KEGG pathway and overlay LINCS data [option]. R package version 1.0.0.
Acknowledgements
This work has been funded by the NIH grant U54HL127624 and the Molecular Epidemiology in Children’s Environmental Health training grant T32-ES10957.
Edge colors represent the following possible combinations of direction
of Fisher’s Exact Test summary scores (a modified Odds Ratio (OR)
score; either positive (+) or negative (-)) and their corresponding
[adjusted] p-values:
Red: OR(+), pval(sig)
Orange: OR(+), pval(non-sig)
Purple: OR(-), pval(non-sig)
Blue: OR(-), pval(sig)
The width of each edge is mapped to represent the magnitude of the OR
(a thicker edge indicates a larger OR relative to other edges with data).
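The color key above amounts to a small lookup on the sign of the score and the significance of the p-value. The sketch below is one way to write it in base R; the function name `edge_color()`, the 0.05 cutoff, and the treatment of a score of exactly zero as negative are assumptions for illustration, not details taken from KEGGlincs.

```r
# Hypothetical helper: map (score, p-value) pairs to the poster's colors.
# alpha = 0.05 is an assumed significance cutoff.
edge_color <- function(score, p_value, alpha = 0.05) {
  sig <- p_value < alpha
  ifelse(score > 0,
         ifelse(sig, "red", "orange"),   # OR(+): significant / not
         ifelse(sig, "blue", "purple"))  # OR(-): significant / not
}

# Two rows taken from the example edge data shown earlier.
edge_data <- data.frame(
  gene1   = c("DEPTOR", "MTOR"),
  gene2   = c("MTOR", "IGBP1"),
  score   = c(-0.285755049, 2.123934037),
  p_value = c(0.398477502, 0.000045456),
  stringsAsFactors = FALSE
)
edge_data$color <- edge_color(edge_data$score, edge_data$p_value)

# Edge width proportional to the magnitude of the score, relative to
# the largest magnitude among edges with data.
edge_data$width <- abs(edge_data$score) / max(abs(edge_data$score))
print(edge_data)
```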
Figure 1a: ErbB Pathway + knock-out data for MCF7
Figure 1b: ErbB Pathway + knock-out data for A375
Figure 2: ErbB Pathway + knock-out data to compare MCF7 and A375 cell-lines
Figure 3: ErbB Pathway + drug data to compare MCF7 and A375 cell-lines
Figures 1a and 1b display the edge-mapping
offered by the KEGG_lincs function:
KEGG_lincs("hsa04012", "MCF7")
KEGG_lincs("hsa04012", "A375")
These figures suggest that many of the gene-
gene relationships annotated in the KGML for
the ErbB pathway show concordance as
determined by L1000 data for both cell lines.
Alternatively, users may wish to generate a map
that will highlight the differences between two
cell-lines in one summarized picture; Figure 2
displays this utility. This type of map is
generated using the development function
KL_compare:
KL_compare("hsa04012", "MCF7", "A375")
For convenient viewing, edges that are shared
between two nodes get grouped together if they
fall into the same category as follows:
Orange: Higher concordance in first cell-line
Green: Higher concordance in second cell-line
The width of the arrow represents the effect size
(or average effect size, if shared) of differences
in edge concordance [as measured by a test
statistic to compare OR’s]; the shade of the edge
roughly indicates how many edges share that
location/category (note that these are easily
recovered in an interactive Cytoscape session).
Figure 3 has been generated to showcase this
package’s utility in terms of incorporating drug
perturbation signatures from the L1000 data set
into pathway maps. The development function
KL_drug returns known drug-target
relationships specific to the pathway of choice;
these are not usually included in the original
pathway KGML. These drug-gene relationships
are added as edges and are quantified in a
manner identical to that for gene-gene edges.
The edge colors highlight differences as follows:
Yellow: Higher concordance in first cell-line
Blue: Higher concordance in second cell-line
A quick interpretation of this map suggests that drugs
that affect the ErbB pathway target BRAF in
A375 cells more so than in MCF7 cells and that
Vemurafenib is perhaps the most effective. On
the other hand, ErbB is more of a direct target
for the MCF7 cell-line compared to A375.
IN-THE-WORKS: SUBMIT KEGGLINCS AS
APPLICATION NOTE IN JOURNAL
BIOINFORMATICS
Multiple sources of information
can be confusing when using a
package
KEGGLINCS: LOCATIONS AND
DOCUMENTATIONS
• Bioconductor Landing Page: info ‘hub’
• GitHub Bioconductor-mirror: how functions are coded
• Reference Manual: how to use individual function; indexed
• Vignette: intended use for package
• Poster Publication: package summary
• Bioinformatics Application Note: package summary; technical notes
QUICK NOTE: R/BIOCONDUCTOR
TROUBLESHOOTING
• Somewhat paradoxically, getting a package to install successfully
can be a roadblock to using it
• Check that you have the most recent version of R and Bioconductor
before installing new Bioconductor packages.
• Some packages depend on other packages that must be installed before
they can successfully install
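A quick pre-installation check along these lines can be sketched in base R. The helper `missing_pkgs()` is a hypothetical name introduced here. The commented-out installation lines assume the BiocManager route that current Bioconductor releases use (it postdates these slides) and that KEGGlincs is available in your Bioconductor release.

```r
# Which R version is running? Compare this against the version required
# by the Bioconductor release you plan to install from.
print(R.version.string)

# Hypothetical helper: return the packages from `pkgs` that are NOT installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

print(missing_pkgs(c("stats", "BiocManager", "KEGGlincs")))

# Installation itself (left commented out so the sketch has no side effects):
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("KEGGlincs")
```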
CONCLUDING REMARKS
• Get to know R!
• Use R in projects to manage data, perform analyses and produce graphs
• Use R to enhance your learning in classes
• Ex) Translate algorithms from Design and Analysis of Algorithms to R code
• Don’t give up…with Google and determination you will figure it out
• Even if you are working with R code you do not intend to
publish, having your code organized as a package makes it
easier to manage
• Creating/Publishing a package is only the first step!!
• So your package is out there…now what?
QUESTIONS??
• “There are plenty of people out there that go to school for eight
years”
• “Yeah…they’re called doctors”
• Tommy Boy
SUPPLEMENTAL SLIDES
Introduction to Package Development and
Management
R-PACKAGE OVERVIEW
• Premise: You have written a useful function and have it saved
as a “.R” file
• The function works in conjunction with data, saved as a “.RData” file
• You have decided to take the leap and formally declare to yourself and
the world that you will endeavor to turn your hunk of code into an R
package
• Use RStudio to set up a package-related project
• Link project to GitHub
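For readers without RStudio at hand, the same starting point can be sketched with base R. This is an illustration under stated assumptions: edge_score, example_weights and demoPkg are hypothetical names, and utils::package.skeleton() is used here as a stand-in for the RStudio project wizard mentioned above, not as the workflow the talk followed.

```r
# Hypothetical function and data standing in for the ".R" and ".RData" files.
edge_score <- function(x, weights = example_weights) sum(x * weights)
example_weights <- c(0.5, 1.5, 2)

# Build the package in a temporary directory so nothing is overwritten.
pkg_parent <- tempfile("pkgdemo")
dir.create(pkg_parent)

# Base-R stand-in for the project wizard: writes DESCRIPTION, NAMESPACE,
# and R/, man/ and data/ folders for the listed objects.
utils::package.skeleton(
    name = "demoPkg",
    list = c("edge_score", "example_weights"),
    environment = environment(),
    path = pkg_parent
)

# Inspect what was created.
print(list.files(file.path(pkg_parent, "demoPkg")))
```

The generated folder can then be opened as a project and linked to a GitHub repository.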
OVERVIEW OF KGML: INFORMATION IN KEGG MARK-UP LANGUAGE
NODES
• Map Features
• Chemical Compounds
• Genes
• Grouping Information
EDGES
• Relationships between Nodes
KGML format
• Stores information for pathway structures and relationships among them
• XML document
• 2 main ‘tables’
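The two ‘tables’ can be illustrated with a toy KGML fragment. This is a rough sketch only: the pathway name and the single relation are made up for illustration, get_attr() is a hypothetical helper, and the regular-expression approach assumes one tag per line. A real KGML file should be read with a proper parser (for example parseKGML() in the Bioconductor package KEGGgraph).

```r
# Toy KGML fragment; the node with id 12 maps to three genes (AKT1, AKT2, AKT3).
kgml <- c(
    '<pathway name="path:hsa00000" org="hsa">',
    '  <entry id="12" name="hsa:207 hsa:208 hsa:10000" type="gene"/>',
    '  <entry id="34" name="hsa:2475" type="gene"/>',
    '  <relation entry1="12" entry2="34" type="PPrel"/>',
    '</pathway>'
)

# Hypothetical helper: pull one attribute value out of each line (NA if absent).
get_attr <- function(lines, attr) {
    pattern <- paste0('.* ', attr, '="([^"]*)".*')
    ifelse(grepl(pattern, lines), sub(pattern, "\\1", lines), NA_character_)
}

entry_lines <- grep("<entry ", kgml, value = TRUE, fixed = TRUE)
relation_lines <- grep("<relation ", kgml, value = TRUE, fixed = TRUE)

# Node table: one row per entry, possibly several genes per node.
nodes <- data.frame(
    id = get_attr(entry_lines, "id"),
    genes = get_attr(entry_lines, "name"),
    type = get_attr(entry_lines, "type"),
    stringsAsFactors = FALSE
)

# Edge table: one row per relation, pointing at node ids rather than genes.
edges <- data.frame(
    source = get_attr(relation_lines, "entry1"),
    target = get_attr(relation_lines, "entry2"),
    type = get_attr(relation_lines, "type"),
    stringsAsFactors = FALSE
)

print(nodes)
print(edges)
```

Because edges refer to node ids and a node can hold several genes, one KGML edge may stand for several gene-gene relationships.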
R-PACKAGE BASICS
• Best practice: Use GitHub to manage package
• Update, check, and sync to GitHub directly from RStudio
• Documentation required for: functions, data, package
• Useful packages: devtools and roxygen2
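As an illustration of function-level documentation, here is a small function with a roxygen2 comment header. The function count_shared() and its arguments are hypothetical examples, not part of KEGGlincs.

```r
#' Count genes shared by two gene signatures
#'
#' @param sig_a Character vector of gene symbols.
#' @param sig_b Character vector of gene symbols.
#' @return Integer count of genes present in both signatures.
#' @examples
#' count_shared(c("MTOR", "AKT1"), c("AKT1", "TP53"))
#' @export
count_shared <- function(sig_a, sig_b) {
    length(intersect(sig_a, sig_b))
}

# Inside a package project, running devtools::document() would turn the
# header above into man/count_shared.Rd and update NAMESPACE.
print(count_shared(c("MTOR", "AKT1"), c("AKT1", "TP53")))
```

Keeping the documentation next to the code means both are updated in the same commit.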
*SPECIAL* BIOCONDUCTOR ISSUES
• Data management
• Data package may need to be developed if working with large data sets
• Interoperation with other R packages
• Does Bioconductor support your package’s dependencies?
• Documentation
• Requires detailed vignette
• Code format specifications
• Ex: Indent = 4 spaces, line limit = 80 chars
• Publication ‘deadlines’
• New Bioconductor versions released in October and April
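The two formatting rules quoted above can be checked with a few lines of base R. This is a simplified sketch: check_style() is a hypothetical helper, and it flags any indentation that is not a multiple of 4 spaces, so aligned continuation lines would also be reported. Bioconductor's own BiocCheck package performs more complete checks.

```r
# Hypothetical helper: flag lines that are too long or oddly indented.
check_style <- function(lines, max_width = 80L, indent = 4L) {
    # Number of leading spaces on each line.
    leading <- nchar(lines) - nchar(sub("^ +", "", lines))
    data.frame(
        line = seq_along(lines),
        too_long = nchar(lines) > max_width,
        bad_indent = leading %% indent != 0L,
        stringsAsFactors = FALSE
    )
}

# Example input: the third line is indented by 2 spaces instead of 4.
example_code <- c(
    "f <- function(x) {",
    "    x + 1",
    "  y <- 2",
    "}"
)
report <- check_style(example_code)
print(report)
```

In practice the input would come from readLines() on each file in the package's R folder.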
SUBMITTING TO BIOCONDUCTOR
• Open a new issue at the Bioconductor Contributions repository
(GitHub)
• Set package version to 0.99.0 in DESCRIPTION file
• A ‘real person’ is assigned to your package
• Tests package across 3 platforms (Windows, Mac OS, Linux)
• Answers questions from developer
• Evaluates fit with Bioconductor (??)
• Update package based on reviewers’ comments until acceptance
• Bump version number “z” by 1 in DESCRIPTION (i.e., 0.99.1) and push changes to GitHub
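The version bump can be sketched as follows. This is only an illustration: bump_patch() is a hypothetical helper, and the DESCRIPTION file is a two-field stand-in written to a temporary location.

```r
# Hypothetical helper: increase the last ("z") component of "x.y.z" by 1.
bump_patch <- function(version) {
    parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(parts) == 3L, !anyNA(parts))
    parts[3] <- parts[3] + 1L
    paste(parts, collapse = ".")
}

# Stand-in DESCRIPTION file with the initial submission version.
desc_file <- tempfile("DESCRIPTION")
writeLines(c("Package: demoPkg", "Version: 0.99.0"), desc_file)

# Read the fields, bump the version, and write the file back.
desc <- read.dcf(desc_file)
desc[1, "Version"] <- bump_patch(desc[1, "Version"])
write.dcf(desc, desc_file)

print(readLines(desc_file))
```

After the file is rewritten, the change still has to be committed and pushed to GitHub for the reviewers to see the new build.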
MAINTAINING BIOCONDUCTOR PACKAGES
• Currently handled via git-svn sync
• Switching over to git – maintain from RStudio session (?)
• Nightly build checks
• Updates to package need to pass checks
R Packages Unpacked

  • 1. R PACKAGES UNPACKED R-PACKAGE USE, DEVELOPMENT, AND A FORAY INTO BIOCONDUCTOR PACKAGES SHANA WHITE PHD CANDIDATE: BIOSTATISTICS + BIG DATA TRACK PRE-DOCTORAL FELLOW: MECEH BIOSTATISTICS SEMINAR: SEPTEMBER 1, 2017
  • 2. OVERVIEW • Motivating example: KEGGlincs • Background info (graphs, biological networks) • Challenge – edge-focused pathway annotation • R Packages • What [are they exactly]? • How [do you navigate them]? • Where [do you find or publish them]?
  • 3. NETWORK GRAPHS & DATA: QUICK PRIMER • Network/Network Graph = • Nodes (entities) • People, cities, genes • Edges (relationships between nodes) • Similar heritage (people), distance (cities), interaction (genes) • Size/shape/color of nodes & edges formatted to summarize important characteristics, for example…
  • 4. NETWORK GRAPHS & DATA: QUICK PRIMER • Network/Network Graph = • Nodes (entities) • People, cities, genes • Edges (relationships between nodes) • Similar heritage (people), distance (cities), interaction (genes) • Size/shape/color of nodes & edges formatted to summarize important characteristics, for example… Directionality From → To, Source → Target
  • 5. NETWORK GRAPHS & DATA: QUICK PRIMER • Network/Network Graph = • Nodes (entities) • People, cities, genes • Edges (relationships between nodes) • Similar heritage (people), road connection (cities), interaction (genes) • Size/shape/color of nodes & edges formatted to summarize important characteristics, for example… Strength of Relationship Edge weight represents edge-related variable Ex) Number of relatives in common, distance
  • 6. NETWORK GRAPHS & DATA: QUICK PRIMER • Network/Network Graph = • Nodes (entities) • People, cities, genes • Edges (relationships between nodes) • Similar heritage (people), road connection (cities), interaction (genes) • Size/shape/color of nodes & edges formatted to summarize important characteristics, for example… Nature of Relationship Edge color represents edge-related variable Ex) activation or repression (genes)
  • 7. NETWORK GRAPHS & DATA: QUICK PRIMER • Network/Network Graph = • Nodes (entities) • People, cities, genes • Edges (relationships between nodes) • Similar heritage (people), road connection (cities), interaction (genes) • Size/shape/color of nodes & edges formatted to summarize important characteristics, for example… Node Attribute Node color represents node-related-variable Ex) male or female (people), high expression or low expression (genes)
  • 8. NETWORK GRAPHS & DATA: QUICK PRIMER • Network/Network Graph = • Nodes (entities) • People, cities, genes • Edges (relationships between nodes) • Similar heritage (people), road connection (cities), interaction (genes) • Size/shape/color of nodes & edges formatted to summarize important characteristics, for example… Node Attribute Node size represents node-related-variable Ex) number of residents (cities), graph attribute (connectedness)
  • 9. NETWORK GRAPH PRIMER CNTD.: DATA Source A B C D F G Target A 0 1 1 1 0 0 B 1 0 0 0 0 0 C 1 0 0 1 1 1 D 1 0 1 0 0 0 F 0 0 1 0 0 0 G 0 0 1 0 0 0 source target weight color A B 8 red A C 6 blue A D 3 red C F 5 blue C D 9 red C G 1 blue D E 4 red D G 9 blue B H 2 red B I 3 blue B J 5 red I J 7 blue A H 7 red H E 1 blue E K 8 red Node weight color A 5 orange B 2 green C 7 orange D 3 green F 2 green G 2 orange • Graph image visually summarizes data • Data objects [as matrices] that ‘encode’ the topology can be used in analyses • Ex: Find subnetworks, measure differences in topology between networks, highlight important nodes • Visualization & analysis of graph objects Adjacency Matrix Edge Information Node Information EDGE DATA NODE DATA
  • 10. BIOLOGICAL NETWORKS AND BIOINFORMATICS MOLECULAR NETWORKS IN THE AGE OF ‘OMICS’ The ‘big [omics] data’ sets essential to bioinformatics research are typically generated/analyzed at the molecular level Awesome image from: http://www.bioregulatory-systems-medicine.com/en/brsm-model/autoregulation-of-biological-networks
  • 11. BIOLOGICAL NETWORKS AND BIOINFORMATICS MOLECULAR NETWORKS IN THE AGE OF ‘OMICS’ Awesome image from: http://www.bioregulatory-systems-medicine.com/en/brsm-model/autoregulation-of-biological-networks
  • 12. BIOLOGICAL NETWORKS AND BIOINFORMATICS MOLECULAR NETWORKS IN THE AGE OF ‘OMICS’ Signaling pathway image: http://openi.nlm.nih.gov/detailedresult.php?img=2993235_mplantssq046f07_4c&req=4 Molecules + How they interact → Nodes + Edges → Network
  • 13. BIOLOGICAL NETWORKS AND BIOINFORMATICS MOLECULAR NETWORKS IN THE AGE OF ‘OMICS’ Signaling pathway image: http://openi.nlm.nih.gov/detailedresult.php?img=2993235_mplantssq046f07_4c&req=4 Molecules + How they interact → Nodes + Edges → Network Biological [Molecular] Network Nodes = Genes, proteins, chemical compounds, drugs Edges = Relationships/Interactions between molecular entities Signaling Pathway: Directed Edges Molecular networks are constructs to filter the ‘signal from the noise’ Any given network is really a subnetwork of the entire system (i.e. one cellular process among many happening simultaneously) Goal of biological network analysis - “Distill the signal from the noise” by combining ‘omics’ data and pathway topology to EX) Gene Signaling Pathway
  • 14. KEGG: POPULAR REPOSITORY FOR BIOLOGICAL PATHWAYS • Over 300 [signaling] pathways • Graph image summarizes data from multiple sources • Data objects [as matrices] that ‘encode’ the topology can be used in analyses • *After parsing from “KGML” (KEGG mark-up language) file • Overlaying data on pathway nodes for analysis and visualization has been addressed both by KEGG and other Nodes: Genes/Proteins Edges: Relationship between genes/proteins
  • 15. MOTIVATING EXAMPLE: KEGGLINCS • Primary Challenge [handed down to me from Dr. Medvedovic]: • Map gene-gene relationship data generated from the LINCS L1000 data set to the edges of KEGG pathways
  • 16. LINCS L1000 KNOCK-OUT DATA Cancer Cell Lines http://www.lincsproject.org/LINCS/tools (ex: MCF7, PC3, HA1E) 9,000+ Gene Perturbations x LINCS = Library of Integrated Network-Based Cellular Signatures
  • 17. L1000 DATA COLLECTION (BRIEFLY!) Unperturbed Samples Functional Knock-Out (KO) Samples 1000 ‘Landmark’ Genes Gene Signature: Top 100 UP/DOWN-regulated genes Gene perturbation via shRNA: disrupt conversion of mRNA into functional protein Changes in cellular regulation in absence of functional gene Generate gene-expression data
  • 18. LINCS L1000 KNOCK-OUT DATA Cancer Cell Lines http://www.lincsproject.org/LINCS/tools (ex: MCF7, PC3, HA1E) 9,000+ Gene Perturbations x LINCS = Library of Integrated Network-Based Cellular Signatures
  • 19. LINCS L1000 KNOCK-OUT DATA Cancer Cell Lines http://www.lincsproject.org/LINCS/tools (ex: MCF7, PC3, HA1E) 9,000+ Gene Perturbations x Cell Line Perturbation Signature c1 p1 s1,1 c2 p1 s2,1 ⋮ p1 ⋮ cn-1 p1 sn-1,1 cn p1 sn,1 c1 p2 s1,2 c2 p2 s2,2 ⋮ p2 ⋮ cn-1 p2 sn-1,2 cn p2 sn,2 ⋮ ⋮ ⋮ c1 pm-1 s1,m-1 c2 pm-1 s2,m-1 ⋮ pm-1 ⋮ cn-1 pm-1 sn-1,m-1 cn pm-1 sn,m-1 c1 pm s1,m c2 pm s2,m ⋮ pm ⋮ cn-1 pm sn-1,m cn pm sn,m LINCS L1000 KO DATA = LINCS = Library of Integrated Network-Based Cellular Signatures
  • 20. LINCS L1000 OVERLAP DATA KO_1 KO_2 UP_in_Common DOWN_in_Common DOWN_UP UP_DOWN MTOR AKT1 25 17 2 3 25 2 3 17 • Within a given cell line, how concordant are changes in regulation when different genes are knocked out? Concordant Concordant Discordant Discordant Confusion Matrix Use modified Fisher’s test to calculate summary score and p-value
  • 21. EXAMPLE OF EDGE DATA TO BE MAPPED TO KEGG gene 1 gene 2 score p-value DEPTOR MTOR -0.285755049 0.398477502 IL1RAP MTOR 0.964732247 0.577905732 MTOR IGBP1 2.123934037 0.000045456 MTOR RPS6KB1 0.161554125 0.625360024 MTOR RPS6KB1 -0.192375372 0.289930272 MTOR RPS6KB2 0.732379001 0.448532973 PIK3R2 MTOR 0.701104025 0.543121475 RRAGC MTOR 0.109984056 0.488447135 MTOR SLC7A5 -0.814937768 0.974542498
  • 22. MOTIVATING EXAMPLE: KEGGLINCS • Primary Challenge [handed down from my research mentor]: • Map gene-gene relationship data generated from the LINCS L1000 data set to the edges of KEGG pathways • Generate data tables for analysis and conditional formatting of pathway edges • Not so straightforward… • Images from KEGG website are abbreviated…
  • 23. In the KGML, each node has a unique entry ID (numeric) but may be associated with more than one gene Ex) Node 12 (labeled “Akt”) –> AKT1, AKT2, AKT3
  • 24. MOTIVATING EXAMPLE: KEGGLINCS • Primary Challenge [handed down from my research mentor]: • Map gene-gene strength-of-relationship data generated from the LINCS L1000 data set to the edges of KEGG pathways • Not so straightforward… • Images from KEGG website are abbreviated… • …and digital representation may not be complete
  • 25. GRAPH OF EDGES ON KEGG WEBSITE
  • 26. GRAPH OF EDGES DOCUMENTED IN KGML
  • 27. GRAPH OF EDGES DOCUMENTED IN KGML Edges ‘disappeared’ from KGML [and hence pathway] after pathway was updated
  • 28. GRAPH OF EDGES ON KEGG WEBSITE Edges ‘disappeared’ from KGML [and hence pathway] after pathway was updated
  • 29. MOTIVATING EXAMPLE (CNTD) • Preliminary Goal: develop a software platform that streamlines edge-focused KEGG pathway visualization and analysis • “Help filter signal from noise of data set with large n”
  • 30. Add in edges with documentation in other KEGG pathways Expand edges Visualize L1000 data overlaid on edges Generate data structures to reflect altered topology for use in analysis Streamlined KEGG pathway visualization
  • 31. MOTIVATING EXAMPLE (CNTD) • Preliminary Goal: develop a software platform that streamlines edge-focused KEGG pathway visualization and analysis • “Help filter signal from noise of data set with large n” • Secondary goal: make platform easy to share with any researcher interested in L1000 data and/or KEGG pathway analysis • Ultimate solution: Produce an R [Bioconductor] package – “KEGGlincs”
  • 32. WHY AN R PACKAGE? • “R is an integrated suite of software facilities for data manipulation, calculation and graphical display”1 1) https://www.r-project.org/about.html
  • 33. SCHEMATIC OF HOW MY PACKAGE WORKS (WHICH “NODE” DO YOU THINK IS THE MOST ACTIVE?)
  • 34. WHY AN R PACKAGE? • “R is an integrated suite of software facilities for data manipulation, calculation and graphical display”1 • “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.” • R is extremely relevant in our field • Accomplish secondary goal: Publish package in bioinformatics-focused software repository to easily share platform with other researchers • (Personally) Leap over that last hurdle… • …while R is intended for those without a computer science background, it certainly does not seem like it at times! 1) https://www.r-project.org/about.html
  • 35. • Using functions • Troubleshooting error messages • Managing variables in your working directory • Writing your own functions • Building your package library • Working with R in a project environment • Producing publishable results with R • Making an interactive app with R Shiny • Developing an R package Hurdles & Milestones When Working With R
  • 36. FIRST THINGS FIRST…WHAT IS R? • “R is an open source programming language and software environment for statistical computing and graphics • supported by the R Foundation for Statistical Computing.” • R is OPEN SOURCE (FREE!) • Encourages sharing of ideas, learning from others • R has become software tool of choice for data scientists • Especially relevant to the field of bioinformatics • “Out of the box” R has the capabilities of a [very nice] graphing calculator • “Running a command” refers to the processing of lines of code
  • 37. EXAMPLE OF R WORKING ENVIRONMENT
  • 38. • “x”, “y” and “model” are data-associated variables • “plot”, “lm” and “summary” are functions EXAMPLE OF R WORKING ENVIRONMENT
  • 39. KEGG_lincs("hsa04115") KEGG_lincs("hsa04115", "PC3") With the right functions, R can communicate with many other working environments/accommodate other software languages! Example of KEGGlincs “master function”
  • 40. R FUNCTIONS • R functions are chunks of code (written in R coding language) • Allow for reproducible data manipulation, analysis and visualization. • The R functions “plot”, “lm” and “summary” come from the R packages graphics, stats, and base (all considered ‘base packages’)
  • 41. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” • Types of R packages: • “Base/system packages” – contain ‘essential’ features and are pre-installed in R’s library • Ex) “datasets”, ”graphics”, “utils” • “Add-on” packages can: • Exist locally – early stages of development, for proprietary use • Be installed from an open-source package repository
  • 42. THE R-PACKAGE-REPO ‘SPECIFICITY’ HIERARCHY Version-control platform for packages written in many different software languages “A central platform for the development of R packages, R- related software and further projects” Main repository for R packages The Comprehensive R Archive Network Repository for Bioinformatics-based R packages
  • 43. THE R-PACKAGE-REPO ‘SPECIFICITY’ HIERARCHY The Comprehensive R Archive Network Must make it past review process to be “published” in repository Main benefit of publishing a package == streamline installation for other users What are the basic documents contained in a published/publishable R package?
  • 44. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder
  • 45. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → R Folder
  • 46. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → R Folder → .R Files
  • 47. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → R Folder → .R Files → Functions [and their documentation*] written in R code
  • 48. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → man Folder
  • 49. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → man Folder → .Rd File
  • 50. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → man Folder → .Rd File → Code used to generate standard manual
  • 51. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → NAMESPACE
  • 52. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → DESCRIPTION
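The folder walk-through above can be reproduced by hand. The following is a minimal sketch in base R, assuming a hypothetical package called `mypkg` with a placeholder function `hello()`, built in a throwaway temporary directory. In practice, RStudio or devtools would normally generate these files.

```r
# Build a bare-bones package skeleton in a temporary directory.
# "mypkg" and hello() are hypothetical placeholders.
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_dir, "man"), showWarnings = FALSE)  # .Rd files go here

# DESCRIPTION: package metadata in "Field: value" (DCF) format.
write.dcf(
  data.frame(
    Package = "mypkg",
    Title = "A Placeholder Package",
    Version = "0.99.0",
    Description = "Illustrates the basic files of an R package.",
    License = "GPL-3",
    stringsAsFactors = FALSE
  ),
  file = file.path(pkg_dir, "DESCRIPTION")
)

# NAMESPACE: declares which functions are visible to users.
writeLines("export(hello)", file.path(pkg_dir, "NAMESPACE"))

# R/: the function definitions themselves.
writeLines(
  'hello <- function(name = "world") paste0("Hello, ", name, "!")',
  file.path(pkg_dir, "R", "hello.R")
)

# Inspect the files created so far (empty folders are not listed).
pkg_files <- list.files(pkg_dir, recursive = TRUE)
print(pkg_files)
```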
  • 54. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them*, and sample data.” *Packages such as roxygen2 and devtools facilitate package documentation and management* EX) Running devtools::document() automatically generates .Rd files as specified by the documentation for each function in the .R file
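To make the roxygen2 workflow concrete, here is a sketch of what a documented function in an .R file might look like. The function `node_degree()` is a made-up example, not part of any package discussed here. The `#'` lines are ordinary comments to R itself; inside a package project, running devtools::document() would turn them into man/node_degree.Rd.

```r
#' Count the edges attached to each node
#'
#' @param from Character vector of source node names.
#' @param to Character vector of target node names.
#' @return A named integer vector: number of edges touching each node.
#' @examples
#' node_degree(c("A", "A", "B"), c("B", "C", "C"))
#' @export
node_degree <- function(from, to) {
  stopifnot(length(from) == length(to))
  counts <- table(c(from, to))
  setNames(as.integer(counts), names(counts))
}

deg <- node_degree(c("A", "A", "B"), c("B", "C", "C"))
print(deg)
```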
  • 55. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” Package Folder → data folder
  • 56. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” • “data” folder/files needed if: • A specific data set is intrinsic to the function’s utility Package Folder → data folder → .rda file
  • 57. Case 1: Data set called within function
  • 58. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” • “data” folder/files needed if: • A specific data set is intrinsic to the function’s utility • Examples [in documentation] require specific data objects to run successfully Package Folder → data folder → .rda file
  • 59. Case 2: Data set required to successfully run examples
  • 60. WHAT IS AN R PACKAGE?.... • “Packages are the fundamental units of reproducible R code.” • “They include reusable R functions, the documentation that describes how to use them, and sample data.” • “data” folder/files needed if: • A specific data set is intrinsic to the function’s utility • Examples [in documentation] require specific data objects to run successfully • If the data file needed by a package is very large (GB vs MB), a separate data package may be needed Package Folder → data folder → .rda file
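As a rough illustration of the data folder, the sketch below saves a small object as an .rda file and loads it back, using base R only. The package folder and the `toy_edges` data set are hypothetical. In an installed package the object would normally be reached through data() or lazy loading rather than load().

```r
# Save a small example data set as data/<name>.rda, then load it back.
# The folder and the toy_edges object are hypothetical placeholders.
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

toy_edges <- data.frame(
  from = c("A", "A", "B"),
  to = c("B", "C", "C"),
  stringsAsFactors = FALSE
)
rda_path <- file.path(pkg_dir, "data", "toy_edges.rda")
save(toy_edges, file = rda_path)

# load() restores the object under its original name and returns that name.
rm(toy_edges)
loaded_names <- load(rda_path)
print(loaded_names)
str(toy_edges)
```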
  • 61. WHAT IS A BIOCONDUCTOR PACKAGE? • Bioconductor packages must meet requirements for R packages and be aligned with goals set by Bioconductor • “Provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data. • Facilitate the inclusion of biological metadata in the analysis of genomic data • Provide a common software platform that enables the rapid development and deployment of plug-able, scalable, and interoperable software. • Train researchers on computational and statistical methods for the analysis of genomic data. • Further scientific understanding by producing high-quality **documentation and reproducible research**.”
  • 62. [TECHNICAL] ADDITIONAL REQUIREMENTS FOR BIOCONDUCTOR PACKAGES • Package files must meet standards set by Bioconductor (pass BiocCheck) • Ex) line lengths, indentation sizes, etc. • Must include a vignette that includes detailed documentation Package Folder → vignettes Folder
  • 63. [TECHNICAL] ADDITIONAL REQUIREMENTS FOR BIOCONDUCTOR PACKAGES • Package files must meet standards set by Bioconductor (pass BiocCheck) • Ex) line lengths, indentation sizes, etc. • Must include a vignette that includes detailed documentation Package Folder → vignettes Folder → .Rmd file
  • 64. [TECHNICAL] ADDITIONAL REQUIREMENTS FOR BIOCONDUCTOR PACKAGES • Package files must meet standards set by Bioconductor (pass BiocCheck) • Ex) line lengths, indentation sizes, etc. • Must include a vignette that includes detailed documentation of example usage Package Folder → vignettes Folder → .Rmd file → Code used to generate vignette
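For orientation, a vignette source file is an R Markdown document with a small metadata header. The sketch below writes such a skeleton with base R. The package name, title, and file name are placeholders, and the header shown is the common knitr/rmarkdown convention rather than anything specific to KEGGlincs. It also assumes the package's DESCRIPTION lists `VignetteBuilder: knitr`.

```r
# Write a minimal R Markdown vignette skeleton to vignettes/<name>.Rmd.
# Package name, title, and file name are hypothetical placeholders.
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_dir, "vignettes"), recursive = TRUE,
           showWarnings = FALSE)

vignette_lines <- c(
  "---",
  'title: "Getting started with mypkg"',
  "output: rmarkdown::html_vignette",
  "vignette: >",
  "  %\\VignetteIndexEntry{Getting started with mypkg}",
  "  %\\VignetteEngine{knitr::rmarkdown}",
  "  %\\VignetteEncoding{UTF-8}",
  "---",
  "",
  "## Example usage",
  "",
  "Task-oriented walk-through of the package goes here."
)

rmd_path <- file.path(pkg_dir, "vignettes", "mypkg-intro.Rmd")
writeLines(vignette_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")
```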
  • 65. MANUAL VS. VIGNETTE • Vignette: “Provides a task-oriented description of package functionality” • Bioconductor packages must have at least one vignette • Manual: Contains entries for each function contained in the package • Automatically
  • 66. PACKAGE → PUBLISHED BIOCONDUCTOR PACKAGE • Preliminary steps: • Manage package via GitHub (details in supplemental slides) • If your package requires a large data set (> 5MB), create a separate data package • Check that package works on 3 main operating systems (Windows, macOS, Linux) • Submit to Bioconductor • Check release schedule – new Bioconductor version every 6 months (October & April) • Package review process takes approx. 2-5 weeks • After package is accepted: • Package maintenance via GitHub Bioconductor-mirror • Check build/check report after introducing changes
  • 71. Once a package is installed, information from its manual can be retrieved by prefacing a package/function name with “?” and pressing Enter **I strongly recommend RStudio … especially for those new to R**
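A short sketch of the “?” look-up from the console. `mean()` is used only because it ships with every R installation. The commented lines show related look-ups that open a viewer or need a specific package installed.

```r
# Look up documentation from the console.
h <- help("mean")   # the same look-up that typing ?mean performs
print(class(h))

# Related look-ups, left commented out because they open a viewer
# or require the named package to be installed:
# help(package = "stats")        # index of all help topics in a package
# vignette(package = "grid")     # list the vignettes a package provides
# browseVignettes("KEGGlincs")   # open the vignette index in a browser
```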
  • 74. KEGGLINCS: LOCATIONS AND DOCUMENTATIONS Bioconductor Landing Page GitHub Bioconductor-mirror Reference Manual
  • 76. KEGGLINCS: LOCATIONS AND DOCUMENTATIONS Bioconductor Landing Page GitHub Bioconductor-mirror Reference Manual Vignette
  • 77. POSTER: PRESENTED AT CONFERENCES AND PUBLISHED VIA F1000 RESEARCH (NON-PEER REVIEWED)
    KEGGlincs Design and Application: An R Package for Exploring Relationships in Biological Pathways
    Shana White1, Mario Medvedovic1
    1Laboratory for Statistical Genomics, Department of Environmental Health, Division of Biostatistics and Bioinformatics, University of Cincinnati College of Medicine, 3223 Eden Ave. ML 56, Cincinnati OH 45267-0056, USA
    BD2K-LINCS Data Coordination and Integration Center
    Abstract
    The Library of Integrated Network-Based Cellular Signatures (LINCS) project is a data generation venture that is a quintessential example of current efforts concerning ‘big data’ in the biomedical research environment. One element of this project is the production of gene expression profiles corresponding to individual gene knock-outs within specific cancer cell lines. The R package ‘KEGGlincs’ and the companion data package ‘KOdata’, both recently published/updated with the latest version of Bioconductor (3.4/3.5), were developed to promote synergy between existing pathway structures from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and LINCS data in order to reveal mechanisms of biochemical signaling processes that display heterogeneity across different types of cells. KEGG pathways are manually curated biological pathways represented as networks of nodes (genes) and directed edges whereby experimental evidence determines the nature and direction of an edge (relationship) between genes. The network structure for pathways that KEGG provides is a promising tool for bioinformatics research, and indeed there are existing methods for quantifying the level of pathway perturbation [within an experiment] that make use of KEGG pathways. However, the existing approaches consider changes of gene expression only of the genes in a particular pathway and not changes in expression of downstream targets. This restricts the definition of perturbation to mean change in gene expression rather than a much broader, and perhaps more meaningful, change in gene function. The LINCS data available from the KOdata resource along with the functionality offered by KEGGlincs allow for the investigation of relationships between genes in a given pathway in a cell-type-specific manner via analysis of overlapping de-regulated genes corresponding to pairs of experimental knockouts. This approach to pathway analysis yields quantitative measures and a novel method for annotating relationships (edges) between genes programmatically created in R and automatically visualized in an interactive session via Cytoscape software.
    Introduction: ‘Edge-Focused’ Interactive Pathway Visualization
    The pathway maps provided by KEGG summarize information across genes to single nodes or even groups of nodes. While this may be convenient for basic visualization, it is limiting in terms of understanding the exact relationships that are documented in the databases used by KEGG. The primary functions of KEGGlincs download up-to-date information from KEGG, parse/reconfigure it, and then re-map the updated information via the interactive software Cytoscape. Although more complex, the updated map features represent the exact relationships mapped in KEGG, have more explicit definitions of edge subtypes, and show clear mark-up of grouped nodes (see Figures 1a/1b).
    TOP: Original pathway available from the KEGG website
    BOTTOM: Pathway rendered in Cytoscape with expanded edges and formatting using a single command within R: KEGG_lincs("hsa04068")
    Edge Color Key
    Red: Activation or Expression *
    Orange: Activating PTM **
    Green: PTM (no activation/inhibition activity defined)
    Blue: Inhibition
    Purple: Inhibiting PTM
    Black (solid): Binding/Association
    Black (dashed): Indirect effect (no activation/inhibition activity defined)
    *Any dashed colored line indicates that the effect is indirect
    **PTM = post-translational modification or, as KEGG defines them, ‘molecular events’. The specific types of PTMs (indicated by edge label) include: +p: phosphorylation, -p: dephosphorylation, +g: glycosylation, +u: ubiquitination, +m: methylation
    Overview of Data Integration & Visualization
    a. KEGG information in the form of KGML (modified EXtensible Markup Language or XML format) files.
    b. Optional feature: Add edge data; the companion data package KOdata contains both baseline expression levels for individual genes and consensus genomic signatures (CGSs) for gene knock-outs corresponding to cell-line-specific LINCS L1000 data.
    c. KGML files are parsed, edges are expanded, and user data is incorporated as mapping features using the Bioconductor package KEGGlincs with R software.
    d. Data is seamlessly transported from the R environment to graphing software via functionality developed by the RESTful API cyREST.
    e. Maps are ultimately visualized with the interactive graphing software Cytoscape.
    Example Application: Visualizing Cancer Cell-Line-Specific Relationships in the ErbB Signaling Pathway with LINCS L1000 data
    The LINCS L1000 dataset maintained by the Broad Institute is a collection of gene expression results/consensus genomic signatures (CGSs) particular to a given [typically cancer] cell type that arise when single genes are knocked out or cells are perturbed with a single drug/chemical in controlled experiments. For this example, gene-gene and gene-drug relationships in the ErbB Signaling pathway are compared across the MCF7 (breast cancer) and A375 (melanoma) cell-lines.
    Figure 1a: ErbB Pathway + knock-out data for MCF7
    Figure 1b: ErbB Pathway + knock-out data for A375
    Figure 2: ErbB Pathway + knock-out data to compare MCF7 and A375 cell-lines
    Figure 3: ErbB Pathway + drug data to compare MCF7 and A375 cell-lines
    Edge colors represent the following possible combinations of direction of Fisher’s Exact Test summary scores (a modified odds ratio (OR) score; either positive (+) or negative (-)) and their corresponding [adjusted] p-values:
    Red: OR(+), pval(sig)
    Orange: OR(+), pval(non-sig)
    Purple: OR(-), pval(non-sig)
    Blue: OR(-), pval(sig)
    The width of each edge is mapped to represent the magnitude of the OR (a thicker edge indicates a larger OR relative to other edges with data).
    Figures 1a and 1b display the edge-mapping offered by the KEGG_lincs function: KEGG_lincs("hsa04012", "MCF7") and KEGG_lincs("hsa04012", "A375"). These figures suggest that many of the gene-gene relationships annotated in the KGML for the ErbB pathway show concordance as determined by L1000 data for both cell lines.
    Alternatively, users may wish to generate a map that will highlight the differences between two cell-lines in one summarized picture; Figure 2 displays this utility. This type of map is generated using the development function KL_compare: KL_compare("hsa04012", "MCF7", "A375"). For convenient viewing, edges that are shared between two nodes get grouped together if they fall into the same category as follows:
    Orange: Higher concordance in first cell-line
    Green: Higher concordance in second cell-line
    The width of the arrow represents the effect size (or average effect size, if shared) of differences in edge concordance [as measured by a test statistic to compare ORs]; the shade of the edge roughly indicates how many edges share that location/category (note that these are easily recovered in an interactive Cytoscape session).
    Figure 3 has been generated to showcase this package’s utility in terms of incorporating drug perturbation signatures from the L1000 data set into pathway maps. The development function KL_drug returns known drug-target relationships specific to the pathway of choice; these are not usually included in the original pathway KGML. These drug-gene relationships are added as edges and are quantified in a manner identical to that for gene-gene edges. The edge colors highlight differences as follows:
    Yellow: Higher concordance in first cell-line
    Blue: Higher concordance in second cell-line
    A quick interpretation of this map suggests that drugs that affect the ErbB pathway target BRAF in A375 cells more so than in MCF7 cells and that Vemurafenib is perhaps the most effective. On the other hand, ErbB is more of a direct target for the MCF7 cell-line compared to A375.
    References
    Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 2000; 28: 27-30.
    Ono K, et al. CyREST: Turbocharging Cytoscape Access for External Tools via a RESTful API. F1000Research 2015; 4: 478.
    R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
    Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 2003 Nov; 13(11): 2498-504.
    White S (2016). KEGGlincs: Visualize all edges within a KEGG pathway and overlay LINCS data [option]. R package version 1.0.0.
    Acknowledgements
    This work has been funded by the NIH grant U54HL127624 and the Molecular Epidemiology in Children’s Environmental Health training grant T32-ES10957.
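To make the poster's edge-coloring rule concrete, here is a rough sketch in base R. It uses a plain Fisher's exact test on a made-up 2x2 table of de-regulated genes for two knock-outs. The poster describes a *modified* odds ratio score whose details are not given, so this is not the KEGGlincs implementation. It assumes OR(+) means an odds ratio above 1 and "sig" means p < 0.05 with no multiple-testing adjustment. `edge_color()` is a hypothetical helper.

```r
# Toy 2x2 table: genes de-regulated (yes/no) after knocking out gene A
# versus gene B. All counts are made up for illustration.
overlap <- matrix(
  c(30,  5,
     6, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(knockout_A = c("yes", "no"),
                  knockout_B = c("yes", "no"))
)

ft <- fisher.test(overlap)

# Map direction of the odds ratio and significance to the poster's colors.
# Assumption: OR > 1 is "positive"; p < alpha is "significant".
edge_color <- function(odds_ratio, p_value, alpha = 0.05) {
  positive <- odds_ratio > 1
  significant <- p_value < alpha
  if (positive && significant) {
    "red"
  } else if (positive) {
    "orange"
  } else if (significant) {
    "blue"
  } else {
    "purple"
  }
}

toy_color <- edge_color(unname(ft$estimate), ft$p.value)
print(toy_color)
```

With real data, the p-values across all edges of a pathway would typically be adjusted first (for example with p.adjust()) before applying the color rule.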
  • 78. KEGGLINCS: LOCATIONS AND DOCUMENTATIONS Bioconductor Landing Page GitHub Bioconductor-mirror Reference Manual Vignette Poster Publication
  • 79. IN-THE-WORKS: SUBMIT KEGGLINCS AS APPLICATION NOTE IN JOURNAL BIOINFORMATICS
  • 80. KEGGLINCS: LOCATIONS AND DOCUMENTATIONS Bioconductor Landing Page GitHub Bioconductor-mirror Reference Manual Vignette Poster Publication Bioinformatics Application Note Multiple sources of information can be confusing when using a package
  • 81. KEGGLINCS: LOCATIONS AND DOCUMENTATIONS • Bioconductor Landing Page: Info ‘hub’ • GitHub Bioconductor-mirror: How functions are coded • Reference Manual: How to use individual function; indexed • Vignette: Intended use for package • Poster Publication: Package Summary • Bioinformatics Application Note: Package Summary; technical notes
  • 82. QUICK NOTE: R/BIOCONDUCTOR TROUBLESHOOTING • Somewhat paradoxically, getting a package to install successfully can be a roadblock to using it • Check that you have the most recent version of R and Bioconductor before installing new Bioconductor packages. • Some packages depend on other packages that must be installed before they can successfully install
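A small troubleshooting sketch along these lines: print the running R version and list which required packages are not yet installed. `missing_packages()` is a hypothetical helper, and "notARealPackage123" is a deliberately fake name. The commented lines show installation through BiocManager, the current Bioconductor installer, assuming the package is available for your Bioconductor version.

```r
# Return the subset of package names that are not installed.
# system.file() returns "" for packages it cannot find.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

print(R.version.string)
not_installed <- missing_packages(c("stats", "utils", "notARealPackage123"))
print(not_installed)

# Installation is left commented out because it needs a network connection:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# BiocManager::install("KEGGlincs")
```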
  • 83. CONCLUDING REMARKS • Get to know R! • Use R in projects to manage data, perform analyses and produce graphs • Use R to enhance your learning in classes • Ex) Translate algorithms from Design and Analysis of Algorithms to R code • Don’t give up…with Google and determination you will figure it out • Even if you are working with R code you do not intend to publish, having your code organized as a package makes it easier to manage • Creating/Publishing a package is only the first step!! • So your package is out there…now what?
  • 84. QUESTIONS?? • “There are plenty of people out there that go to school for eight years” • “Yeah…they’re called doctors” • Tommy Boy
  • 85. SUPPLEMENTAL SLIDES Introduction to Package Development and Management
  • 86. R-PACKAGE OVERVIEW • Premise: You have written a useful function and have it saved as a “.R” file • The function works in conjunction with data, saved as a “.RData” file • You have decided to take the leap and formally declare to yourself and the world that you will endeavor to turn your hunk of code into an R package • Use RStudio to set up a package-related project • Link project to GitHub
  • 87. OVERVIEW OF KGML: INFORMATION IN KEGG MARK-UP LANGUAGE KGML format • Stores information for pathway structures and relationships among them • XML document • 2 main ‘tables’ • NODES: Map Features, Chemical Compounds, Genes, Grouping Information • EDGES: Relationships between Nodes
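The two ‘tables’ can be pictured as a pair of data frames. The sketch below uses made-up values. The column names loosely follow the KGML `entry` (id, name, type) and `relation` (entry1, entry2, subtype) elements, but this is a simplification and not the parsed output of KEGGlincs.

```r
# Toy stand-ins for the two KGML 'tables'. All values are made up.
nodes <- data.frame(
  id = c(1, 2, 3),
  name = c("geneA", "geneB", "compoundX"),
  type = c("gene", "gene", "compound"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  entry1 = c(1, 2),
  entry2 = c(2, 3),
  subtype = c("activation", "inhibition"),
  stringsAsFactors = FALSE
)

# Replace node ids with node names to get a readable edge list.
edge_list <- data.frame(
  from = nodes$name[match(edges$entry1, nodes$id)],
  to = nodes$name[match(edges$entry2, nodes$id)],
  relationship = edges$subtype,
  stringsAsFactors = FALSE
)
print(edge_list)
```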
  • 108. R-PACKAGE BASICS • Best practice: Use GitHub to manage package • Update, check, and sync to GitHub directly from RStudio • Documentation required for: functions, data, package • Useful packages: devtools and roxygen2
  • 109. *SPECIAL* BIOCONDUCTOR ISSUES • Data management • Data package may need to be developed if working with large data sets • Interoperation with other R packages • Does Bioconductor support your package’s dependencies? • Documentation • Requires detailed vignette • Code format specifications • Ex: Indent = 4 spaces, line limit = 80 chars • Publication ‘deadlines’ • New Bioconductor versions released in October and April
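The code format specifications can be spot-checked before submission. Below is a rough base-R sketch that flags lines over 80 characters, lines containing tabs, and indentation that is not a multiple of 4 spaces. `style_report()` is a hypothetical helper written for this illustration, not part of BiocCheck, which performs the official checks.

```r
# Flag lines longer than max_width characters, lines containing tabs, and
# lines whose leading indentation is not a multiple of `indent` spaces.
# style_report() is a hypothetical helper, not part of BiocCheck.
style_report <- function(lines, max_width = 80, indent = 4) {
  leading <- attr(regexpr("^ *", lines), "match.length")
  data.frame(
    line = seq_along(lines),
    too_long = nchar(lines) > max_width,
    has_tab = grepl("\t", lines, fixed = TRUE),
    bad_indent = leading %% indent != 0,
    stringsAsFactors = FALSE
  )
}

example_code <- c(
  "f <- function(x) {",
  "    y <- x + 1",
  "  y",
  "}"
)
report <- style_report(example_code)

# Show only the lines with at least one problem.
print(report[report$too_long | report$has_tab | report$bad_indent, ])
```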
  • 111. SUBMITTING TO BIOCONDUCTOR • Open a new issue at the Bioconductor Contributions repository (GitHub)
  • 113. SUBMITTING TO BIOCONDUCTOR • Open a new issue at the Bioconductor Contributions repository (GitHub) • Set package version to 0.99.0 in DESCRIPTION file • A ‘real person’ is assigned to your package • Tests package across 3 platforms (Windows, macOS, Linux) • Answers questions from developer • Evaluates fit with Bioconductor (??) • Update package based on reviewers’ comments until acceptance • Bump version number “z” by 1 in DESCRIPTION (i.e., 0.99.1) and push changes to GitHub
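The version bump is usually a one-line manual edit, but it can also be scripted. A minimal sketch in base R, assuming an x.y.z version string; `bump_z()` and the throwaway DESCRIPTION file are hypothetical.

```r
# Bump the last component ("z" in x.y.z) of a version string by one.
bump_z <- function(version) {
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  parts[length(parts)] <- parts[length(parts)] + 1L
  paste(parts, collapse = ".")
}

# Apply it to the Version field of a throwaway DESCRIPTION file.
desc_path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(
  data.frame(Package = "mypkg", Version = "0.99.0",
             stringsAsFactors = FALSE),
  file = desc_path
)

desc <- read.dcf(desc_path)
desc[1, "Version"] <- bump_z(desc[1, "Version"])
write.dcf(desc, file = desc_path)

new_version <- unname(read.dcf(desc_path)[1, "Version"])
print(new_version)
```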
  • 114. MAINTAINING BIOCONDUCTOR PACKAGES • Currently handled via git-svn sync • Switching over to git – maintain from RStudio session (?) • Nightly build checks • Updates to package need to pass checks