R Packages Unpacked

R PACKAGES UNPACKED
R-PACKAGE USE, DEVELOPMENT, AND A
FORAY INTO BIOCONDUCTOR PACKAGES
SHANA WHITE
PHD CANDIDATE: BIOSTATISTICS + BIG DATA TRACK
PRE-DOCTORAL FELLOW: MECEH
BIOSTATISTICS SEMINAR: SEPTEMBER 1, 2017

OVERVIEW
• Motivating example: KEGGlincs
• Background info (graphs, biological networks)
• Challenge – edge-focused pathway annotation
• R Packages
• What [are they exactly]?
• How [do you navigate them]?
• Where [do you find or publish them]?

NETWORK GRAPHS & DATA: QUICK PRIMER
• Network/Network Graph =
• Nodes (entities)
• People, cities, genes
• Edges (relationships between nodes)
• Similar heritage (people), distance (cities),
interaction (genes)
• Size/shape/color of nodes & edges
formatted to summarize important
characteristics, for example…

Directionality
From → To, Source → Target

• Similar heritage (people), road connection
(cities), interaction (genes)
Strength of Relationship
Edge weight represents edge-related
variable
Ex) Number of relatives in common, distance

Nature of Relationship
Edge color represents edge-
related variable
Ex) activation or repression (genes)

Node Attribute
Node color represents node-related-variable
Ex) male or female (people), high expression or low
expression (genes)

Node Attribute
Node size represents node-related-variable
Ex) number of residents (cities), graph attribute
(connectedness)

NETWORK GRAPH PRIMER CNTD.: DATA
• Graph image visually summarizes data
• Data objects [as matrices] that ‘encode’ the topology can be used in analyses
• Ex: Find subnetworks, measure differences in topology between networks, highlight important nodes
• Visualization & analysis of graph objects

Adjacency Matrix (rows = Target, columns = Source)
     A  B  C  D  F  G
A    0  1  1  1  0  0
B    1  0  0  0  0  0
C    1  0  0  1  1  1
D    1  0  1  0  0  0
F    0  0  1  0  0  0
G    0  0  1  0  0  0

Edge Information (EDGE DATA)
source  target  weight  color
A       B       8       red
A       C       6       blue
A       D       3       red
C       F       5       blue
C       D       9       red
C       G       1       blue
D       E       4       red
D       G       9       blue
B       H       2       red
B       I       3       blue
B       J       5       red
I       J       7       blue
A       H       7       red
H       E       1       blue
E       K       8       red

Node Information (NODE DATA)
node  weight  color
A     5       orange
B     2       green
C     7       orange
D     3       green
F     2       green
G     2       orange
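
As a rough illustration of how an edge table can be turned into an adjacency matrix, here is a minimal base-R sketch. It uses only the six edges among nodes A, B, C, D, F and G from the edge table and treats them as undirected, which is an assumption made so that the result matches the symmetric matrix on the slide.

```r
# Edges among nodes A-G, copied from the edge table (treated as undirected here)
edges <- data.frame(
  source = c("A", "A", "A", "C", "C", "C"),
  target = c("B", "C", "D", "F", "D", "G"),
  weight = c(8, 6, 3, 5, 9, 1),
  stringsAsFactors = FALSE
)
nodes <- c("A", "B", "C", "D", "F", "G")

# Start from an all-zero matrix with one row and one column per node
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))

# Mark each edge in both directions (character pairs index by dimnames)
adj[cbind(edges$source, edges$target)] <- 1L
adj[cbind(edges$target, edges$source)] <- 1L

print(adj)

# Node degree (number of neighbors) is one simple way to flag important nodes
degree <- rowSums(adj)
print(degree)
```

Row sums of the matrix give each node's degree, one of the simplest topology summaries mentioned in the bullets above.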

BIOLOGICAL NETWORKS AND BIOINFORMATICS
MOLECULAR NETWORKS IN THE AGE OF ‘OMICS’
The ‘big [omics] data’
sets essential to
bioinformatics
research are typically
generated/analyzed at
the molecular level
Awesome image from: http://www.bioregulatory-systems-medicine.com/en/brsm-model/autoregulation-of-biological-networks

Signaling pathway image:- http://openi.nlm.nih.gov/detailedresult.php?img=2993235_mplantssq046f07_4c&req=4
Molecules + How they interact → Nodes + Edges → Network

Biological [Molecular] Network
Nodes = Genes, proteins, chemical compounds, drugs
Edges = Relationships/Interactions between molecular
entities
Signaling Pathway: Directed Edges
Molecular networks are constructs to filter the ‘signal from
the noise’
Any given network is really a subnetwork of the entire
system
(i.e. one cellular process among many happening
simultaneously)
Goal of biological network analysis - “Distill the signal from
the noise” by combining ‘omics’ data and pathway topology to
EX) Gene Signaling
Pathway

KEGG: POPULAR REPOSITORY FOR BIOLOGICAL
PATHWAYS
• Over 300 [signaling]
pathways
• Graph image summarizes
data from multiple
sources
• Data objects [as matrices]
that ‘encode’ the topology
can be used in analyses
• *After parsing from
“KGML” (KEGG mark-up
language) file
• Overlaying data on
pathway nodes for
analysis and visualization
has been addressed both
by KEGG and other
Nodes: Genes/Proteins
Edges: Relationship between
genes/proteins

MOTIVATING EXAMPLE: KEGGLINCS
• Primary Challenge [handed down to me from Dr. Medvedovic]:
• Map gene-gene relationship data generated from the LINCS L1000 data
set to the edges of KEGG pathways

LINCS L1000 KNOCK-OUT DATA
Cancer Cell Lines (ex: MCF7, PC3, HA1E)
x
9,000+ Gene Perturbations
http://www.lincsproject.org/LINCS/tools
LINCS = Library of Integrated Network-Based Cellular Signatures

L1000 DATA COLLECTION (BRIEFLY!)
Unperturbed Samples
Functional Knock-Out (KO) Samples
1000 ‘Landmark’ Genes
Gene Signature: Top 100 UP/DOWN-regulated genes
Gene perturbation via shRNA: disrupt conversion of mRNA into functional protein
Changes in cellular regulation in absence of functional gene
Generate gene-expression data

LINCS L1000 KNOCK-OUT DATA
Cancer Cell Lines (ex: MCF7, PC3, HA1E) x 9,000+ Gene Perturbations = LINCS L1000 KO DATA

Cell Line   Perturbation   Signature
c_1         p_1            s_{1,1}
c_2         p_1            s_{2,1}
⋮           p_1            ⋮
c_{n-1}     p_1            s_{n-1,1}
c_n         p_1            s_{n,1}
c_1         p_2            s_{1,2}
c_2         p_2            s_{2,2}
⋮           p_2            ⋮
c_{n-1}     p_2            s_{n-1,2}
c_n         p_2            s_{n,2}
⋮           ⋮              ⋮
c_1         p_{m-1}        s_{1,m-1}
c_2         p_{m-1}        s_{2,m-1}
⋮           p_{m-1}        ⋮
c_{n-1}     p_{m-1}        s_{n-1,m-1}
c_n         p_{m-1}        s_{n,m-1}
c_1         p_m            s_{1,m}
c_2         p_m            s_{2,m}
⋮           p_m            ⋮
c_{n-1}     p_m            s_{n-1,m}
c_n         p_m            s_{n,m}

LINCS L1000 OVERLAP DATA
• Within a given cell line, how concordant are changes in regulation when different genes are knocked out?

KO_1   KO_2   UP_in_Common   DOWN_in_Common   DOWN_UP   UP_DOWN
MTOR   AKT1   25             17               2         3

Confusion Matrix
25 (UP in common: concordant)    2 (DOWN_UP: discordant)
3 (UP_DOWN: discordant)          17 (DOWN in common: concordant)

Use modified Fisher’s test to calculate summary score and p-value
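
To make the scoring step concrete, the sketch below runs a standard Fisher's exact test on the MTOR/AKT1 confusion matrix using base R's fisher.test. The slides describe a modified Fisher's test whose details are not given, so this unmodified version is only a stand-in.

```r
# Confusion matrix for the MTOR / AKT1 knock-out pair
# (matrix() fills column by column, so the rows are 25, 2 and 3, 17)
overlap <- matrix(c(25, 3, 2, 17), nrow = 2)
print(overlap)

# Standard (unmodified) Fisher's exact test; the slides use a modified version
result <- fisher.test(overlap)

# Conditional estimate of the odds ratio and the two-sided p-value
print(result$estimate)
print(result$p.value)
```

An odds ratio above 1 indicates that the two knock-outs tend to move genes in the same direction (concordance).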

EXAMPLE OF EDGE DATA TO BE MAPPED TO
KEGG
gene 1   gene 2    score          p-value
DEPTOR   MTOR      -0.285755049   0.398477502
IL1RAP   MTOR       0.964732247   0.577905732
MTOR     IGBP1      2.123934037   0.000045456
MTOR     RPS6KB1    0.161554125   0.625360024
MTOR     RPS6KB1   -0.192375372   0.289930272
MTOR     RPS6KB2    0.732379001   0.448532973
PIK3R2   MTOR       0.701104025   0.543121475
RRAGC    MTOR       0.109984056   0.488447135
MTOR     SLC7A5    -0.814937768   0.974542498

• Primary Challenge [handed down from my research mentor]:
• Map gene-gene relationship data generated from the LINCS L1000 data
set to the edges of KEGG pathways
• Generate data tables for analysis and conditional formatting of pathway
edges
• Not so straightforward…
• Images from KEGG website are abbreviated…

In the KGML, each node has a unique entry ID
(numeric) but may be associated with more
than one gene
Ex) Node 12 (labeled “Akt”) –> AKT1, AKT2,
AKT3
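
The idea of expanding one node-level edge into several gene-level edges can be sketched in base R as follows. Node 12 and its three genes come from the slide; the second node (ID 20, gene MTOR), the edge between them and the score values are hypothetical and only serve the illustration. This is not the KEGGlincs implementation.

```r
# Node ID -> gene symbols; node 12 is from the slide, node 20 is hypothetical
node_genes <- list(
  "12" = c("AKT1", "AKT2", "AKT3"),
  "20" = c("MTOR")
)

# One node-level edge as it might appear after parsing KGML (hypothetical)
node_edges <- data.frame(from = "12", to = "20", stringsAsFactors = FALSE)

# Expand each node-level edge into all gene1/gene2 combinations
gene_edges <- do.call(rbind, lapply(seq_len(nrow(node_edges)), function(i) {
  expand.grid(
    gene1 = node_genes[[node_edges$from[i]]],
    gene2 = node_genes[[node_edges$to[i]]],
    stringsAsFactors = FALSE
  )
}))

# Hypothetical edge data (made-up values) to overlay on the expanded edges
edge_data <- data.frame(gene1 = "AKT1", gene2 = "MTOR",
                        score = 1.5, pval = 0.01,
                        stringsAsFactors = FALSE)

# Left join: keep every expanded edge, attach data where it exists
mapped <- merge(gene_edges, edge_data, by = c("gene1", "gene2"), all.x = TRUE)
print(mapped)
```

Edges without matching data keep NA in the score columns, so they can still be drawn but left unformatted.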

• Primary Challenge [handed down from my research mentor]:
• Map gene-gene strength-of-relationship data generated from the LINCS
L1000 data set to the edges of KEGG pathways
• Not so straightforward…
• Images from KEGG website are abbreviated…
• …and digital representation may not be complete

GRAPH OF EDGES ON KEGG WEBSITE

GRAPH OF EDGES DOCUMENTED IN KGML
Edges ‘disappeared’
from KGML [and
hence pathway] after
pathway was updated

MOTIVATING EXAMPLE (CNTD)
• Preliminary Goal: develop a software platform that streamlines
edge-focused KEGG pathway visualization and
analysis
• “Help filter signal from noise of data set with large n”

Streamlined KEGG pathway visualization
• Add in edges with documentation in other KEGG pathways
• Expand edges
• Visualize L1000 data overlaid on edges
• Generate data structures to reflect altered topology for use in analysis

MOTIVATING EXAMPLE (CNTD)
• Secondary goal: make platform easy to share with any
researcher interested in L1000 data and/or KEGG
pathway analysis
• Ultimate solution: Produce an R [Bioconductor] package –
“KEGGlincs”

SCHEMATIC OF HOW MY PACKAGE WORKS
(WHICH “NODE” DO YOU THINK IS THE MOST
ACTIVE?)

WHY AN R PACKAGE?
• “R is an integrated suite of software facilities for data manipulation, calculation and graphical display” [1]
• “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”
• R is extremely relevant in our field
• Accomplish secondary goal: Publish package in bioinformatics-focused software repository to easily share platform with other researchers
• (Personally) Leap over that last hurdle…
• …while R is intended for those without a computer science background, it certainly does not seem like it at times!
[1] https://www.r-project.org/about.html

Hurdles & Milestones When Working With R
• Using functions
• Troubleshooting error messages
• Managing variables in your working directory
• Writing your own functions
• Building your package library
• Working with R in a project environment
• Producing publishable results with R
• Making an interactive app with R Shiny
• Developing an R package

FIRST THINGS FIRST…WHAT IS R?
• “R is an open source programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.”
• R is OPEN SOURCE (FREE!)
• Encourages sharing of ideas, learning from others
• R has become software tool of choice for data scientists
• Especially relevant to the field of bioinformatics
• “Out of the box” R has the capabilities of a [very nice] graphing
calculator
• “Running a command” refers to the processing of lines of code

EXAMPLE OF R WORKING ENVIRONMENT
• “x”, “y” and “model” are data-associated variables
• “plot”, “lm” and “summary” are functions
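
The screenshot from this slide is not reproduced here, so the following is a small stand-in session using the same variable and function names. The simulated data and the seed are assumptions made for illustration.

```r
# Simulated data (assumption: the original slide's data are not available)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Fit a simple linear regression and store it in the variable "model"
model <- lm(y ~ x)

# Print coefficient estimates, standard errors and fit statistics
print(summary(model))

# Draw the scatterplot with the fitted line when running interactively
if (interactive()) {
  plot(x, y)
  abline(model)
}
```

After running these lines, "x", "y" and "model" appear as variables in the working environment, while "lm", "summary" and "plot" are the functions applied to them.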

Example of KEGGlincs “master function”:
KEGG_lincs("hsa04115")
KEGG_lincs("hsa04115", "PC3")
With the right functions, R can communicate with many other working environments and accommodate other programming languages!

R FUNCTIONS
• R functions are chunks of code (written in R coding language)
• Allow for reproducible data manipulation, analysis and
visualization.
• The R functions “plot”, “lm” and “summary” come from the R packages graphics, stats, and base, respectively (all considered ‘base packages’)
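
A short sketch of both ideas: writing a function of your own, and asking R which package an existing function lives in. The function name center_values is hypothetical. Note that the home of "plot" depends on the R version, so it is left out of the check.

```r
# A user-defined function (hypothetical name): subtract the mean from a vector
center_values <- function(v) {
  v - mean(v)
}
print(center_values(c(1, 2, 3)))

# Ask R which package namespace a function was defined in
print(environmentName(environment(lm)))       # "stats"
print(environmentName(environment(summary)))  # "base"
```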

WHAT IS AN R PACKAGE?....
• “Packages are the fundamental units of reproducible R code.”
• “They include reusable R functions, the documentation that describes how to use them,
and sample data.”
• Types of R packages:
• “Base/system packages” – contain ‘essential’ features and are pre-installed in R’s
library
• Ex) “datasets”, “graphics”, “utils”
• “Add-on” packages can:
• Exist locally – early stages of development, for proprietary use
• Be installed from an open-source package repository
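
As a hedged illustration of the two package types, the lines below list the pre-installed base packages and show, as comments only, how add-on packages are typically installed. The install commands are commented out because they need network access; BiocManager is the current Bioconductor installer, which differs from the one in use when these slides were written.

```r
# List the base/system packages that ship with this R installation
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Installing add-on packages (not run here; requires internet access):
# install.packages("devtools")          # from CRAN
# install.packages("BiocManager")       # installer for Bioconductor packages
# BiocManager::install("KEGGlincs")     # from Bioconductor
```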

THE R-PACKAGE-REPO ‘SPECIFICITY’
HIERARCHY
• GitHub: version-control platform for packages written in many different programming languages
• R-Forge: “A central platform for the development of R packages, R-related software and further projects”
• CRAN (The Comprehensive R Archive Network): main repository for R packages
• Bioconductor: repository for bioinformatics-based R packages

THE R-PACKAGE-REPO ‘SPECIFICITY’
HIERARCHY
The Comprehensive R Archive Network
• Must make it past review process to be “published” in repository
• Main benefit of publishing a package: streamlined installation for other users
What are the basic documents contained in a published/publishable R package?

Package Folder → R Folder → .R Files → Functions [and their documentation*] written in R code
Package Folder → man Folder → .Rd File → Code used to generate standard manual
Package Folder → NAMESPACE
Package Folder → DESCRIPTION
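
One way to see this folder layout on disk is base R's utils::package.skeleton, which writes a starter package from existing objects. The package name demoPkg and the function add_one are hypothetical, and the package is written to a temporary directory so nothing in the working directory is touched.

```r
# A hypothetical function to package
add_one <- function(x) {
  x + 1
}

# Write the skeleton into a fresh temporary directory
pkg_parent <- tempfile("pkgdemo")
dir.create(pkg_parent)
package.skeleton(name = "demoPkg", list = "add_one",
                 environment = environment(), path = pkg_parent,
                 force = TRUE)

# Inspect the generated layout (R/, man/, DESCRIPTION, NAMESPACE, ...)
pkg_dir <- file.path(pkg_parent, "demoPkg")
print(list.files(pkg_dir, recursive = TRUE))
```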

• “They include reusable R functions, the documentation that describes how to use them*, and sample data.”
*Packages such as roxygen2 and devtools facilitate package documentation and management
EX) Running devtools::document() automatically generates .Rd files as specified by the documentation for each function in the R file
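
For readers who have not seen roxygen2-style documentation, here is a minimal sketch of a documented function as it might appear in a .R file. The function standardize is a hypothetical example; the special comments are ordinary R comments, so the file still runs as plain R.

```r
#' Standardize a numeric vector
#'
#' @param v A numeric vector with at least two distinct values.
#' @return A numeric vector with mean 0 and standard deviation 1.
#' @examples
#' standardize(c(1, 2, 3))
#' @export
standardize <- function(v) {
  (v - mean(v)) / sd(v)
}

print(standardize(c(1, 2, 3)))

# From the package's root folder, this would turn the comments above into
# man/standardize.Rd (not run here; requires devtools and roxygen2):
# devtools::document()
```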

Package Folder → data folder → .rda file
• “data” folder/files needed if:
• A specific data set is intrinsic to the function’s utility (Case 1: data set called within function)
• Examples [in documentation] require specific data objects to run successfully (Case 2: data set required to successfully run examples)
• If the data file needed by a package is very large (GB vs MB), a separate data…
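
A minimal sketch of how a data object ends up as an .rda file in a package's data folder, using only base R. The folder demoDataPkg is a hypothetical package directory created under a temporary path, and the two rows are copied from the edge-data example earlier in the slides. In practice a helper such as usethis::use_data() is often used for the same step.

```r
# Small example data set (two rows from the edge-data table shown earlier)
edge_scores <- data.frame(
  gene1 = c("DEPTOR", "MTOR"),
  gene2 = c("MTOR", "IGBP1"),
  score = c(-0.285755049, 2.123934037),
  stringsAsFactors = FALSE
)

# Hypothetical package directory with a data/ folder
pkg_dir <- file.path(tempfile("datademo"), "demoDataPkg")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE)

# Save the object as data/edge_scores.rda
rda_file <- file.path(pkg_dir, "data", "edge_scores.rda")
save(edge_scores, file = rda_file)

# Reload into a clean environment to confirm the round trip
check_env <- new.env()
load(rda_file, envir = check_env)
print(ls(check_env))
```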

WHAT IS A BIOCONDUCTOR PACKAGE?
• Bioconductor packages must meet requirements for R
packages and be aligned with goals set by Bioconductor
• “Provide widespread access to a broad range of powerful statistical and
graphical methods for the analysis of genomic data.
• Facilitate the inclusion of biological metadata in the analysis of genomic data
• Provide a common software platform that enables the rapid development
and deployment of plug-able, scalable, and interoperable software.
• Train researchers on computational and statistical methods for the analysis of
genomic data.
• Further scientific understanding by producing high-quality **documentation and
reproducible research**.”

[TECHNICAL] ADDITIONAL REQUIREMENTS
FOR BIOCONDUCTOR PACKAGES
• Package files must meet standards set by Bioconductor (pass BiocCheck)
• Ex) line lengths, indentation sizes, etc.
• Must include a vignette that includes detailed documentation of example usage
Package Folder → vignettes Folder → .Rmd file → Code used to generate vignette
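
To give a feel for the kind of formatting rules involved, the sketch below flags long lines, tab characters and indentation that is not a multiple of four spaces. It is a hand-written illustration and not how BiocCheck itself works; the function name check_style and the default thresholds are assumptions based on the limits quoted in the supplemental slides.

```r
# Flag basic formatting issues in a character vector of code lines
check_style <- function(lines, max_width = 80L, indent = 4L) {
  # Number of leading spaces on each line
  leading <- nchar(sub("^( *).*$", "\\1", lines))
  list(
    too_long   = which(nchar(lines) > max_width),
    bad_indent = which(leading %% indent != 0L),
    uses_tabs  = which(grepl("\t", lines, fixed = TRUE))
  )
}

# Example input: line 3 is indented by 2 spaces, line 4 is 81 characters long
code_lines <- c(
  "f <- function(x) {",
  "    x + 1",
  "  y <- 2",
  strrep("a", 81),
  "}"
)
issues <- check_style(code_lines)
print(issues)
```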

MANUAL VS. PDF
Vignette:
• “Provides a task-oriented description of package functionality”
• Bioconductor packages must have at least one vignette
Manual:
• Contains entries for each function contained in the package
• Automatically

PACKAGE → PUBLISHED BIOCONDUCTOR PACKAGE
• Preliminary steps:
• Manage package via GitHub (details in supplemental slides)
• If your package requires a large data set (> 5MB), create a separate data
package
• Check that package works on 3 main operating systems (Windows,
macOS, Linux)
• Submit to Bioconductor
• Check release schedule – new Bioconductor version every 6 months
(October & April)
• Package review process takes approx. 2-5 weeks
• After package is accepted:
• Package maintenance via GitHub Bioconductor-mirror
• Check build/check report after introducing changes

KEGGLINCS: LOCATIONS AND
DOCUMENTATIONS

Bioconductor Landing Page

As an installed package, information from the manual can be retrieved by prefacing a package/function with “?” and pressing enter
**I strongly recommend RStudio, especially for those new to R**
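
A few equivalent ways of reaching the manual from the console, shown with base R's "lm" so the lines run without extra packages. The KEGGlincs lines are commented out because they assume that package is installed.

```r
# help("lm") is the function form of typing ?lm at the console;
# printing the result opens the manual page for lm
lm_help <- help("lm")

# Overview of a whole package: its description and an index of help topics
stats_info <- help(package = "stats")

# With KEGGlincs installed, the same pattern applies (not run here):
# ?KEGG_lincs
# help(package = "KEGGlincs")
# browseVignettes("KEGGlincs")   # opens the package vignette(s) in a browser
```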

Reference Manual
GitHub Bioconductor-mirror

VIGNETTE OF *DETAILED* WORKFLOWS

POSTER: PRESENTED AT CONFERENCES AND
PUBLISHED VIA F1000 RESEARCH (NON-PEER
REVIEWED)
The pathway maps provided by KEGG
summarize information across genes to
single nodes or even groups of nodes. While
this may be convenient for basic
visualization, it is limiting in terms of
understanding the exact relationships that
are documented in the databases used by
KEGG. The primary functions of KEGGlincs
download up-to-date information from
KEGG, parse/reconfigure it, and then re-map
the updated information via the interactive
software Cytoscape. Although more
complex, the updated map features
represent the exact relationships mapped in
KEGG, have more explicit definitions of edge
subtypes, and show clear mark-up of
grouped nodes (see Figures 1a/1b).
TOP: Original pathway available from the
KEGG website
BOTTOM: Pathway rendered in Cytoscape
with expanded edges and formatting using a
single command within R: KEGG_lincs("hsa04068")
Edge Color Key
Red: Activation or Expression *
Orange: Activating PTM **
Green: PTM (no activation/inhibition activity defined)
Blue: Inhibition
Purple: Inhibiting PTM
Black(solid): Binding/Association
Black(dashed): Indirect effect (no activation/inhibition activity defined)
*Any dashed colored line indicates that the effect is indirect
**PTM = post-translational modification or, as KEGG defines them,
‘molecular events’. The specific types of PTMs (indicated by edge label)
include:
+p: phosphorylation
-p: dephosphorylation
+g: glycosylation
+u: ubiquitination
+m: methylation
KEGGlincs Design and Application:
An R Package for Exploring Relationships in Biological Pathways
Shana White¹, Mario Medvedovic¹
¹Laboratory for Statistical Genomics, Department of Environmental Health, Division of Biostatistics and Bioinformatics, University of Cincinnati College of Medicine, 3223 Eden Ave. ML 56, Cincinnati OH 45267-0056, USA
Overview of Data Integration & Visualization
a. KEGG information in the form of KGML
(modified EXtensible Markup Language or
XML format) files.
b. Optional feature: Add edge data; the
companion data package KOdata contains
both baseline expression levels for individual
genes and consensus genomic signatures
(CGSs) for gene knock-outs corresponding to
cell-line-specific LINCS L1000 data.
c. KGML files are parsed, edges are expanded
and user data is incorporated as mapping
features using the Bioconductor package
KEGGlincs with R software.
d. Data is seamlessly transported from the R
environment to graphing software via
functionality developed by the RESTful API
cyREST.
e. Maps are ultimately visualized with the
interactive graphing software Cytoscape.
BD2K-LINCS Data Coordination and Integration Center
The LINCS L1000 dataset maintained by the Broad Institute is a collection of gene expression
results/consensus genomic signatures (CGSs) particular to a given [typically cancer] cell type that arise
when single genes are knocked out or cells are perturbed with a single drug/chemical in controlled
experiments. For this example, gene-gene and gene-drug relationships in the ErbB Signaling pathway are
compared across the MCF7 (breast cancer) and A375 (melanoma) cell-lines.
Abstract
The Library of Integrated Network-based Cellular Signatures (LINCS) project is a data generation venture that is a
quintessential example of current efforts concerning ‘big data’ in the biomedical research environment.
One element of this project is the production of gene expression profiles corresponding to individual gene
knock-outs within specific cancer cell lines. The R package ‘KEGGlincs’ and the companion data package
‘KOdata’, both recently published/updated with the latest version of Bioconductor (3.4/3.5), were
developed to promote synergy between existing pathway structures from the Kyoto Encyclopedia of Genes
and Genomes (KEGG) and LINCS data in order to reveal mechanisms of biochemical signaling processes that
display heterogeneity across different types of cells.
KEGG pathways are manually-curated biological pathways represented as networks of nodes (genes) and
directed edges whereby experimental evidence determines the nature and direction of an edge
(relationship) between genes. The network structure for pathways that KEGG provides is a promising tool
for bioinformatics research, and indeed there are existing methods for quantifying the level of pathway
perturbation [within an experiment] that make use of KEGG pathways. However, the existing approaches
consider changes of gene expression only of the genes in a particular pathway and not changes in
expression of downstream targets. This restricts the definition of perturbation to mean change in gene
expression rather than a much broader, and perhaps more meaningful, change in gene function.
The LINCS data available from the KOdata resource along with the functionality offered by KEGGlincs allow
for the investigation of relationships between genes in a given pathway in a cell-type-specific manner via
analysis of overlapping de-regulated genes corresponding to pairs of experimental knockouts. This
approach to pathway analysis yields quantitative measures and a novel method for annotating
relationships (edges) between genes programmatically created in R and automatically visualized in an
interactive session via Cytoscape software.
Example Application:
Visualizing Cancer Cell-Line-Specific Relationships in the
ErbB Signaling Pathway with LINCS L1000 data
Introduction :‘Edge-Focused’ Interactive Pathway Visualization
References
Kanehisa M and Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 2000; 28: 27-30 .
Ono K, et al. CyREST: Turbocharging Cytoscape Access for External Tools via a RESTful API. F1000Research 2015; 4 :478.
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 2003 Nov; 13(11):2498-504.
White S (2016). KEGGlincs: Visualize all edges within a KEGG pathway and overlay LINCS data [option]. R package version 1.0.0.
Acknowledgements
This work has been funded by the NIH grant U54HL127624 and the Molecular Epidemiology in Children’s Environmental Health training grant T32-ES10957.
Edge colors represent the following possible combinations of direction of Fisher’s Exact Test summary scores (a modified Odds Ratio (OR) score; either positive (+) or negative (-)) and their corresponding [adjusted] p-values:
Red: OR(+), pval(sig)
Orange: OR(+), pval(non-sig)
Purple: OR(-), pval(non-sig)
Blue: OR(-), pval(sig)
The width of each edge is mapped to represent the magnitude of the OR (thicker edge indicates larger OR relative to other edges with data).
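
The color and width rules above can be sketched as a small R function. This is an illustration of the stated rules and not the KEGGlincs code: the function name edge_color, the 0.05 significance cutoff and the width range of 1 to 5 are assumptions, and the three input rows are taken from the edge-data table shown earlier in the slides (which lists unadjusted p-values).

```r
# Map score sign and significance to the edge colors listed above
edge_color <- function(score, pval, alpha = 0.05) {
  significant <- pval < alpha
  ifelse(score > 0,
         ifelse(significant, "red", "orange"),
         ifelse(significant, "blue", "purple"))
}

# Three edges from the example edge-data table
edges <- data.frame(
  gene1 = c("DEPTOR", "IL1RAP", "MTOR"),
  gene2 = c("MTOR", "MTOR", "IGBP1"),
  score = c(-0.285755049, 0.964732247, 2.123934037),
  pval  = c(0.398477502, 0.577905732, 0.000045456),
  stringsAsFactors = FALSE
)

edges$color <- edge_color(edges$score, edges$pval)

# Width grows with |score| relative to the largest |score| (range 1 to 5)
edges$width <- 1 + 4 * abs(edges$score) / max(abs(edges$score))

print(edges)
```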
Figure 1a: ErbB Pathway + knock-out data for MCF7
Figure 1b: ErbB Pathway + knock-out data for A375
Figure 2: ErbB Pathway + knock-out data to compare MCF7 and A375 cell-lines
Figure 3: ErbB Pathway + drug data to compare MCF7 and A375 cell-lines
Figures 1a and 1b display the edge-mapping offered by the KEGG_lincs function:
KEGG_lincs("hsa04012", "MCF7")
KEGG_lincs("hsa04012", "A375")
These figures suggest that many of the gene-
gene relationships annotated in the KGML for
the ErbB pathway show concordance as
determined by L1000 data for both cell lines.
Alternatively, users may wish to generate a map
that will highlight the differences between two
cell-lines in one summarized picture; Figure 2
displays this utility. This type of map is
generated using the development function
KL_compare:
KL_compare("hsa04012", "MCF7", "A375")
For convenient viewing, edges that are shared
between two nodes get grouped together if they
fall into the same category as follows:
Orange: Higher concordance in first cell-line
Green: Higher concordance in second cell-line
The width of the arrow represents the effect size
(or average effect size, if shared) of differences
in edge concordance [as measured by a test
statistic to compare ORs]; the shade of the edge
roughly indicates how many edges share that
location/category (note that these are easily
recovered in an interactive Cytoscape session).
Figure 3 has been generated to showcase this
package’s utility in terms of incorporating drug
perturbation signatures from the L1000 data set
into pathway maps. The development function
KL_drug returns known drug-target
relationships specific to the pathway of choice;
these are not usually included in the original
pathway KGML. These drug-gene relationships
are added as edges and are quantified in a
manner identical to that for gene-gene edges.
The edge colors highlight differences as follows:
Yellow: Higher concordance in first cell-line
Blue: Higher concordance in second cell-line
A quick interpretation of this map suggests drugs
that affect the ErbB pathway target BRAF in
A375 cells more so than in MCF7 cells and that
Vemurafenib is perhaps the most effective. On
the other hand, ErbB is more of a direct target
for the MCF7 cell-line compared to A375.

IN-THE-WORKS: SUBMIT KEGGLINCS AS
APPLICATION NOTE IN JOURNAL
BIOINFORMATICS

Multiple sources of information can be confusing when using a package

DOCUMENTATIONS
• Bioconductor Landing Page: info ‘hub’
• GitHub Bioconductor-mirror: how functions are coded
• Reference Manual: how to use individual functions; indexed
• Vignette: intended use for package
• Poster Publication: package summary
• Bioinformatics Application Note: package summary; technical notes

QUICK NOTE: R/BIOCONDUCTOR
TROUBLESHOOTING
• Somewhat paradoxically, getting a package to successfully
install can be a roadblock to using a package
• Check that you have the most recent version of R and Bioconductor
before installing new Bioconductor packages.
• Some packages depend on other packages that must be installed before
they can successfully install
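
Both checks can be done from the console before installing anything. In this sketch the minimum R version "3.4.0" and the package names in deps are placeholders chosen for illustration; notARealPackage123 is deliberately a name that should not exist.

```r
# Which version of R is running, and does it meet a required minimum?
print(R.version.string)
meets_minimum <- getRversion() >= "3.4.0"   # placeholder minimum version
print(meets_minimum)

# Which of a package's dependencies are not yet installed?
deps <- c("stats", "utils", "notARealPackage123")   # placeholder list
is_installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
not_installed <- names(is_installed)[!is_installed]
print(not_installed)

# With BiocManager installed, the Bioconductor version in use is reported by:
# BiocManager::version()
```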

CONCLUDING REMARKS
• Get to know R!
• Use R in projects to manage data, perform analyses and produce graphs
• Use R to enhance your learning in classes
• Ex) Translate algorithms from Design and Analysis of Algorithms to R code
• Don’t give up…with Google and determination you will figure it out
• Even if you are working with R code you do not intend to
publish, having your code organized as a package makes it
easier to manage
• Creating/Publishing a package is only the first step!!
• So your package is out there…now what?

QUESTIONS??
• “There are plenty of people out there that go to school for eight
years”
• “Yeah…they’re called doctors”
• Tommy Boy

SUPPLEMENTAL SLIDES
Introduction to Package Development and
Management

R-PACKAGE OVERVIEW
• Premise: You have written a useful function and have it saved
as a “.R” file
• The function works in conjunction with data, saved as a “.RData” file
• You have decided to take the leap and formally declare to yourself and
the world that you will endeavor to turn your hunk of code into an R
package
• Use RStudio to set up a package-related project
• Link project to GitHub

OVERVIEW OF KGML:
INFORMATION IN KEGG MARK-UP
LANGUAGE
KGML format
• Stores information for pathway structures and relationships among them
• XML document
• 2 main ‘tables’
NODES: Map Features, Chemical Compounds, Genes, Grouping Information
EDGES: Relationships between Nodes
R-PACKAGE BASICS
• Best practice: Use GitHub to manage package
• Update, check, and sync to GitHub directly from RStudio
• Documentation required for: functions, data, package
• Useful packages: devtools and roxygen2

*SPECIAL* BIOCONDUCTOR ISSUES
• Data management
• Data package may need to be developed if working with large data sets
• Interoperation with other R packages
• Does Bioconductor support your package’s dependencies?
• Documentation
• Requires detailed vignette
• Code format specifications
• Ex: Indent = 4 spaces, line limit = 80 chars
• Publication ‘deadlines’
• New Bioconductor versions released in October and April

SUBMITTING TO BIOCONDUCTOR
• Open a new issue at the Bioconductor Contributions repository
(GitHub)
• Set package version to 0.99.0 in DESCRIPTION file
• A ‘real person’ is assigned to your package
• Tests package across 3 platforms (Windows, macOS, Linux)
• Answers questions from developer
• Evaluates fit with Bioconductor (??)
• Update package based on reviewers’ comments until acceptance
• Bump version number “z” by 1 in DESCRIPTION (i.e., 0.99.1) and push
changes to GitHub
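
The version bump is a one-line edit by hand, but it can also be scripted. The sketch below uses base R's read.dcf and write.dcf on a throwaway DESCRIPTION file in a temporary directory; the package name demoPkg and the helper bump_z are hypothetical.

```r
# Create a minimal DESCRIPTION file to work on (temporary location)
desc_file <- file.path(tempfile("descdemo"), "DESCRIPTION")
dir.create(dirname(desc_file))
write.dcf(data.frame(Package = "demoPkg", Version = "0.99.0",
                     stringsAsFactors = FALSE),
          file = desc_file)

# Increase the third component ("z" in x.y.z) of a version string by 1
bump_z <- function(version) {
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  parts[3] <- parts[3] + 1L
  paste(parts, collapse = ".")
}

# Read the file, bump the Version field, and write it back
desc <- read.dcf(desc_file)
desc[1, "Version"] <- bump_z(desc[1, "Version"])
write.dcf(desc, file = desc_file)

print(read.dcf(desc_file))
```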

MAINTAINING BIOCONDUCTOR PACKAGES
• Currently handled via git-svn sync
• Switching over to git – maintain from RStudio session (?)
• Nightly build checks
• Updates to package need to pass checks
