2019 03-11 bio it-world west genepattern notebook slides

Michael Reich
UCSD School of Medicine
Jill P. Mesirov Lab
March 11, 2019

Jill P. Mesirov Lab
• New statistical and machine
learning techniques for the
genomic analysis of cancer and
other diseases
• Gene Set Enrichment
Analysis (GSEA)
• Software and reproducible
research environments
• Integrative Genomics Viewer
(IGV)
• GenePattern

The Notebook Environment Explosion
• Jupyter
• Beaker
• Zeppelin
• CoCalc (SageMath)
• MS Azure Notebooks
• IBM PixieDust
• Google CoLaboratory
• Google Cloud DataLab

Published Jupyter Notebooks
Books or other large collections
• Introductory Tutorials
• Programming and Computer Science
• Statistics, Machine Learning and Data
Science
• Mathematics, Physics, Chemistry,
Biology
• Earth Science and Geo-Spatial data
• Linguistics and Text Mining
• Signal Processing
• Engineering Education
Scientific computing and data analysis
• General topics in scientific computing
• Social data
• Psychology and Neuroscience
• Machine Learning, Statistics and Probability
• Physics, Chemistry and Biology
• Economics and Finance
• Earth science and geo-spatial data
• Data visualization and plotting
• Mathematics
• Signal, Sound and Image Processing
• Natural Language Processing
https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks

Notebook Benefits
• Combines scientific narrative with executable code
• No space limitation
• Include process as well as results
• Provides a reproducible* record of analysis
• Inception-to-publication environment for science
• Highly-detailed tutorials

Notebook environments require
programming expertise
• Python, R, or other supported languages
• Analysis packages: scikit-learn, etc.
• Plotting packages: matplotlib, plotly, etc.

Programming paradigm for text formatting
Markdown
LaTeX

Many tools exist in other formats

Many bioinformatics analyses are
compute-intensive

What is needed for bioinformatics notebooks
• Access to the wealth of existing analysis tools
• Seamless flow of datasets between cells
• Simplified user interface to code
• Biologist-friendly text formatting tools
• Easy publication and dissemination of new work
and adoption of others’ work

The GenePattern Notebook Environment
for collaborative and reproducible in silico research
• Shareable, executable notebook documents
• Support in silico research from inception through publication
• Seamless integration of GenePattern with Jupyter notebook
• No programming required
• Access to compute resources
www.genepattern-notebook.org
Reich, M., Tabor, T., Liefeld, T., Thorvaldsdóttir, H., Hill, B., Tamayo, P., & Mesirov, J. P. (2017).
The GenePattern Notebook Environment. Cell systems, 5(2), 149-151.

Access to analysis tools: GenePattern
• Module Repository: hundreds of analysis and visualization tools (GSEA, MuTect, HISAT2, …)
• Module Integrator: easy addition of new tools
• Analysis Engine: record/replay analyses, versioning of methods, web service access
• Clients (Notebook, Web, Programming): access for all users
www.genepattern.org

GenePattern wraps software tools in an accessible visible format
Standard “command-line” method for running analysis:
> java -Djava.awt.headless=true -Dwin=cluster.exe -Dmac=clusterMac -Dlinux=clusterLinux -Dlinux64=clusterLinux64 -cp hcl.jar/legacy-gp-modules.jar/ant.jar org.genepattern.modules.hcl.RunCluster -f input.filename log.transform row.center row.normalize column.center column.normalize -u output.base.name -e column.distance.measure -g row.distance.measure -m clustering.method
Corresponding GenePattern visual representation
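As a rough illustration of what wrapping involves, the Python sketch below maps named form parameters onto a command line like the one above. The flag letters (-f, -u, -e, -g, -m) and the class name come from the slide; the helper name `build_hcl_command`, the example values, and the omission of the -D properties, classpath, and positional options are simplifying assumptions, not GenePattern's actual implementation.

```python
import shlex

# Flag letters as they appear in the command line on the slide.
FLAGS = {
    "input.filename": "-f",
    "output.base.name": "-u",
    "column.distance.measure": "-e",
    "row.distance.measure": "-g",
    "clustering.method": "-m",
}

def build_hcl_command(params):
    """Turn a dict of named parameters into an argument list (hypothetical helper)."""
    args = ["java", "-Djava.awt.headless=true",
            "org.genepattern.modules.hcl.RunCluster"]
    for name, flag in FLAGS.items():
        if name in params:
            args += [flag, str(params[name])]
    return args

# Placeholder values, chosen only for illustration.
example = build_hcl_command({
    "input.filename": "all_aml.gct",
    "output.base.name": "all_aml.hcl",
    "clustering.method": "m",
})
print(shlex.join(example))
```

A form-based interface only has to collect the dictionary; the user never sees the flags.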

>250 GenePattern Modules, 3/2019
Data-specific analysis methods
• Gene expression (RNA-seq/microarray)
• DNA sequence variation/copy number inference
• Regulatory networks
• Proteomics
• Flow cytometry
• RNAi
Machine learning methods
• Clustering (hierarchical, k-means, SOM, …)
• Classification (SVM, k-nearest neighbor, CART…)
• Dimension reduction (NMF, PCA, …)
Utilities
• Preprocessing
• Data import
• Format conversion
• etc.

Access to multiple compute resources
GenePattern users can connect to multiple servers, each with its own pipelines, modules, and jobs:
• IU HPC GenePattern Server
• Local GP Server
• GenePattern Server on Amazon Cloud

GenePattern Notebook features
GenePattern login cell
Input user credentials
Access any GenePattern server
Connect through GenePattern API
Module browser window
View, search, and select available modules and
pipelines
GenePattern analysis cell
Upload datasets and enter parameters
Submit jobs to GenePattern server
View job status
Link to result files and share results

Seamless Data Flow within notebooks
Send a result to the input of another notebook cell
result_file.gct

GenePattern Notebook Data Flow
Send a result to the input of another notebook cell

Input a result from the output of another notebook cell
Input file:

Receive an output file from another notebook cell

Seamless data flow
Read a GenePattern result file into Python
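A minimal sketch of what reading a result file into Python can involve, assuming the result is a GCT file such as `result_file.gct`: a `#1.2` version line, a line giving the row and column counts, a header row, then one tab-separated row per feature. The function name `parse_gct` and the sample data are made up for illustration; the notebook environment's own helpers may work differently.

```python
import csv
import io

def parse_gct(handle):
    """Parse GCT text into (sample_names, {row_name: [floats]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError("not a GCT 1.2 file: %r" % version)
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # the first two columns are Name and Description
    data = {}
    for row in reader:
        if row:  # skip blank trailing lines
            data[row[0]] = [float(v) for v in row[2:]]
    if len(data) != n_rows or len(samples) != n_cols:
        raise ValueError("dimensions do not match the header line")
    return samples, data

# Tiny made-up dataset in GCT layout.
text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tkinase\t1.0\t2.5\t3.0\n"
    "GENE_B\treceptor\t0.5\t0.0\t4.0\n"
)
samples, data = parse_gct(io.StringIO(text))
print(samples, data)
```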

Seamless data flow
Send a result file to a dataframe
• Pandas is a popular Python library for working with data.
• Certain GenePattern file formats can be automatically imported into a pandas
Dataframe by selecting “Send to Dataframe” in the menu.
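Done by hand with pandas, loading a GCT-formatted result might look like the sketch below. It assumes the GCT layout (two preamble lines, then a tab-separated table whose first two columns are Name and Description); `gct_to_dataframe` is a hypothetical helper name, not part of GenePattern, and the data are invented.

```python
import io
import pandas as pd

def gct_to_dataframe(source):
    """Load a GCT file (path or file-like object) into a DataFrame
    indexed by the Name and Description columns."""
    return pd.read_csv(source, sep="\t", skiprows=2, index_col=[0, 1])

# Tiny made-up dataset in GCT layout.
text = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tS1\tS2\n"
    "GENE_A\tkinase\t1.0\t2.5\n"
    "GENE_B\treceptor\t0.5\t0.0\n"
)
df = gct_to_dataframe(io.StringIO(text))
print(df.mean(axis=1))  # per-gene mean across samples
```

Once the data are in a DataFrame, the usual pandas operations (filtering, grouping, plotting) apply directly.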

User-friendly UI to Python functions

Biologist-friendly text formatting tools:
Rich-Text Editor

Excerpts from a GenePattern Notebook:
• Interactive visualizations
• GenePattern analysis cells allow point-and-click execution
• WYSIWYG text formatting
• Full Jupyter Notebook toolbar

SVM notebooks
• Scikit-learn: ~80 lines of code
• GenePattern: 1 cell, 0 lines of code
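For a sense of the scikit-learn side of this comparison, a heavily condensed sketch of training and scoring an SVM follows. It uses synthetic data from `make_classification` instead of a real expression dataset and is not the notebook shown on the slide; a real analysis would typically also need data loading, preprocessing, and plotting code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for an expression matrix.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a linear-kernel SVM and score it on the held-out samples.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```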

GenePattern Python Library
import gp
# Create a GenePattern server proxy instance
gpserver = gp.GPServer('http://localhost:8080/gp','myusername', 'mypassword')
# Obtain GPTask by module name
module = gp.GPTask(gpserver, "PreprocessDataset")
# Load module parameter data
module.param_load()
# Create a job specification
job_spec = module.make_job_spec()
# Upload a file to the server
uploaded_file = gpserver.upload_file("file_name", "/path/to/the/file/file_name")
job_spec.set_parameter("input.filename", uploaded_file.get_url())
# Submit the job to the GenePattern server
job = gpserver.run_job(job_spec)
• Complete programmatic access
• Automatic integration with GenePattern cell data
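Once a job has been submitted, its results can be collected. The sketch below assumes the job object exposes `wait_until_done()` and `get_output_files()`, and that each output file exposes `get_name()` and `get_url()`; these names are believed to match the GenePattern Python library but should be checked against the installed version. `collect_outputs` is a hypothetical helper, and the stub classes exist only so the example runs without a server.

```python
def collect_outputs(job):
    """Block until the job finishes, then map output file names to URLs."""
    job.wait_until_done()
    return {f.get_name(): f.get_url() for f in job.get_output_files()}

# Stand-ins so the sketch runs offline. With a real server, pass the
# object returned by gpserver.run_job(job_spec) instead.
class StubFile:
    def __init__(self, url):
        self.url = url

    def get_url(self):
        return self.url

    def get_name(self):
        return self.url.rsplit("/", 1)[-1]

class StubJob:
    def wait_until_done(self):
        pass  # a real job would poll the server here

    def get_output_files(self):
        return [StubFile("http://localhost:8080/gp/jobResults/1/out.gct")]

outputs = collect_outputs(StubJob())
print(outputs)
```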

GenePattern Notebook Repository
• Create, execute, and share
GenePattern notebooks
• Zero-install usage of
GenePattern Notebook
• Python mathematical, statistical,
machine learning, and graphics
packages
• Save and adapt shared
notebooks
• Runs in Amazon cloud utilizing
scalable compute
• Open-source, available on
GitHub
www.genepattern-notebook.org

Public Notebook Library
• Featured notebooks created
by GenePattern team and
collaborators
• Templates for common
bioinformatics analysis
scenarios
• Community-contributed
published notebooks
• Tutorial notebooks teaching
basic principles of the
environment
• Adaptable to your own
research
• Search and tag functions

Collaborative notebook development

Publishing and disseminating notebooks

Publishing and disseminating notebooks
https://notebook.genepattern.org/services/sharing/notebooks/321/preview/

Cloud Architecture
• Web browser
• GenePattern Notebook Repository and Workspace: JupyterHub Server on an EC2 instance, with Python packages and GenePattern extensions
• Notebook kernels: Docker swarm running on an EC2 autoscale group
• EFS: notebooks and user data
• GenePattern Server: EC2 instance
• AWS Batch: EC2 compute nodes running analysis modules
• S3: data transfer to compute nodes

Local GenePattern Notebooks
Docker image
• genepattern-notebook on Docker Hub
Python
• pip install genepattern-notebook
• conda install genepattern-notebook

In development
• Release of in-depth analysis notebooks
– CoGAPS latent space analysis (Elana Fertig Lab)
– GSEA with information-theoretic metrics (Pablo Tamayo lab)
– REVEALER identifies the context of genomic alterations that drive
tumor formation (Pablo Tamayo lab)
– AMARETTO multi-omics analysis to identify cancer driver genes and
mutations (Nathalie Pochet, Olivier Gevaert lab)
• Incorporation of additional tools
– IGV.js
– NDEx Network Data Exchange
– Etc.
• Real-time collaboration

GenePattern Community
• First public release in 2004
• Current version: 3.9.10
• >65,000 registered users
• Public server runs 2000-5000
analyses/week
• Open source – BSD-style license
• https://github.com/genepattern
• Community-contributed methods
o CRISPR analysis
o Bisulfite sequencing
o Flow cytometry
o RNAi screens
o etc.
Web site:
www.genepattern.org
Community
repository:
www.gparc.org
Public server:
cloud.genepattern.org

Acknowledgements
Mesirov Lab
Peter Carr
Lukas Chavez
David Eby
Barbara Hill
Edwin Juarez
Forrest Kim
Ted Liefeld
Clarence Mah
Jim Robinson
Thorin Tabor
Helga Thorvaldsdóttir
Douglass Turner
Jill P. Mesirov, PI
Collaborators
Pablo Tamayo Lab, UCSD
Fernando Perez, Lawrence
Berkeley Lab
