Michael Reich
UCSD School of Medicine
Jill P. Mesirov Lab
March 11, 2019
Jill P. Mesirov Lab
• New statistical and machine learning techniques for the genomic analysis of cancer and other diseases
• Gene Set Enrichment Analysis (GSEA)
• Software and reproducible research environments
• Integrative Genomics Viewer (IGV)
• GenePattern
Baker, Nature 2016
The Notebook Environment Explosion
• Jupyter
• Beaker
• Zeppelin
• CoCalc (SageMath)
• MS Azure Notebooks
• IBM PixieDust
• Google CoLaboratory
• Google Cloud DataLab
Published Jupyter Notebooks
Books or other large collections
• Introductory Tutorials
• Programming and Computer Science
• Statistics, Machine Learning and Data Science
• Mathematics, Physics, Chemistry, Biology
• Earth Science and Geo-Spatial data
• Linguistics and Text Mining
• Signal Processing
• Engineering Education
Scientific computing and data analysis
• General topics in scientific computing
• Social data
• Psychology and Neuroscience
• Machine Learning, Statistics and Probability
• Physics, Chemistry and Biology
• Economics and Finance
• Earth science and geo-spatial data
• Data visualization and plotting
• Mathematics
• Signal, Sound and Image Processing
• Natural Language Processing
https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
Reproducible Research
Notebook Benefits
• Combines scientific narrative with executable code
• No space limitation
• Include process as well as results
• Provides a reproducible* record of analysis
• Inception-to-publication environment for science
• Highly-detailed tutorials
Notebook environments require programming expertise
• Python, R, or other supported languages
• Analysis packages: scikit-learn, etc.
• Plotting packages: matplotlib, plotly, etc.
Programming paradigm for text formatting
• Markdown
• LaTeX
Many tools exist in other formats
Many bioinformatics analyses are compute-intensive
What is needed for bioinformatics notebooks
• Access to the wealth of existing analysis tools
• Seamless flow of datasets between cells
• Simplified user interface to code
• Biologist-friendly text formatting tools
• Easy publication and dissemination of new work and adoption of others’ work
The GenePattern Notebook Environment
for collaborative and reproducible in silico research
• Shareable, executable notebook documents
• Support in silico research from inception through publication
• Seamless integration of GenePattern with Jupyter notebook
• No programming required
• Access to compute resources
www.genepattern-notebook.org
Reich, M., Tabor, T., Liefeld, T., Thorvaldsdóttir, H., Hill, B., Tamayo, P., & Mesirov, J. P. (2017). The GenePattern Notebook Environment. Cell Systems, 5(2), 149-151.
Access to analysis tools: GenePattern
• Module Integrator: easy addition of new tools
• Analysis Engine: record/replay analyses, versioning of methods, web service access
• Clients (programming, web, notebook): access for all users
• Module Repository: hundreds of analysis and visualization tools (e.g., GSEA, MuTect, HISAT2)
www.genepattern.org
> java -Djava.awt.headless=true -Dwin=cluster.exe -Dmac=clusterMac
    -Dlinux=clusterLinux -Dlinux64=clusterLinux64
    -cp hcl.jar/legacy-gp-modules.jar/ant.jar
    org.genepattern.modules.hcl.RunCluster
    -f input.filename log.transform row.center row.normalize
    column.center column.normalize -u output.base.name
    -e column.distance.measure -g row.distance.measure
    -m clustering.method
Corresponding GenePattern visual representation
Standard “command-line” method for running analysis
GenePattern wraps software tools in an accessible visible format
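The wrapping idea on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not GenePattern's implementation: the `HCL_TEMPLATE` dictionary, the `build_command` function, and the sample values are hypothetical. Only the flag letters and parameter names are taken from the command shown on the slide, and the positional options (log.transform, row.center, and so on) are left out for brevity.

```python
import shlex

# Hypothetical description of a wrapped tool: a base command plus a mapping
# from user-facing parameter names to the command-line flags they fill.
HCL_TEMPLATE = {
    "command": ["java", "-cp", "hcl.jar",
                "org.genepattern.modules.hcl.RunCluster"],
    "flags": {
        "input.filename": "-f",
        "output.base.name": "-u",
        "column.distance.measure": "-e",
        "row.distance.measure": "-g",
        "clustering.method": "-m",
    },
}


def build_command(template, values):
    """Turn form values into an argument list, in the template's flag order."""
    missing = [name for name in template["flags"] if name not in values]
    if missing:
        raise ValueError(f"missing parameters: {', '.join(missing)}")
    argv = list(template["command"])
    for name, flag in template["flags"].items():
        argv += [flag, str(values[name])]
    return argv


# Values a user might type into a form; all of them are made up.
form_values = {
    "input.filename": "all aml.gct",
    "output.base.name": "clustered",
    "column.distance.measure": 2,
    "row.distance.measure": 0,
    "clustering.method": "m",
}

argv = build_command(HCL_TEMPLATE, form_values)
print(shlex.join(argv))  # shell-quoted form of the assembled command
```

Keeping the arguments as a list until the last step avoids quoting problems with file names that contain spaces.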
>250 GenePattern Modules, 3/2019
Data-specific analysis methods
• Gene expression (RNA-seq/microarray)
• DNA sequence variation/copy number inference
• Regulatory networks
• Proteomics
• Flow cytometry
• RNAi
Machine learning methods
• Clustering (hierarchical, k-means, SOM, …)
• Classification (SVM, k-nearest neighbor, CART, …)
• Dimension reduction (NMF, PCA, …)
Utilities
• Preprocessing
• Data import
• Format conversion
• etc.
Access to multiple compute resources
• IU HPC GenePattern Server: pipelines, modules, jobs
• Local GP Server: pipelines, modules, jobs
• GenePattern Server on the Amazon Cloud: pipelines, modules, jobs
(Diagram: GenePattern users connect to these servers.)
GenePattern Notebook features
GenePattern login cell
• Input user credentials
• Access any GenePattern server
• Connect through GenePattern API
Module browser window
• View, search, and select available modules and pipelines
GenePattern analysis cell
• Upload datasets and enter parameters
• Submit jobs to GenePattern server
• View job status
• Link to result files and share results
Seamless Data Flow within notebooks
Send a result to the input of another notebook cell
result_file.gct
GenePattern Notebook Data Flow
Input a result from the output of another notebook cell
Input file:
Receive an output file from another notebook cell
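One way to picture the "send to" step is as a match between a result file's format and the formats each downstream cell accepts. The sketch below is an assumption about how such matching could work, not a description of the GenePattern Notebook code: the cell names, the accepted formats, the job-result URL, and both helper functions are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical notebook state: each analysis cell lists the file extensions
# its input parameter accepts. Names and formats are illustrative only.
CELL_INPUT_FORMATS = {
    "cell_2_preprocess": {"gct", "res"},
    "cell_3_cluster": {"gct"},
    "cell_4_classify": {"odf"},
}


def file_extension(url):
    """Return the lower-cased extension of the file a URL points to."""
    return PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()


def compatible_cells(result_url, cell_formats):
    """List the cells whose input accepts the result file's format."""
    ext = file_extension(result_url)
    return sorted(name for name, formats in cell_formats.items()
                  if ext in formats)


# Made-up URL of a result file produced by an earlier cell.
result_url = "http://localhost:8080/gp/jobResults/1/result_file.gct"
targets = compatible_cells(result_url, CELL_INPUT_FORMATS)
print(targets)
```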
Seamless data flow
Read a GenePattern result file into Python
Send a result file to a dataframe
• pandas is a popular Python library for working with data.
• Certain GenePattern file formats can be imported automatically into a pandas DataFrame by selecting “Send to Dataframe” in the menu.
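To make the import step concrete, here is a minimal standard-library sketch of reading a GCT file, the format of the result_file.gct example above. It is not the notebook's own importer: the `read_gct` function and the sample data are hypothetical. It relies only on the GCT 1.2 layout (a `#1.2` version line, a line with the row and column counts, a header row starting with Name and Description, then one tab-separated row per gene).

```python
import csv
import io

# Tiny made-up GCT file: 2 genes by 3 samples.
SAMPLE_GCT = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tsample_a\tsample_b\tsample_c\n"
    "GENE1\tfirst gene\t1.0\t2.5\t3.0\n"
    "GENE2\tsecond gene\t0.0\t4.0\t8.5\n"
)


def read_gct(handle):
    """Parse a GCT stream into (sample_names, {row_name: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # skip the Name and Description columns
    rows = {}
    for record in reader:
        if not record:  # ignore blank lines
            continue
        rows[record[0]] = [float(v) for v in record[2:]]
    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("matrix size does not match the declared dimensions")
    return samples, rows


samples, rows = read_gct(io.StringIO(SAMPLE_GCT))
print(samples)
print(rows["GENE2"])
```

With pandas installed, a roughly equivalent DataFrame can be obtained with `pandas.read_csv(path, sep="\t", skiprows=2, index_col=[0, 1])`.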
User-friendly UI to Python functions
Biologist-friendly text formatting tools: Rich-Text Editor
Excerpts from a GenePattern Notebook.
• Interactive visualizations
• GenePattern analysis cells allow point-and-click execution
• WYSIWYG text formatting
• Full Jupyter Notebook toolbar
SVM notebooks
• Scikit-learn: ~80 lines of code
• GenePattern: 1 cell, 0 lines of code
GenePattern Python Library
import gp
# Create a GenePattern server proxy instance
gpserver = gp.GPServer('http://localhost:8080/gp', 'myusername', 'mypassword')
# Obtain GPTask by module name
module = gp.GPTask(gpserver, "PreprocessDataset")
# Load module parameter data
module.param_load()
# Create a job specification
job_spec = module.make_job_spec()
# Upload a file to the server
uploaded_file = gpserver.upload_file("file_name", "/path/to/the/file/file_name")
job_spec.set_parameter("input.filename", uploaded_file.get_url())
# Submit the job to the GenePattern server
job = gpserver.run_job(job_spec)
• Complete programmatic access
• Automatic integration with GenePattern cell data
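After a job is submitted, a script usually has to wait for the server to finish it. The following is a generic polling sketch with a doubling delay, written without any GenePattern calls so it can run on its own: `wait_until_finished` and `FakeJob` are hypothetical names, and the gp library may well provide its own waiting mechanism, which this sketch does not depend on.

```python
import time


def wait_until_finished(is_finished, initial_delay=1.0, max_delay=30.0,
                        timeout=3600.0, sleep=time.sleep):
    """Poll is_finished() with a doubling delay; return the number of polls."""
    delay = initial_delay
    waited = 0.0
    polls = 0
    while True:
        polls += 1
        if is_finished():
            return polls
        if waited >= timeout:
            raise TimeoutError(f"job not finished after {waited:.0f} s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off, up to max_delay


class FakeJob:
    """Stand-in for a remote job that reports done on the fourth check."""

    def __init__(self, checks_needed=4):
        self.checks_left = checks_needed

    def is_finished(self):
        self.checks_left -= 1
        return self.checks_left <= 0


# Record the requested delays instead of actually sleeping.
delays = []
polls = wait_until_finished(FakeJob().is_finished, sleep=delays.append)
print(polls, delays)
```

Passing the sleep function as a parameter keeps the helper testable without real waiting; with a real job object, its status check would be passed in place of `FakeJob().is_finished`.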
GenePattern Notebook Repository
• Create, execute, and share GenePattern notebooks
• Zero-install usage of GenePattern Notebook
• Python mathematical, statistical, machine learning, and graphics packages
• Save and adapt shared notebooks
• Runs in the Amazon cloud, utilizing scalable compute
• Open source, available on GitHub
www.genepattern-notebook.org
Public Notebook Library
• Featured notebooks created by GenePattern team and collaborators
• Templates for common bioinformatics analysis scenarios
• Community-contributed published notebooks
• Tutorial notebooks teaching basic principles of the environment
• Adaptable to your own research
• Search and tag functions
Collaborative notebook development
Publishing and disseminating notebooks
https://notebook.genepattern.org/services/sharing/notebooks/321/preview/
Cloud Architecture
• Web browser
• GenePattern Notebook Repository and Workspace
  – JupyterHub server (EC2 instance)
  – Notebook kernels with Python packages and GenePattern extensions, in a Docker swarm running on an EC2 autoscale group
  – EFS: notebooks and user data
• GenePattern Server (EC2 instance)
  – AWS Batch: EC2 compute nodes running analysis modules
  – S3: data transfer to compute nodes
Local GenePattern Notebooks
Docker image
• genepattern-notebook on Docker Hub
Python
• pip install genepattern-notebook
• conda install genepattern-notebook
In development
• Release of in-depth analysis notebooks
– CoGAPS latent space analysis (Elana Fertig Lab)
– GSEA with information-theoretic metrics (Pablo Tamayo Lab)
– REVEALER identifies the context of genomic alterations that drive tumor formation (Pablo Tamayo Lab)
– AMARETTO multi-omics analysis to identify cancer driver genes and mutations (Nathalie Pochet, Olivier Gevaert Lab)
• Incorporation of additional tools
– IGV.js
– NDEx Network Data Exchange
– etc.
• Real-time collaboration
GenePattern Community
• First public release in 2004
• Current version: 3.9.10
• >65,000 registered users
• Public server runs 2000-5000 analyses/week
• Open source – BSD-style license
• https://github.com/genepattern
• Community-contributed methods
– CRISPR analysis
– Bisulfite sequencing
– Flow cytometry
– RNAi screens
– etc.
Web site: www.genepattern.org
Community repository: www.gparc.org
Public server: cloud.genepattern.org
Acknowledgements
Mesirov Lab
Peter Carr
Lukas Chavez
David Eby
Barbara Hill
Edwin Juarez
Forrest Kim
Ted Liefeld
Clarence Mah
Jim Robinson
Thorin Tabor
Helga Thorvaldsdottir
Douglass Turner
Jill P. Mesirov, PI
Collaborators
Pablo Tamayo Lab, UCSD
Fernando Perez, Lawrence Berkeley Lab

DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 

2019 03-11 bio it-world west genepattern notebook slides

  • 1. Michael Reich UCSD School of Medicine Jill P. Mesirov Lab March 11, 2019
  • 2. Jill P. Mesirov Lab • New statistical and machine learning techniques for the genomic analysis of cancer and other diseases • Gene Set Enrichment Analysis (GSEA) • Software and reproducible research environments • Integrative Genomics Viewer (IGV) • GenePattern
  • 4. Jupyter The Notebook Environment Explosion Beaker Zeppelin CoCalc (SageMath) MS Azure Notebooks IBM PixieDust Google CoLaboratory Google Cloud DataLab Jupyter
  • 5. Published Jupyter Notebooks Books or other large collections • Introductory Tutorials • Programming and Computer Science • Statistics, Machine Learning and Data Science • Mathematics, Physics, Chemistry, Biology • Earth Science and Geo-Spatial data • Linguistics and Text Mining • Signal Processing • Engineering Education Scientific computing and data analysis • General topics in scientific computing • Social data • Psychology and Neuroscience • Machine Learning, Statistics and Probability • Physics, Chemistry and Biology • Economics and Finance • Earth science and geo-spatial data • Data visualization and plotting • Mathematics • Signal, Sound and Image Processing • Natural Language Processing https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
  • 7. Notebook Benefits • Combines scientific narrative with executable code • No space limitation • Include process as well as results • Provides a reproducible* record of analysis • Inception-to-publication environment for science • Highly-detailed tutorials
  • 8. Notebook environments require programming expertise • Python, R, or other supported languages • Analysis packages: scikit-learn, etc. • Plotting packages: matplotlib, plotly, etc.
  • 9. Programming paradigm for text formatting Markdown LaTeX
  • 10. Many tools exist in other formats
  • 11. Many bioinformatics analyses are compute-intensive
  • 12. What is needed for bioinformatics notebooks • Access to the wealth of existing analysis tools • Seamless flow of datasets between cells • Simplified user interface to code • Biologist-friendly text formatting tools • Easy publication and dissemination of new work and adoption of others’ work
  • 13. The GenePattern Notebook Environment for collaborative and reproducible in silico research • Shareable, executable notebook documents • Support in silico research from inception through publication • Seamless integration of GenePattern with Jupyter notebook • No programming required • Access to compute resources www.genepattern-notebook.org Reich, M., Tabor, T., Liefeld, T., Thorvaldsdóttir, H., Hill, B., Tamayo, P., & Mesirov, J. P. (2017). The GenePattern Notebook Environment. Cell Systems, 5(2), 149-151.
  • 14. Access to analysis tools: GenePattern Module Integrator Easy addition of new tools Analysis Engine Record/replay analyses Versioning of methods Web service access Clients Programming Web Access for all users Module Repository GSEA MuTect HISAT2 Hundreds of analysis and visualization tools Notebook www.genepattern.org
  • 15. > java -Djava.awt.headless=true -Dwin=cluster.exe -Dmac=clusterMac -Dlinux=clusterLinux -Dlinux64=clusterLinux64 -cp hcl.jar/legacy-gp-modules.jar/ant.jar org.genepattern.modules.hcl.RunCluster -f input.filename log.transform row.center row.normalize column.center column.normalize -u output.base.name -e column.distance.measure -g row.distance.measure -m clustering.method Corresponding GenePattern visual representation Standard “command-line” method for running analysis GenePattern wraps software tools in an accessible visible format
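To make the wrapping idea on slide 15 concrete, here is a minimal sketch of how named form fields could be mapped onto the flags of that command line. It is an illustration only, not GenePattern's implementation: `build_cluster_command` and `FLAG_FOR_PARAMETER` are hypothetical names, the example values are invented, and the classpath and platform-specific `-D` options from the slide are left out.

```python
import shlex

# Flag letters as they appear in the slide's command line.
# The mapping structure itself is an assumption made for illustration.
FLAG_FOR_PARAMETER = {
    "input.filename": "-f",
    "output.base.name": "-u",
    "column.distance.measure": "-e",
    "row.distance.measure": "-g",
    "clustering.method": "-m",
}


def build_cluster_command(values):
    """Turn a dict of parameter values into an argument list.

    Only parameters present in `values` are emitted, in the order
    they are listed in FLAG_FOR_PARAMETER.
    """
    argv = [
        "java",
        "-Djava.awt.headless=true",
        "org.genepattern.modules.hcl.RunCluster",
    ]
    for name, flag in FLAG_FOR_PARAMETER.items():
        if name in values:
            argv += [flag, str(values[name])]
    return argv


# Hypothetical values a user might type into the form.
example = build_cluster_command({
    "input.filename": "expression data.gct",
    "output.base.name": "clustered",
    "clustering.method": "average",
})
print(shlex.join(example))  # quoted so it could be pasted into a shell
```

The user only ever sees the parameter names; the flag letters and the Java invocation stay hidden behind the form, which is the point the slide makes.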
  • 16. >250 GenePattern Modules, 3/2019 Data-specific analysis methods • Gene expression (RNA-seq/microarray) • DNA sequence variation/copy number inference • Regulatory networks • Proteomics • Flow cytometry • RNAi Machine learning methods • Clustering (hierarchical, k-means, SOM, …) • Classification (SVM, k-nearest neighbor, CART…) • Dimension reduction (NMF, PCA, …) Utilities • Preprocessing • Data import • Format conversion • etc.
  • 17. Access to multiple compute resources IU HPC GenePattern Server Pipelines Modules Jobs Local GP Server Pipelines Modules Jobs GenePattern Server Amazon Cloud Pipelines Modules Jobs GenePattern Users Servers
  • 18. GenePattern Notebook features GenePattern login cell Input user credentials Access any GenePattern server Connect through GenePattern API Module browser window View, search, and select available modules and pipelines GenePattern analysis cell Upload datasets and enter parameters Submit jobs to GenePattern server View job status Link to result files and share results
  • 19. Seamless Data Flow within notebooks Send a result to the input of another notebook cell result_file.gct
  • 20. GenePattern Notebook Data Flow Send a result to the input of another notebook cell
  • 21. GenePattern Notebook Data Flow Input a result from the output of another notebook cell Input file:
  • 22. GenePattern Notebook Data Flow Receive an output file from another notebook cell
  • 23. Seamless data flow Read a GenePattern result file into Python
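As a rough idea of what reading a result file such as `result_file.gct` into Python involves, the sketch below parses the GCT 1.2 text layout by hand: a `#1.2` version line, a line with the row and column counts, a header row starting with `Name` and `Description`, then one tab-separated row per gene. `read_gct` is a hypothetical helper written for this example, not a function of the GenePattern library, and the sample data is invented.

```python
import csv
import io


def read_gct(handle):
    """Parse GCT 1.2 text into (sample_names, {row_name: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    version = next(reader)[0].strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version line: {version!r}")
    n_rows, n_cols = (int(field) for field in next(reader)[:2])
    header = next(reader)
    samples = header[2:]  # skip the Name and Description columns
    rows = {}
    for record in reader:
        if not record:
            continue  # tolerate blank trailing lines
        name, _description, *values = record
        rows[name] = [float(value) for value in values]
    # Check the declared dimensions against what was actually read.
    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("declared dimensions do not match the data")
    return samples, rows


# Invented two-gene, three-sample file for demonstration.
SAMPLE_GCT = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t1.0\t2.5\t3.0\n"
    "GENE_B\tna\t0.5\t0.0\t4.0\n"
)

samples, rows = read_gct(io.StringIO(SAMPLE_GCT))
print(samples)
print(rows["GENE_A"])
```

In the notebook this step is handled for the user; the sketch only shows how little structure there is in the file once it reaches Python.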
  • 24. Seamless data flow Send a result file to a dataframe • Pandas is a popular Python library for working with data. • Certain GenePattern file formats can be automatically imported into a pandas Dataframe by selecting “Send to Dataframe” in the menu.
  • 25. User-friendly UI to Python functions
  • 26. Biologist-friendly text formatting tools: Rich-Text Editor
  • 27. Excerpts from a GenePattern Notebook. Interactive visualizations GenePattern analysis cells allow point-and-click execution WYSIWYG text formatting Full Jupyter Notebook toolbar
  • 28. SVM notebooks Scikit-learn ~80 lines of code GenePattern 1 cell 0 lines of code
  • 29. GenePattern Python Library
        import gp

        # Create a GenePattern server proxy instance
        gpserver = gp.GPServer('http://localhost:8080/gp', 'myusername', 'mypassword')

        # Obtain GPTask by module name
        module = gp.GPTask(gpserver, "PreprocessDataset")

        # Load module parameter data
        module.param_load()

        # Create a job specification
        job_spec = module.make_job_spec()

        # Upload a file to the server
        uploaded_file = gpserver.upload_file("file_name", "/path/to/the/file/file_name")
        job_spec.set_parameter("input.filename", uploaded_file.get_url())

        # Submit the job to the GenePattern server
        job = gpserver.run_job(job_spec)
    • Complete programmatic access • Automatic integration with GenePattern cell data
  • 30. GenePattern Notebook Repository • Create, execute, and share GenePattern notebooks • Zero-install usage of GenePattern Notebook • Python mathematical, statistical, machine learning, and graphics packages • Save and adapt shared notebooks • Runs in Amazon cloud utilizing scalable compute • Open-source, available on GitHub www.genepattern-notebook.org
  • 31. Public Notebook Library • Featured notebooks created by GenePattern team and collaborators • Templates for common bioinformatics analysis scenarios • Community-contributed published notebooks • Tutorial notebooks teaching basic principles of the environment • Adaptable to your own research • Search and tag functions
  • 39. Publishing and disseminating notebooks https://notebook.genepattern.org/services/sharing/notebooks/321/preview/
  • 40. Docker swarm group launched on EC2 autoscale group Web browser GenePattern Server GenePattern Notebook Repository and Workspace S3 data transfer to compute nodes EFS Notebooks and user data AWS Batch Cloud Architecture EC2 instance EC2 Compute node Analysis Module Analysis Module EC2 Compute node Analysis Module Analysis Module EC2 Compute node Analysis Module Analysis Module Docker swarm running on EC2 autoscale group Notebook Kernel Notebook Kernel Notebook Kernel Notebook Kernel EC2 instance JupyterHub Server Python packages GenePattern extensions
  • 41. Local GenePattern Notebooks Docker image • genepattern-notebook on Docker Hub Python • pip install genepattern-notebook • conda install genepattern-notebook
  • 42. In development • Release of in-depth analysis notebooks – CoGAPS latent space analysis (Elana Fertig Lab) – GSEA with information-theoretic metrics (Pablo Tamayo lab) – REVEALER identifies the context of genomic alterations that drive tumor formation (Pablo Tamayo lab) – AMARETTO multi-omics analysis to identify cancer driver genes and mutations (Nathalie Pochet, Olivier Gevaert lab) • Incorporation of additional tools – IGV.js – NDEx Network Data Exchange – Etc. • Real-time collaboration
  • 43. GenePattern Community • First public release in 2004 • Current version: 3.9.10 • >65,000 registered users • Public server runs 2000-5000 analyses/week • Open source – BSD-style license • https://github.com/genepattern • Community-contributed methods o CRISPR analysis o Bisulfite sequencing o Flow cytometry o RNAi screens o etc. Web site: www.genepattern.org Community repository: www.gparc.org Public server: cloud.genepattern.org
  • 44. Acknowledgements Mesirov Lab Peter Carr Lukas Chavez David Eby Barbara Hill Edwin Juarez Forrest Kim Ted Liefeld Clarence Mah Jim Robinson Thorin Tabor Helga Thorvaldsdottir Douglass Turner Jill P. Mesirov, PI Collaborators Pablo Tamayo Lab, UCSD Fernando Perez, Lawrence Berkeley Lab