5. Keiichiro Ono
- Background
  - Bioinformatics
  - Computer Science
- Work
  - Research
    - Bioinformatics workflow
    - Visualization pipeline
  - Data
    - Visualization: Networks, Other Biological Data
    - Integration: Molecular Interactions, Pathways, Annotations
  - Software Development
    - Cytoscape
    - NeXO
    - Cyberinfrastructure
    - All kinds of small tools
- Like
  - Art: Kandinsky, Mondrian
  - Music: Electronica, Techno, Minimal, Detroit, Jazz
  - Sci-fi: Movie, Novel
- Life
  - US: San Diego, San Francisco Bay Area, Los Angeles, Orange County
  - Japan: Gifu, Tokyo
16. Cytoscape
An Open Source Platform for Biological Network Data
Integration, Analysis and Visualization
17. Cytoscape
- Open Source (LGPL)
- Free for both commercial and academic use
- Developed and maintained by universities,
companies, and research institutions
- De facto standard software in the biological network
research community
- Expandable by Apps
36. Export View As Web App
- Open Cytoscape
- Load a sample network (Small ones)
- Apply layout
- File → Export → Network Views as Web Page…
- Open in browser
- python -m SimpleHTTPServer 8000
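`SimpleHTTPServer` is the Python 2 module name; on Python 3 the equivalent command is `python3 -m http.server 8000`. The same step can also be sketched as a small script. This is a minimal sketch assuming Python 3.7 or later and that the exported files sit in a local folder; the function name `serve_directory` is illustrative and not part of Cytoscape.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP on localhost in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Port 0 asks the operating system for any free port.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling `server = serve_directory("path/to/exported/files")` starts the server, `server.server_address[1]` gives the port to open in the browser, and `server.shutdown()` stops it.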
39. Chart Editor
- Visualize multiple data points
in a single view
- Time series data
- Multiple GO terms
- Chart types: Bar, Box, Pie,
Heat Map, Ring
- Part of standard Visual Style
Editor
- Everything is saved
in session files
45. Part I Summary
- Cytoscape 3.2 includes new features for advanced
network visualization
- More integration with Cytoscape.js
- Build prototype web-based visualizations in
Cytoscape
- v3.3
- Not finalized yet… Feature preview in summer
51. Problems in Bioinformatics
- No more free lunch
- Even if you buy expensive machines, you no longer get performance gains
for free. You have to design your code for massively distributed
environments. (From Scale-up to Scale-out)
- Complex Data Analysis Pipeline
- Need to build pipelines by connecting multiple resources or services
- Need for complex, customized data visualization
- Reproducibility
➡ But building, deploying, and maintaining a reproducible pipeline is not
straightforward
57. Why Is Reproducibility Important?
- Saving your time
- Reusability
- Make your sponsors (funding agencies) happy
58. We (the NIH) Are Working On, But As
Yet Do Not Have Good Answers To:
1. Today, how much are we actually
spending on data and software related
activities?
2. How much should we be spending to
achieve the maximum benefit to
biomedical science relative to what we
spend in other areas?
Biomedical Research as an Open Digital Enterprise by Philip E. Bourne Ph.D.
Associate Director for Data Science (NIH)
59. Reproducibility
- Most of the 27 Institutes and Centers of the NIH are
currently reviewing the ability to reproduce research
they are funding
- The NIH recently convened a meeting with publishers
to discuss the issue – a set of guiding principles
arose
Biomedical Research as an Open Digital Enterprise by Philip E. Bourne Ph.D.
Associate Director for Data Science (NIH)
65. Problems
- Complex Dependency
- OS-dependent code
- Network of dependencies to run popular library X
- Library version numbers
- It-worked-on-my-machine syndrome
- Installation Hell
66. Software Distribution Problem
- “It-worked-on-my-machine” syndrome
- This is a serious problem especially when
you want to share your workflow with
collaborators.
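One small step against the "it-worked-on-my-machine" problem is to record the environment next to the results. The following is a minimal sketch, assuming Python 3.8 or later; the function name `environment_snapshot` and the package list are illustrative only.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Collect OS, Python, and package versions for the given package names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed on this machine
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


# The package list is an example; use whatever your workflow imports.
print(json.dumps(environment_snapshot(["numpy", "scipy", "pandas"]), indent=2))
```

Saving this output with a notebook lets collaborators compare their setup with yours before they try to rerun the workflow.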
99. Git/GitHub For Sharing Code/Notebooks
- Git - Distributed Source Code Management
System
- GitHub - (Public) Remote repository + great user
interface for working with OSS code
100. Forking
- Create a new repository from an existing one
- Complete copy of the original + your full access
- Pull Request
106. [Diagram: Bare Metal Machine → OS (Linux) → Docker → five containers, each holding its own Frameworks and Application]
108. What is Docker?
- Container to run applications in an isolated
environment
- Application = Layer of images
- Sharable Environments
- Environments as code
125. Run Options
- -p: Publish port
- -p 80:8080 - Publish container’s port 8080 on host port 80
- -v: Mount local volume
- -v $PWD:/myapp - Mount the current working
directory at the container’s /myapp directory
126. Run Options
- --name: Name the instance
- --name graph-analysis
- -d: Run in background (as daemon)
127. Example
docker run -p 80:3000 -d -v $PWD:/webapp/contents node
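When the same container has to be started from a script or a notebook, the run options above can be assembled programmatically. This is a minimal sketch, not an official Docker API: the helper `docker_run_command` and its parameter names are made up for illustration, and the host path is a placeholder. It only builds the argument list; it does not start anything.

```python
import shlex


def docker_run_command(image, ports=None, volumes=None, name=None, detach=False):
    """Build an argument list for `docker run`.

    ports   -- dict of host port -> container port (the -p flag)
    volumes -- dict of host path -> container path (the -v flag)
    """
    args = ["docker", "run"]
    if detach:
        args.append("-d")
    if name:
        args += ["--name", name]
    for host_port, container_port in (ports or {}).items():
        args += ["-p", "%s:%s" % (host_port, container_port)]
    for host_path, container_path in (volumes or {}).items():
        args += ["-v", "%s:%s" % (host_path, container_path)]
    args.append(image)
    return args


cmd = docker_run_command(
    "node",
    ports={80: 3000},
    volumes={"/home/foo/site": "/webapp/contents"},  # hypothetical host path
    detach=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where Docker is installed.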
129. Run Docker Image
- Publish port 80
- Run in background
- Mount forked repository to /notebooks
- Add environment variables
- Password
130. Actual Command to Run Our Image (one-line)
docker run -d -v $PWD:/notebooks -p 80:8888 -e "PASSWORD=yourpass" -e "USE_HTTP=1" idekerlab/vizbi-2015
Current directory should be under your home directory
(e.g. /Users/foo/Documents/vizbi-2015)
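The `-e` flags pass settings into the container as environment variables. How the image consumes them is not shown in these slides, so the sketch below is only an assumption of what a start-up script inside such a container could look like. The variable names `PASSWORD` and `USE_HTTP` come from the command above; the function `read_settings` and the meaning given to `USE_HTTP=1` are hypothetical.

```python
import os


def read_settings(environ=os.environ):
    """Read notebook-server settings from environment variables."""
    password = environ.get("PASSWORD")
    if not password:
        raise ValueError('PASSWORD must be set, e.g. -e "PASSWORD=yourpass"')
    # Assumed convention: USE_HTTP=1 turns off HTTPS; anything else keeps it on.
    use_http = environ.get("USE_HTTP", "0") == "1"
    return {"password": password, "scheme": "http" if use_http else "https"}


settings = read_settings({"PASSWORD": "yourpass", "USE_HTTP": "1"})
print(settings["scheme"])
```

Keeping configuration in the environment rather than in the image means the same image can be shared publicly while each user supplies their own password.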
143. Access Notebook Server
Running in a Docker Container
- We will use an extended version of the official ipython/SciPy
server image
- By default, it uses a secure connection (HTTPS)
- /notebooks is the root directory of notebooks
- Mount the local file system to share notebooks
between the container and your laptop
149. User Type I
- Average computing skills
- Use Excel as their primary
workbench for data analysis
- For them, bioinformatics
means using some of
NCBI/EBI web tools or
DAVID
- Have tons of data not
analyzed / visualized yet
- Excel is my friend.
150. User Type II
- Advanced computing skills
- Use Python + SciPy /
NumPy, R +
Bioconductor, or
MATLAB every day
- If necessary, write their
own packages
- Use HPC technologies a lot
- Manual operation is evil.
151. Both of them are Important!
- Type I: “Bench Biologists”
- Domain experts
- Data producers
- Type II: Computational Biologists
- Experts of large-scale data analysis
- Especially important for genome-scale
data analysis
They were ignored for a long
time in the Cytoscape world…
153. Requests from Type II Users
- I have 200 networks in my session and I need to create
one PDF per view. How can I do it with Cytoscape?
- I need to use igraph for network analysis, but its
visualization feature is limited. I want to use Cytoscape
as an external visualization engine for R.
- Usually I use IPython Notebook to record my work.
How can I integrate Cytoscape into my workflow?
- I want to generate Style for each time point and create
small multiples of networks.
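Requests like these usually start with getting a network out of the analysis environment in a form Cytoscape can read. One option is the Cytoscape.js JSON layout (a `data` object plus `elements` with `nodes` and `edges`). This is a minimal sketch, assuming the network is available as a plain list of (source, target) pairs; the function name `to_cytoscape_js` is illustrative.

```python
import json


def to_cytoscape_js(name, edges):
    """Convert (source, target) pairs to a Cytoscape.js-style network dict."""
    node_ids = sorted({node for edge in edges for node in edge})
    return {
        "data": {"name": name},
        "elements": {
            "nodes": [{"data": {"id": node_id}} for node_id in node_ids],
            "edges": [
                {"data": {"source": source, "target": target}}
                for source, target in edges
            ],
        },
    }


network = to_cytoscape_js("toy", [("a", "b"), ("b", "c")])
print(json.dumps(network, indent=2))
```

An edge list exported from igraph or NetworkX can be fed to the same function, which keeps the conversion independent of any one analysis library.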
155. What is cyREST?
- Platform-independent, RESTful API
module for Cytoscape
- Means you can access basic
Cytoscape data objects
programmatically
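In practice, "programmatic access" means sending plain HTTP requests to a running Cytoscape. The sketch below only builds such requests with the Python standard library. The base URL `http://localhost:1234/v1` and the `networks` resource are assumptions based on common cyREST setups and are not stated in these slides; check the cyREST documentation for your version. The helper names `cyrest_request` and `send` are illustrative.

```python
import json
import urllib.request

# Assumption: cyREST listening on its usual local address; adjust if yours differs.
BASE_URL = "http://localhost:1234/v1"


def cyrest_request(method, path, payload=None, base_url=BASE_URL):
    """Build (but do not send) an HTTP request for a cyREST resource."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def send(request):
    """Send the request to a running Cytoscape and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


list_networks = cyrest_request("GET", "networks")
print(list_networks.get_method(), list_networks.full_url)
```

With Cytoscape and cyREST running, `send(list_networks)` would perform the call. Because the interface is plain HTTP, the same request can be issued from R, curl, or a web browser.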
156. Interactive Data Analysis Environments
[Diagram: components connected through a Data Bus (Internet)]
- Your Workstation
  - Cytoscape Desktop: Core, Apps, REST
  - IPython Notebook: NumPy, SciPy, Pandas, NetworkX
  - RStudio: igraph, RCurl
  - Command Line Tools: sed, awk, grep, curl
  - Web Browsers
- External Computing Resources: Graph Layout, Statistical Analysis, Data Pre-processing
- In-House Databases
- Data Repository & Collaboration Service
  - File / Code Hosting Services
  - Public Data Repository
- PSICQUIC Services
- EBI RDF Platform
- Other Bioinformatics Web Applications / Services
- Cytoscape App Store
169. 2014
- Cytoscape 3.2.0: (Modularized) Java Application
- Client applications are migrating to the web browsers
- “Pure” desktop applications are dying slowly…
- Even desktop applications depend on external services
- JavaScript everywhere
- Cloud Computing
- Scale-out over scale-up
170. Trend in Software Design
- An application is a collection of smaller services
- JavaScript is a first-class citizen in the world of
programming languages
- Design application with cloud services in mind
172. In the modern era, software is commonly delivered as a
service: called web apps, or software-as-a-service. The twelve-factor
app is a methodology for building software-as-a-service apps that:
• Use declarative formats for setup automation, to minimize time and
cost for new developers joining the project
• Have a clean contract with the underlying operating system, offering
maximum portability between execution environments
• Are suitable for deployment on modern cloud platforms, obviating
the need for servers and systems administration
• Minimize divergence between development and production,
enabling continuous deployment for maximum agility
• And can scale up without significant changes to tooling,
architecture, or development practices.
174. This MANIFESTO counters
current trends in
bioinformatics where
institutes and companies
are creating monolithic
software solutions aimed
mostly at end-users.