- Keiichiro Ono (大野 圭一朗)
- University of California, San Diego, School of Medicine
- Trey Ideker Lab
- Software Engineer / Research Associate
- National Resource for Network Biology (NRNB)
- nrnb.org
Keiichiro Ono
Background
Bioinformatics
Computer Science
Work
Research
Bioinformatics workflow
Visualization pipeline
Data
Visualization
Networks
Other Biological Data
Integration
Molecular Interactions
Pathways
Annotations
Software Development
Cytoscape
NeXO
Cyberinfrastructure
All kinds of small tools
Like
Art
Kandinsky
Mondrian
Music
Electronica
Techno
Minimal
Detroit
Jazz
Sci-fi
Movie
Novel
Life
US
San Diego
San Francisco Bay Area
Los Angeles
Orange County
Japan
Gifu
Tokyo
Chart Editor
- Visualize multiple data points in a single view
- Time series data
- Multiple GO terms
- Chart types: Bar, Box, Pie, Heat Map, Ring
- Part of the standard Visual Style Editor
- Everything is saved into session files
User Type I
- So-called "bench biologists"
- Heavy users of Excel
- Domain experts who produce the data
- But when it comes to analysis and visualization, the answer is often "not yet…"
User Type II
- Bioinformaticians
- Use tools such as Python + SciPy/NumPy, R + Bioconductor, and MATLAB on a daily basis
- Write libraries when needed
- Make heavy use of large-scale computing resources
- "Manual work is evil!"
Both types of users are important!
- Type I: "Bench Biologists"
- Domain experts
- Data producers
- Type II: Computational Biologists
- Experts in large-scale data analysis
- Especially important for genome-scale data analysis
Cytoscape has few features focused on this group…
User Type II
- Bioinformaticians
- Use tools such as Python + SciPy/NumPy, R + Bioconductor, and MATLAB on a daily basis
- Write code when needed
- Make heavy use of large-scale computing resources
- "Manual work is evil!"
Requests from Type II Users
- I have 200 networks in my session and I need to create
one PDF per view. How can I do it with Cytoscape?
- I need to use igraph for network analysis, but its
visualization feature is limited. I want to use Cytoscape
as an external visualization engine for R.
- Usually I use IPython Notebook to record my work.
How can I integrate Cytoscape into my workflow?
- I want to generate Style for each time point and create
small multiples of networks.
Changes in Software Development Style
- An application is a collection of smaller services
- JavaScript is a first-class citizen in the world of
programming languages
- Design application with cloud services in mind
In the modern era, software is commonly delivered as a
service: called web apps, or software-as-a-service. The twelve-factor
app is a methodology for building software-as-a-service apps that:
• Use declarative formats for setup automation, to minimize time and
cost for new developers joining the project
• Have a clean contract with the underlying operating system, offering
maximum portability between execution environments
• Are suitable for deployment on modern cloud platforms, obviating
the need for servers and systems administration
• Minimize divergence between development and production,
enabling continuous deployment for maximum agility
• And can scale up without significant changes to tooling,
architecture, or development practices.
This MANIFESTO counters current trends in bioinformatics where
institutes and companies are creating monolithic software solutions
aimed mostly at end-users.
– THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS
“Every single tool should do the smallest possible
task really well”
Trends in Data Analysis Tools
- Python is becoming the standard language for "Data Scientists"
- Python itself is a slow language, but it is a perfect glue language
- Lots of tools are made by scientists (e.g. Anaconda by Continuum)
- They understand the current problems in modern scientific computing and are trying to solve them
- Visualization needs vary, especially for complex data sets like those from the life science domain
- For that purpose, Java is not the best
language to implement applications
- Even large-scale data visualization
applications are moving to the web
browsers
- Canvas (Cytoscape.js), WebGL
(Three.js), SVG (D3.js)
- Most of the talented hackers are
working on the web browsers, i.e.,
JavaScript
Challenges in Scientific Computing Environments
- No more free lunch
- Even if you buy expensive machines, you no longer get a free performance gain. You have to design your code for a massively distributed environment. (From Scale-up to Scale-out)
- Complex Data Analysis Pipeline
- Need for complex, customized data visualization
- Reproducibility
- Building the pipeline is itself complex, which makes reproducibility hard to ensure
Srivas, Rohith et al. “Assembling Global Maps of Cellular Function through
Integrative Analysis of Physical and Genetic Networks.” Nature Protocols
6.9 (2011): 1308–1323. PMC. Web. 1 Dec. 2014.
History of PanGIA Application
- Stages shown in the diagram: Biological Problem; Core algorithms 1 to n as Python; Java Implementation of Algorithms; Cytoscape 2.x Plugin; Cytoscape 3.x App; PanGIA Service (Implement in Python again…?)
- Contributor labels in the diagram: by Sourav; by Greg, Rohith; by Greg, Rohith and Cytoscape Team; by David
NeXO Web
- Term Enrichment Analysis
- From a list of genes, perform a hypergeometric test over the set of machine-generated ontology (NeXO) terms and display the terms with their p-values
- It is independent of all other parts of the NeXO Web application
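The hypergeometric test mentioned above can be sketched in a few lines. This is only an illustration, not the NeXO implementation: it uses the standard library instead of SciPy so it runs anywhere, and the function names, the term-to-genes dictionary, and the assumption that every query gene belongs to the background population are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Upper-tail probability P(X >= hits) of the hypergeometric distribution.

    population: number of genes in the background
    annotated:  number of background genes assigned to the term
    sample:     number of genes in the query list
    hits:       number of query genes assigned to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def enrich(query_genes, term_to_genes, population_size):
    """Return (term, p_value) pairs sorted by ascending p-value.

    Terms that share no gene with the query are skipped.
    """
    query = set(query_genes)
    results = []
    for term, genes in term_to_genes.items():
        hits = len(query & set(genes))
        if hits == 0:
            continue
        p = hypergeom_pvalue(population_size, len(genes), len(query), hits)
        results.append((term, p))
    return sorted(results, key=lambda pair: pair[1])
```

In practice the p-values would normally also be corrected for multiple testing, which this sketch leaves out.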
Overview of NeXO Term Enrichment Service
- Term Enrichment Service: API by Flask, Python Core, SciPy, NumPy
- NeXO Web RESTful API
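The layering in this overview, a thin HTTP API in front of a Python core, can be sketched as follows. The slides build the API layer with Flask; to stay dependency-free this sketch swaps in a plain WSGI callable from the standard library instead. The JSON shape (`{"genes": [...]}` in, `{"results": ...}` out) and the names `make_app` and `post_json` are assumptions made for illustration, not the actual NeXO API.

```python
import io
import json

JSON_HEADER = [("Content-Type", "application/json")]


def make_app(analyze):
    """Wrap analyze(genes) -> JSON-serializable result as a WSGI application."""

    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            start_response("405 Method Not Allowed", JSON_HEADER)
            return [b'{"error": "use POST"}']
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
            payload = json.loads(environ["wsgi.input"].read(length))
            genes = list(payload["genes"])
        except (ValueError, KeyError, TypeError):
            start_response("400 Bad Request", JSON_HEADER)
            return [b'{"error": "expected a JSON body with a genes list"}']
        # The core stays a plain Python function; the API layer only
        # translates between HTTP/JSON and that function.
        body = json.dumps({"results": analyze(genes)}).encode("utf-8")
        start_response("200 OK", JSON_HEADER)
        return [body]

    return app


def post_json(app, payload):
    """Call the WSGI app in-process (no server) and return (status, decoded JSON)."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app(environ, start_response))
    return seen["status"], json.loads(body)
```

Because the core function is passed in, the same wrapper could be served locally with `wsgiref.simple_server.make_server` or placed behind any WSGI server, while the analysis code itself remains usable as an ordinary library.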
Option 1: As a Cytoscape App
- Re-implement this algorithm as a Cytoscape App
(Java Application)
- Pros:
- Easy to install
- Cons:
- A lot of work…
- Should be written in Java
- Does not scale out!
Option 2: As a Service
- Wrap existing applications and deploy them to the platform of the users’ choice:
- Laptops, private servers, and commercial cloud services (AWS, Google Cloud Platform, etc.)
- Pros:
- Scales out
- Client-independent
- Workflow-friendly
- Cons:
- Need to adapt to a new way of designing software
- Relatively more complex deployment
– THE SMALL TOOLS MANIFESTO FOR BIOINFORMATICS
“Every single tool should do the smallest possible
task really well”
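Tools Going Forward
- Best practice: design an application as a collection of small services
- Quite literally, aim for each service to also work as an independent application
- So that it runs on a desktop as well as on a cluster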
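One way to read these bullets is that each piece should be usable both as a library function and as a standalone filter. The sketch below shows that shape for a deliberately tiny, hypothetical task (de-duplicating a list of gene symbols); the task and the names `unique_genes` and `main` are illustrative assumptions, not tools from the talk.

```python
import sys


def unique_genes(lines):
    """Yield each non-empty gene symbol once, in first-seen order."""
    seen = set()
    for line in lines:
        symbol = line.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            yield symbol


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Filter entry point: read symbols line by line, write the unique ones."""
    for symbol in unique_genes(stdin):
        stdout.write(symbol + "\n")
    return 0
```

Calling `main()` from a `__main__` guard turns this into a command-line filter. Since it only talks to standard input and output, the same file can run on a laptop, inside a pipeline, or as a job on a cluster, while `unique_genes` remains importable by other services.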
Software Distribution Problem
- “It-worked-on-my-machine” syndrome
- This is a serious problem especially when
you want to share your workflow with
collaborators.
What is Docker?
- Container to run applications in an isolated environment
- Application = Layer of images
- Sharable Environments
- Environments as code
We (the NIH) Are Working On, But As Yet Do Not Have Good Answers To:
1. Today, how much are we actually
spending on data and software related
activities?
2. How much should we be spending to
achieve the maximum benefit to
biomedical science relative to what we
spend in other areas?
Biomedical Research as an Open Digital Enterprise by Philip E. Bourne Ph.D.
Associate Director for Data Science (NIH)
Reproducibility
- Most of the 27 Institutes and Centers of the NIH are currently reviewing the ability to reproduce research they are funding
- The NIH recently convened a meeting with publishers to discuss the issue – a set of guiding principles arose
Biomedical Research as an Open Digital Enterprise by Philip E. Bourne Ph.D.
Associate Director for Data Science (NIH)
NIH The Commons
(Definition by Dr. Bourne)
• Is Not:
  • A database
  • Confined to one physical location
  • A new large infrastructure
  • Owned by any one group
• Is:
  • A conceptual framework
  • Analogous to the Internet
  • A collaboratory
  • A few shared rules
  • All research objects have unique identifiers
  • All research objects have limited provenance