Building
collaborative
workflows for
scientific data
bmpvieira.com/orcambridge14
Bruno Vieira @bmpvieira
PhD Student @ Bioinformatics and Population Genomics
Supervisor: Yannick Wurm @yannick__
© 2014 Bruno Vieira CC-BY 4.0
Sequencing cost drops
Sequencing data rises
Goodbye Excel/Windows
Hello command line
Hello super computers
Programming
Reproducibility crisis
Losing data
Reproducibility layers
Code
Data
Workflow
Environment
Code
The GitHub for Science...
is GitHub!
Code as a research output
Reproducibility layers
Code
Data
Workflow
Environment
Data
Dat
open source tool for sharing and
collaborating on data
started August '13; grant funded
and 100% open source
dat-data.com
#dat (public channel on freenode)
gitter.im/datproject/discussions
Dat Community Call #1
Dat - "git for data"
npm install -g dat
dat init
collect-data | dat import
dat listen
Dat
dat clone
dat pull --live
dat blobs put mygenome data.fasta
dat cat | transform
dat cat | docker run -i transform
http://eukaryota.dathub.org
Dat
Planned
dat checkout revision
dat diff
dat branch
multi-master replication
sync to databases
registry
Data stored locally in LevelDB, but can use
other backends such as:
Postgres
Redis
etc.
Files stored in blob-stores:
s3
local-fs
bittorrent
ftp
etc.
Dat features
auto schema generation
free REST API
all APIs are streaming
Dat workshop
maxogden.github.io/get-dat
Dat quick deploy
github.com/bmpvieira/heroku-dat-template
Reproducibility layers
Code
Data
Workflow
Environment
Workflow
Bionode
open source project for modular and
universal bioinformatics
started January '14
bionode.io
Some problems I faced
during my research:
Difficulty getting relevant descriptions and
datasets from NCBI API using bio* libs
For web projects, needed to implement
the same functionality on browser and
server
Difficulty writing scalable, reproducible
and complex bioinformatic pipelines
Bionode also collaborates with BioJS
Bionode
npm install -g bionode
bionode ncbi download gff bacteria
bionode ncbi download sra arthropoda |
bionode sra fastq-dump
npm install -g bionode-ncbi
bionode-ncbi search assembly formicidae |
dat import --json
Bionode - list of modules
Name          Type         Status      People
ncbi          Data access  production
fasta         Parser       production
seq           Wrangling    production  IM
ensembl       Data access  production
blast-parser  Parser       production
Bionode - list of modules
Name                  Type           Status
template              Documentation  production
JS pipeline           Documentation  production
Gasket pipeline       Documentation  production
Dat/Bionode workshop  Documentation  production
Bionode - list of modules
Name  Type      Status
sra   Wrappers  development
bwa   Wrappers  development
sam   Wrappers  development
bbi   Parser    development
Bionode - list of modules (status: requested)
Name      Type         People
ebi       Data access
semantic  Data access
vcf       Parser
gff       Parser
bowtie    Wrappers
sge       Wrappers     badryan
blast     Wrappers
Bionode - list of modules (status: requested)
Name     Type      People
vsearch  Wrappers
khmer    Wrappers
rsem     Wrappers
gmap     Wrappers
star     Wrappers
go       Wrappers  badryan
Bionode - Why wrappers?
Same interface between modules
(Streams and NDJSON)
Easy installation with NPM
Semantic versioning
Add tests
Abstract complexity / More user friendly
Bionode - Why Node.js?
Same code client/server side
Need to reimplement the same code on
browser and server.
Solution: JavaScript everywhere
Afra -> bionode-seq
GeneValidator -> bionode-seq, bionode-fasta
SequenceServer
BioJS collaborating for code reuse
Biodalliance converting to bionode
Bionode - Why Node.js?
Reusable, small and tested
modules
Benefit from other JS
projects
Dat BioJS NoFlo
Difficulty getting relevant descriptions and
datasets from NCBI API using bio* libs
Python example: URL for the Acromyrmex
assembly?
Solution:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG
import xml.etree.ElementTree as ET
from Bio import Entrez
Entrez.email = "mail@bmpvieira.com"
esearch_handle = Entrez.esearch(db="assembly", term="Acromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id)
    esummary_record = Entrez.read(esummary_handle)
    documentSummarySet = esummary_record['DocumentSummarySet']
    document = documentSummarySet['DocumentSummary'][0]
    metadata_XML = document['Meta'].encode('utf-8')
    # Meta is an XML fragment with no single root, so wrap it first
    metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
    for entry in metadata[1]:
        print entry.text
bionode-ncbi
Difficulty getting relevant descriptions and
datasets from NCBI API using bio* libs
Example: URL for the Acromyrmex
assembly?
JavaScript
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function(urls) {
  console.log(urls[0].genomic.fna)
})

bio.ncbi.urls('assembly', 'Acromyrmex').on('data', printGenomeURL)
function printGenomeURL(urls) {
  console.log(urls[0].genomic.fna)
}
Difficulty getting relevant descriptions and
datasets from NCBI API using bio* libs
Example: URL for the Acromyrmex
assembly?
JavaScript
BASH
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
.pipe(ndjson.stringify())
.pipe(process.stdout)
bionode-ncbi urls assembly Acromyrmex |
tool-stream extractProperty genomic.fna
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
Solution: Node.js Streams everywhere
var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
Solution: Node.js Streams everywhere
ncbi
.search('sra', 'Solenopsis invicta')
.pipe(fork1)
.pipe(dat.reads)
fork1
.pipe(tool.extractProperty('expxml.Biosample.id'))
.pipe(ncbi.search('biosample'))
.pipe(dat.samples)
fork1
.pipe(tool.extractProperty('uid'))
.pipe(ncbi.link('sra', 'pubmed'))
.pipe(ncbi.search('pubmed'))
.pipe(fork2)
.pipe(dat.papers)
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
bionode-ncbi search genome Guillardia theta |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid |
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished |
tool-stream extractProperty sam |
bionode-sam
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
bionode-example-dat-gasket
get-dat workshop
get-dat bionode gasket example
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
{
"import-data": [
"bionode-ncbi search genome eukaryota",
"dat import --json --primary=uid"
],
"search-ncbi": [
"dat cat",
"grep Guillardia",
"tool-stream extractProperty assemblyid",
"bionode-ncbi download assembly -",
"tool-stream collectMatch status completed",
"tool-stream extractProperty uid",
"bionode-ncbi link assembly bioproject -",
"tool-stream extractProperty destUID",
"bionode-ncbi link bioproject sra -",
"tool-stream extractProperty destUID",
"grep 35526",
"bionode-ncbi download sra -",
"tool-stream collectMatch status completed",
"tee > metadata.json"
],
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
"index-and-align": [
"cat metadata.json",
"bionode-sra fastq-dump -",
"tool-stream extractProperty destFile",
"bionode-bwa mem **/*fna.gz"
],
"convert-to-bam": [
"bionode-sam 35526/SRR070675.sam"
]
}
Difficulty writing scalable, reproducible and
complex bioinformatic pipelines.
datscript
pipeline main
run pipeline import
pipeline import
run foobar | run dat import --json
bmpvieira example
ekg example
Reproducibility layers
Code
Data
Workflow
Environment
Environment
Docker for reproducible
science
docker run bmpvieira/thesis
Bionode - Modular and universal bioinformatics
Pipeable UNIX command line tools and
JavaScript / Node.js APIs for bioinformatic
analysis workflows on the server and browser.
Dat - Build data pipelines
Provides a streaming interface between every file
format and data storage backend. "git for data"
Bionode.io
#bionode
gitter.im/bionode/bionode
Dat-data.com
#dat
gitter.im/datproject/discussions
Acknowledgements
@yannick__
@maxogden
@mafintosh
@erikgarrison
@QM_SBCS
@opendata
Bionode contributors
Thanks!
"Science should work as an
Open Source project"
dat-data.com
bionode.io
