21. Dat
open source tool for sharing and
collaborating on data
Started August '13; grant funded and 100% open source
dat-data.com
#dat on freenode (public)
gitter.im/datproject/discussions
Dat Community Call #1
23. Dat - "git for data"
npm install -g dat
dat init
collect-data | dat import
dat listen
25. Dat
dat clone
dat pull --live
dat blobs put mygenome data.fasta
dat cat | transform
dat cat | docker run -i transform
http://eukaryota.dathub.org
34. Some problems I faced during my research:
Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs
For web projects, needed to implement the same functionality on browser and server
Difficulty writing scalable, reproducible and complex bioinformatic pipelines
37. Bionode - list of modules
Name          Type         Status
ncbi          Data access  production
fasta         Parser       production
seq           Wrangling    production
ensembl       Data access  production
blast-parser  Parser       production
38. Bionode - list of modules
Name                  Type           Status
template              Documentation  production
JS pipeline           Documentation  production
Gasket pipeline       Documentation  production
Dat/Bionode workshop  Documentation  production
39. Bionode - list of modules
Name  Type      Status
sra   Wrappers  development
bwa   Wrappers  development
sam   Wrappers  development
bbi   Parser    development
40. Bionode - list of modules
Name      Type         Status   People
ebi       Data access  request
semantic  Data access  request
vcf       Parser       request
gff       Parser       request
bowtie    Wrappers     request
sge       Wrappers     request  badryan
blast     Wrappers     request
41. Bionode - list of modules
Name     Type      People
vsearch  Wrappers
khmer    Wrappers
rsem     Wrappers
gmap     Wrappers
star     Wrappers
go       Wrappers  badryan
42. Bionode - Why wrappers?
Same interface between modules (Streams and NDJSON)
Easy installation with NPM
Semantic versioning
Add tests
Abstract complexity / More user friendly
44. Need to reimplement the same code on browser and server.
Solution: JavaScript everywhere
Afra -> bionode-seq
GeneValidator -> bionode-seq, bionode-fasta
SequenceServer
BioJS: collaborating for code reuse
Biodalliance: converting to bionode
51. Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs
Python example: URL for the Acromyrmex assembly?
Solution: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG
import xml.etree.ElementTree as ET
from Bio import Entrez
Entrez.email = "mail@bmpvieira.com"
esearch_handle = Entrez.esearch(db="assembly", term="Acromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id)
    esummary_record = Entrez.read(esummary_handle)
    documentSummarySet = esummary_record['DocumentSummarySet']
    document = documentSummarySet['DocumentSummary'][0]
    metadata_XML = document['Meta'].encode('utf-8')
    metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
    for entry in metadata[1]:
        print entry.text
bionode-ncbi
52. Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs
JavaScript example: URL for the Acromyrmex assembly?
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function (urls) {
  console.log(urls[0].genomic.fna)
})
bio.ncbi.urls('assembly', 'Acromyrmex').on('data', printGenomeURL)
function printGenomeURL (urls) {
  console.log(urls[0].genomic.fna)
}
53. Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs
JavaScript and BASH examples: URL for the Acromyrmex assembly?
http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz
var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
  .pipe(ndjson.stringify())
  .pipe(process.stdout)
bionode-ncbi urls assembly Acromyrmex |
tool-stream extractProperty genomic.fna
54. Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere
var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()
62. Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
datscript
pipeline main
  run pipeline import
pipeline import
  run foobar | run dat import --json
bmpvieira example
ekg example
66. Bionode - Modular and universal bioinformatics
Pipeable UNIX command line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and browser.
Dat - Build data pipelines
Provides a streaming interface between every file format and data storage backend. "git for data"
bionode.io
#bionode
gitter.im/bionode/bionode
dat-data.com
#dat
gitter.im/datproject/discussions