Bridging the gap: Enabling top research in translational research - Knut Reinert


In this talk I will convey my view of the steps necessary to enable efficient biomedical research at a time when biotechnology can give us comprehensive views of certain data.
I will start by arguing that the NGS technologies developed in recent years have changed the research landscape to a degree similar to the beginning of the millennium, when the human genome was first sequenced.
As a consequence, the research tools of many biomedical researchers have changed or will change, in the sense that these researchers will conduct large-scale, complex computations. Hence, as a community, we have to turn our focus to how we develop such tools.
Thinking about this becomes essential because in the near future clinical decisions concerning the treatment of individuals (personalised medicine) will be based on such computations. I will talk about the past and future role of software libraries in enabling translational research and illustrate some points with the SeqAn C++ library developed in my lab.

Published in Health & Medicine , Technology

Transcript

  • 1. Prof. Dr. Knut Reinert Algorithmische Bioinformatik, FB Mathematik und Informatik Bridging the gap: Enabling top research in translational research Knut Reinert Freie Universität Berlin Institute for Computer Science Australia, 10/13
  • 2. ~ 13 years ago... Data volume and cost: In 2000 the 3 billion base pairs of the human genome were sequenced for about 3 billion US dollars. 100 million bp per day
  • 3. Sequencing today... Illumina HiSeq: 100 billion bp per day. Within roughly ten years sequencing has become about 10 million times cheaper
  • 4. Sequencing earth? 10^7 species x 10^8 genome size => earth genome has 10^15 bp. 10^4 HiSeqs can each sequence 10^11 bp per day => earth genome at 10x in 10 days
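The arithmetic behind the "sequencing earth" estimate can be spelled out as follows, reading "10x" as tenfold coverage of every genome (an interpretation; the slide does not define it):

```latex
\begin{align*}
\text{earth genome} &= 10^{7}\ \text{species} \times 10^{8}\ \text{bp per genome} = 10^{15}\ \text{bp} \\
\text{throughput} &= 10^{4}\ \text{HiSeqs} \times 10^{11}\ \text{bp per day} = 10^{15}\ \text{bp per day} \\
\text{time at } 10\times &= \frac{10 \times 10^{15}\ \text{bp}}{10^{15}\ \text{bp per day}} = 10\ \text{days}
\end{align*}
```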
  • 5. Future of NGS data analysis
  • 6. Why is translational research hard? Algorithmicists, Mathematicians: Algorithm 1 quality 0.75, Algorithm 2 quality 0.45, Algorithm 3 quality 0.95, Algorithm 4 quality 0.85. Engineers (Software), Engineers (Hardware): Implementation 1 quality 0.65, Implementation 2 quality 0.75, Implementation 3 quality 0.85. Medical doctors/Biologists, Biomedical Modelers: Result 1 quality 0.75, Result 2 quality 0.95, Result 3 quality 0.35. Best chain: 0.95*0.85*0.95 ≈ 0.77
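The point of the slide is that quality multiplies along the chain from algorithm to implementation to result, so even the best option at every stage (0.95, 0.85, 0.95) yields only about 0.767 overall. A minimal sketch of that calculation, assuming the purely multiplicative model the slide implies; the function name `bestChainQuality` is hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Overall quality of a chain of stages under a multiplicative model:
// take the best available option at each stage and multiply them.
double bestChainQuality(const std::vector<std::vector<double>>& stages)
{
    double quality = 1.0;
    for (const std::vector<double>& options : stages)
    {
        assert(!options.empty());  // every stage needs at least one option
        quality *= *std::max_element(options.begin(), options.end());
    }
    return quality;
}
```

Called with the three quality lists from the slide (algorithms, implementations, results), this returns 0.95 * 0.85 * 0.95. Any weaker choice at a single stage lowers the whole product, which is the argument for being very good on all levels.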
  • 7. Software libraries bridge gap. Experimentalists: Structural variants, RNA-Seq, ChIP-Seq, Metagenomics abundance, Sequence assembly, Cancer genomics. Levels: Analysis pipelines, Maintainable tool, Prototype implementation, Algorithm design, Theoretical Considerations. Algorithm libraries: We need to be very good on all levels. Computer Scientists: FM-index, Multicore, Suffix arrays, Secondary memory, Fast I/O, K-mer filter, Hardware acceleration
  • 8. This talk: Translational research
  • 9. This talk: SeqAn Content, SeqAn Performance
  • 10. SeqAn Now: SeqAn/SeqAn tools have been cited more than 360 times. SeqAn is under the BSD license and hence free for academic AND commercial use. Among the institutions are (omitting German institutes): Department of Genetics, Harvard Medical School, Boston; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton; J. Craig Venter Institute, Rockville MD, USA; Department of Molecular Biology, Princeton University; Applied Mathematics Program, Yale University, New Haven; IBM T.J. Watson Research Center, Yorktown Heights; The Ohio State University, Columbus; University of Minnesota; Australian National University, Canberra; Department of Statistics, University of Oxford; Swedish University of Agricultural Sciences (SLU), Uppsala; Graduate School of Life Sciences, University of Cambridge; Broad Institute, Cambridge, USA; EMBL-EBI; University of California; University of Chicago; Iowa State University, Ames; The Pennsylvania State University; Peking University, Beijing; University of Science and Technology of China; BGI-Shenzhen, China; Beijing Institute of Genomics……
  • 11. SeqAn developers [Chart: number of developers per year, 2003 to 2012, scale 0 to 16, by funding source: External, CSC, BMBF, DFG, IMPRS, FU]
  • 12. SeqAn Content - SDK
  • 13. SeqAn SDK Components - Tutorials
  • 14. SeqAn SDK Components – Reference Manual
  • 15. SeqAn SDK Components: Review Board to ensure code quality; CDash/CTest to automatically compile and test across platforms; Code coverage reports
  • 16. SeqAn Content: algorithms & data structures
  • 17. Unified Alignment Algorithms: Versatile & Extensible DP-Interface. For Example ... Standard DP-Algorithms: Global & Semi Global Alignments, Local Alignments. Modified DP-Algorithms: Split Breakpoint Detection, Banded Chain Alignment
  • 18. Unified Alignment Algorithms. Configure: Central Configuration DPProfile_<TAlgorithm, TGaps, TTrace>. Compile: … selects code snippets accordingly … generates desired DP-Algorithm. Run: Unbanded DP-Algorithms, Banded DP-Algorithms
  • 19. Unified Alignment Algorithms. For Example ... Needleman-Wunsch with Traceback: DPProfile_<GlobalAlignment_<>, LinearGaps, TracebackOn<> >. Semi-Global Gotoh without Traceback: DPProfile_<GlobalAlignment_<FreeEndGaps_<True, False, True, False> >, AffineGaps, TracebackOff>. Banded Smith-Waterman with Affine Gap Costs: DPBand_<BandOn>(lowerDiag, upperDiag), DPProfile_<LocalAlignment_<>, AffineGaps, TracebackOn<> >. Split-Breakpoint Detection for Right Anchor: DPProfile_<SplitAlignment_<>, AffineGaps, TracebackOn<GapsRight> >
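The idea of selecting an alignment variant at compile time through a tag type can be sketched with the standard library alone. The following is a simplified illustration, not SeqAn's DPProfile_ machinery: the tags `GlobalTag` and `LocalTag`, the traits class `DPTraits`, and the function `alignScore` are hypothetical names, and only linear gap costs without traceback are covered.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tags that select the algorithm variant at compile time.
struct GlobalTag {};  // Needleman-Wunsch style scoring
struct LocalTag {};   // Smith-Waterman style scoring

template <typename TAlgorithm> struct DPTraits;
template <> struct DPTraits<GlobalTag> { static constexpr bool isLocal = false; };
template <> struct DPTraits<LocalTag>  { static constexpr bool isLocal = true; };

// One DP kernel with linear gap costs; the tag decides initialisation,
// clamping at zero, and where the final score is read from.
template <typename TAlgorithm>
int alignScore(const std::string& a, const std::string& b,
               int match = 1, int mismatch = -1, int gap = -1)
{
    constexpr bool isLocal = DPTraits<TAlgorithm>::isLocal;
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = isLocal ? 0 : static_cast<int>(j) * gap;
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        cur[0] = isLocal ? 0 : static_cast<int>(i) * gap;
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int score = std::max({diag, prev[j] + gap, cur[j - 1] + gap});
            if (isLocal)
                score = std::max(score, 0);  // a local alignment may restart anywhere
            cur[j] = score;
            best = std::max(best, score);
        }
        std::swap(prev, cur);
    }
    return isLocal ? best : prev[b.size()];
}
```

`alignScore<GlobalTag>("ACGT", "AGT")` and `alignScore<LocalTag>(...)` share all of the kernel code; because the branch conditions are compile-time constants, the compiler can drop the unused paths, which is the spirit of the "configure, compile, run" slide.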
  • 20. Unified Alignment Algorithms ... And how much slower is that? (affine alignment of 10kb Dengue virus sequences) [Bar chart: time [s], scale 0 to 10, for SeqAn, NCBI, GGSEARCH, NEEDLE]
  • 21. Support for Common File Formats. Important file formats for HTS analysis. Sequences: FASTA, FASTQ, Indexed FASTA (FAI) for random access. Genomic Features: GFF 2, GFF 3, GTF, BED. Read Mapping: SAM, BAM (plus BAM indices). Variants: VCF. SequenceStream ss("file.fa.gz"); while (!atEnd(ss)) { readRecord(id, seq, ss); cout << id << '\t' << seq << '\n'; } BamStream bs("file.bam"); while (!atEnd(bs)) { readRecord(record, bs); cout << record.qName << '\t' << record.pos << '\n'; } … or write your own parser: Tutorials and helper routines for writing your own parsers.
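As an illustration of the "write your own parser" route, here is a minimal FASTA reader built only on the C++ standard library. It is a sketch under simplifying assumptions (uncompressed input, no validation of sequence characters); `FastaRecord`, `readFasta`, and `parseFasta` are hypothetical names, not SeqAn functions.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct FastaRecord
{
    std::string id;   // header line without the leading '>'
    std::string seq;  // sequence with line breaks removed
};

// Reads all records from a FASTA stream. Blank lines are skipped and
// Windows line endings are tolerated.
std::vector<FastaRecord> readFasta(std::istream& in)
{
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        if (line.empty())
            continue;
        if (line[0] == '>')
            records.push_back(FastaRecord{line.substr(1), std::string()});
        else if (!records.empty())
            records.back().seq += line;  // sequences may span several lines
    }
    return records;
}

// Convenience wrapper for in-memory text.
std::vector<FastaRecord> parseFasta(const std::string& text)
{
    std::istringstream in(text);
    return readFasta(in);
}
```

For a file, pass a `std::ifstream` to `readFasta` and print `record.id` and `record.seq` separated by a tab, mirroring the loop on the slide.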
  • 22. Journaled Sequences: Store Multiple Genomes, Save Storage Capacities. String<Dna, Journaled<Alloc<> > > StringSet<TJournaled, Owner<JournalSet> > set; setGlobalReference(set, refSeq); appendValue(set, seq1); join(set, idx, JoinConfig<>()); [Figure: reference Ref and journaled genomes G1, G2, ..., GN]
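The storage saving comes from keeping one shared reference and, per genome, only a journal of differences. The sketch below shows that idea in plain C++; it is not SeqAn's Journaled string, and the names `Edit`, `JournaledSeq`, `addEdit`, and `materialize` are hypothetical. Edits are assumed to be added in increasing, non-overlapping reference order.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One difference relative to the reference: replace delLen reference
// characters starting at refPos by the string ins (either may be empty).
struct Edit
{
    std::size_t refPos;
    std::size_t delLen;
    std::string ins;
};

class JournaledSeq
{
public:
    explicit JournaledSeq(std::shared_ptr<const std::string> ref) : ref_(std::move(ref)) {}

    void addEdit(std::size_t refPos, std::size_t delLen, std::string ins)
    {
        assert(refPos + delLen <= ref_->size());
        assert(edits_.empty() || edits_.back().refPos + edits_.back().delLen <= refPos);
        edits_.push_back(Edit{refPos, delLen, std::move(ins)});
    }

    // Rebuilds the full sequence by interleaving reference segments and edits.
    std::string materialize() const
    {
        std::string out;
        std::size_t cursor = 0;  // next reference position not yet copied
        for (const Edit& e : edits_)
        {
            out.append(*ref_, cursor, e.refPos - cursor);
            out += e.ins;
            cursor = e.refPos + e.delLen;
        }
        out.append(*ref_, cursor, std::string::npos);
        return out;
    }

    std::size_t journalSize() const { return edits_.size(); }

private:
    std::shared_ptr<const std::string> ref_;  // shared by all genomes
    std::vector<Edit> edits_;                 // the per-genome journal
};
```

Many `JournaledSeq` objects can point at the same reference, so each additional genome costs memory proportional to its number of differences rather than to its length.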
  • 23. Journaled stringset benchmark: 1091 x chr. 22 (~60 GB). Task: run a sequential algorithm over all strings (in the demo, the Horspool algorithm). Use core parallelism AND data parallelism (JournaledStringSet). Timings with and without I/O
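For reference, the Horspool algorithm used as the benchmark task is an exact pattern search that shifts the search window according to the last character it covers. A compact standard-library version follows; this is a textbook sketch, not the SeqAn implementation used in the benchmark, and the function name `horspool` is chosen here for illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Returns the start positions of all (possibly overlapping) occurrences
// of pattern in text using the Horspool bad-character shift.
std::vector<std::size_t> horspool(const std::string& text, const std::string& pattern)
{
    std::vector<std::size_t> hits;
    const std::size_t m = pattern.size();
    const std::size_t n = text.size();
    if (m == 0 || m > n)
        return hits;

    // Shift table: distance from the last occurrence of each character
    // (excluding the final pattern position) to the pattern end.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n)
    {
        if (text.compare(pos, m, pattern) == 0)
            hits.push_back(pos);
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return hits;
}
```

Because each string is scanned independently, running this over a set of sequences parallelises naturally across cores, which is the "core parallelism" half of the benchmark.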
  • 24. Timings without I/O (secs) [Chart: seconds, scale 0 to 180, at 1, 2, 4, 8; series SS; JSS, NO DP; JSS, DP] JSS: about 4x slower (no DP), 5x faster (with DP)
  • 25. Timings with I/O (min) [Bar chart: minutes, scale 0 to 14, for SS; JSS, NO DP; JSS, DP] ~ 30 times faster
  • 26. Fragment Store: All-In-One Data Structure for HTS. Designed to represent: • reads, mate-pairs, reference genomes • pairwise alignments • genome annotations. Easy-to-use interface and high-level functions for typical workflows. Genome Annotations: Annotation files can easily be imported (GFF/GTF/UCSC): FragmentStore<> store; read(file, store, Gff()); The annotation tree can be traversed with iterators: Iterator<TStore, AnnotationTree<> > it(store); goDown(it); [Figure: annotation tree with nodes root, gene, mRNA, exon, CDS]
  • 27. Fragment Store: (Multi) Read Alignments. Read alignments can be easily imported: std::ifstream file("ex1.sam"); read(file, store, Sam()); … and accessed as a multiple alignment, e.g. for visualization: AlignedReadLayout layout; layoutAlignment(layout, store); printAlignment(svgFile, Raw(), layout, store, 1, 0, 150, 0, 36);
  • 28. Unified Full-Text Indexing Framework. Available Indices: Suffix Trees: • suffix array • enhanced suffix array • lazy suffix tree. Prefix Trie: • FM-index. q-Gram Indices: • direct addressing • open addressing • gapped. Index<TSeq, IndexEsa<> > Index<StringSet<TSeq>, FMIndex<> > All indices support multiple strings and external memory construction/usage. Index Lookup Interface: All indices support the (sequential) find interface: Finder<TIndex> finder(index); while (find(finder, "TATAA")) cout << "Hit at position " << position(finder) << endl;
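What an index-backed find does can be shown with the simplest of the listed indices, a plain suffix array. The sketch below builds the array naively and locates all occurrences of a pattern by binary search; `buildSuffixArray` and `findAll` are hypothetical names, and the quadratic-time construction is only suitable for small demonstrations, unlike the library's construction algorithms.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort all suffix start positions lexicographically.
std::vector<std::size_t> buildSuffixArray(const std::string& text)
{
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// All start positions of pattern in text, found by two binary searches
// over the suffix array; the result is returned in increasing order.
std::vector<std::size_t> findAll(const std::string& text,
                                 const std::vector<std::size_t>& sa,
                                 const std::string& pattern)
{
    const std::size_t m = pattern.size();
    // Suffixes whose first m characters are smaller than the pattern come first.
    auto lower = std::partition_point(sa.begin(), sa.end(), [&](std::size_t pos) {
        return text.compare(pos, m, pattern) < 0;
    });
    // They are followed by one contiguous block of suffixes that start with the pattern.
    auto upper = std::partition_point(lower, sa.end(), [&](std::size_t pos) {
        return text.compare(pos, m, pattern) == 0;
    });
    std::vector<std::size_t> hits(lower, upper);
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

Once the array is built, each query costs a logarithmic number of string comparisons regardless of how many patterns are searched, which is why indices pay off when many queries hit the same genome.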
  • 29. Unified Full-Text Indexing Framework. String Tree Interface: Suffix/prefix trees can be accessed with iterators that support different traversals: • top-down • depth-first search • random. Iterator<TIndex, TopDown<> >::Type it(index); goDown(it, 'a'); [Figure: suffix tree] Advanced Index Algorithms: Repeat search iterators • for maximal/supermaximal repeats and maximal unique matches. Pattern search with backtracking • parallel exact/approximate search of multiple patterns (tree vs. tree). q-gram filters • counting/seed filters for approximate pattern search and local alignments
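Counting filters of this kind rest on the classical q-gram lemma: a pattern of length n that matches with at most k errors shares at least n + 1 - q(k + 1) of its q-grams with the matching region. The sketch below computes that threshold and counts shared q-grams between two strings. It illustrates the principle only and is not the SWIFT or SeqAn filter; `qgramThreshold` and `sharedQGrams` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Minimum number of shared q-grams guaranteed by the q-gram lemma for a
// pattern of length n with at most k errors. A non-positive value means
// the filter cannot exclude anything for these parameters.
long long qgramThreshold(std::size_t n, std::size_t q, std::size_t k)
{
    return static_cast<long long>(n) + 1
         - static_cast<long long>(q) * (static_cast<long long>(k) + 1);
}

// Number of q-grams the two strings have in common, counted with
// multiplicity (size of the multiset intersection).
std::size_t sharedQGrams(const std::string& a, const std::string& b, std::size_t q)
{
    if (q == 0 || a.size() < q || b.size() < q)
        return 0;
    std::unordered_map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + q <= a.size(); ++i)
        ++counts[a.substr(i, q)];
    std::size_t shared = 0;
    for (std::size_t i = 0; i + q <= b.size(); ++i)
    {
        auto it = counts.find(b.substr(i, q));
        if (it != counts.end() && it->second > 0)
        {
            --it->second;  // consume one occurrence from a
            ++shared;
        }
    }
    return shared;
}
```

A filter discards every candidate region whose shared q-gram count falls below the threshold and passes the remainder on to a slower, exact verification step.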
  • 30. Applications
  • 31. STELLAR – exact local aligner ... finds all maximal ε-matches in short time. Filtering: Index module: sequence 1: Finder<Tsequence,Swift<SwiftLocal> >; sequence 2: Index <TSeq, IndexQGram>. Verification
  • 32. STELLAR – exact local aligner. Verification: Alignment module: Align <Tseq>, LocalAlignmentEnumerator<TScore, Banded>. Seeds module: Seed <Simple>, extendSeed(seed, ...,GappedXDrop())
  • 33. Stellar – exact local aligner: Stellar is based on a SWIFT filter and allows epsilon threshold matches with X-drops
  • 34. Exact vs. Heuristics: E-value 6x10^-84, not found by standard BLAST
  • 35. Splazers: split read mapping. Combination of split matches. SplazerS is based on a SWIFT filter and allows large indels
  • 36. Splazers: split read mapping. Acceptable speed and very sensitive
  • 37. SeqAn Performance
  • 38. Masai read mapper
  • 39. Masai read mapper: Reads, Genome (Chr. 1, Chr. 2, Chr. X), ACGCTTCATCGCCCT… Index of reads (Radix tree of seeds). Index of genome (e.g. FM-index). Algorithm is based on the simultaneous traversal of two string indices (e.g., FM-index, Enhanced suffix array, Lazy suffix tree)
  • 40. Read Mapping: Masai. Faster and more accurate than BWA and Bowtie2. Timings on a single core
  • 41. No bias in SNP/Sequencing error
  • 42. Easily exchange index….
  • 43. Collaboration to parallelize indices and verification algorithms in SeqAn, to speed up any applications making use of indices. What about multi-core implementation?
  • 44. SeqAn going parallel. GOAL: Parallelize the finder interface of SeqAn so it works on CPU and accelerators like GPU. Will be replaced by hg18 and 10 million 20-mers
  • 45. SeqAn going parallel: Construct FM-index on reverse genome. Set # OMP threads. Call generic count function
  • 46. SeqAn going parallel: NVIDIA GPUs. Copy needles and index to GPU. SAME count function as on CPU!
  • 47. SeqAn going parallel: Count occurrences of 10 million 20-mers in the human genome using an FM-index. I7, 3.2 GHz: 18.6 sec (1X); …12...: 2.66 sec (7X); Intel Xeon Phi 7120, 244 threads: 2.18 sec (8.5X); NVIDIA Tesla K20: 0.4 s (47X)
  • 48. SeqAn going parallel: Approx. count occurrences of 1.2 million 33-mers in the human genome using an FM-index. I7, 3.2 GHz: 66.1 s (1X); …12...: 9.0 s (7.3X); Intel Xeon Phi 7120, 244 threads: 3.9 s (16.9X); NVIDIA Tesla K20: 3.2 s (20.7X)
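The counting task in these benchmarks is the backward search of an FM-index: the pattern is processed from its last character to its first, narrowing an interval of the suffix array at each step. The toy version below uses only the standard library and stores a full occurrence table, which is far too memory-hungry for a genome; it is a sketch of the technique, not SeqAn's FM-index. The class name `TinyFMIndex` is hypothetical, and the input is assumed not to contain the NUL character, which serves as sentinel.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

class TinyFMIndex
{
public:
    explicit TinyFMIndex(std::string text)
    {
        text.push_back('\0');  // sentinel, smaller than every other character
        const std::size_t n = text.size();

        // Naive suffix array construction (fine for a demonstration only).
        std::vector<std::size_t> sa(n);
        std::iota(sa.begin(), sa.end(), std::size_t{0});
        std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });

        // first_[c]: number of characters in the text that are smaller than c.
        std::array<std::size_t, 256> freq{};
        for (unsigned char c : text)
            ++freq[c];
        std::size_t sum = 0;
        for (std::size_t c = 0; c < 256; ++c)
        {
            first_[c] = sum;
            sum += freq[c];
        }

        // occ_[i][c]: occurrences of c in the first i characters of the BWT.
        occ_.assign(n + 1, std::array<std::size_t, 256>{});
        for (std::size_t i = 0; i < n; ++i)
        {
            occ_[i + 1] = occ_[i];
            unsigned char bwtChar = text[(sa[i] + n - 1) % n];
            ++occ_[i + 1][bwtChar];
        }
    }

    // Number of occurrences of pattern, via backward search. An empty
    // pattern matches every suffix, including the sentinel.
    std::size_t count(const std::string& pattern) const
    {
        std::size_t lo = 0, hi = occ_.size() - 1;  // half-open interval [lo, hi)
        for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it)
        {
            unsigned char c = *it;
            lo = first_[c] + occ_[lo][c];
            hi = first_[c] + occ_[hi][c];
        }
        return hi - lo;
    }

private:
    std::array<std::size_t, 256> first_{};
    std::vector<std::array<std::size_t, 256>> occ_;
};
```

Since `count` only reads the index, different patterns can be counted by different threads without synchronisation, which is what makes this workload suitable for the multi-core and accelerator runs reported on the slides.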
  • 49. Workflow integration
  • 50. Generic workflow nodes
  • 51. Library Integration • Give every tool a self-describing output format: semantic annotation of its inputs/outputs • In OpenMS and SeqAn we developed CTD (Common Tool Description) for this purpose • Each tool can thus ‘tell’ its requirements and options in a coherent format • All interfaces are fully described by this format • Information on the tool's options and I/O is entirely contained within the individual tool (maintenance!)
  • 52. CTD Format: General tool description in header <tool name="MasaiMapper" version="0.7.1 [14053]" docurl="http://www.seqan.de" category="Read Mapping" > <executableName>masai_mapper</executableName> <description>Masai Mapper</description> <manual>Masai is a fast and accurate read mapper based on approximate seeds and multiple backtracking. See http://www.seqan.de/projects/masai for more information. (c) Copyright 2011-2012 by Enrico Siragusa. </manual> <cli> <clielement optionIdentifier="--write-ctd-file-ext" isList="false"> <mapping referenceName="masai_mapper.write-ctd-file-ext" /> </clielement> ........
  • 53. Node generator: Generic workflow nodes can generate nodes to be used in e.g. KNIME. https://github.com/genericworkflownodes It is compatible with both internal and external tools. This means ANY tool can be integrated in KNIME as long as it has a CTD.
  • 54. Workflows Enabling Software Re-Use: External tools
  • 55. Workflows Enabling Software Re-Use
  • 56. SeqAn projects. Topics: Genome comparison (Evolutionary models due to new breakpoint distance, Local alignments to find synteny blocks); Parallelization of indices (Multicore, GPU, Xeon Phi); Usability and KNIME support; Error correction, Read mapping; Metagenomics and bisulphite mapping (BlastX replacement, Bisulphite analysis); Variant detection (SNPs, small indels, insert assembly, split read mapping); Network analysis (Network alignment); Data parallelism (Data parallel iterators, dynamic indices). People: David Weese, Björn Kahlert, Sabrina Krakau, Jialu Hu, Birte Kehr, Stephan Aiche, Leon Kuchenbecker (Charité), Oliver Stolpe (Charité), Manuel Holtgrewe, Enrico Siragusa, René Märker, Jochen Singer, Anne-Katrin Emde, Kathrin Trappe, Temesgen Dadi
  • 57. Prof. Dr. Knut Reinert, Algorithmische Bioinformatik, FB Mathematik und Informatik. THANK YOU for your attention. The OpenMS and SeqAn teams (Berlin, Tübingen, Zürich). The KNIME team (Michael Berthold, Konstanz). NVIDIA (Jacopo Pantaleoni, Jonathan Cohen). www.seqan.de, www.openms.de. SeqAn Nvidia webinar October 22nd 2013 at 9.00 AM Pacific time.