Bioinformatics of TB: A case study in big data

Brief presentation on the challenges and current state of play in the bioinformatics of a pathogen, M. tuberculosis. Presented at the UWC/UCT Big Data workshop in January 2015.

Published in: Health & Medicine

  1. 1. Bioinformatics of TB: A case study in big data. Peter van Heusden (pvh@sanbi.ac.za) and Alan Christoffels, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa. January 2015
  2. 2. The plummeting cost of sequencing
  3. 3. M. tuberculosis: Widespread pathogen, responsible for 1.3 million deaths annually. Genome size ~4 megabases. Illumina NGS sequencing run ~2 gigabytes (uncompressed).
  4. 4. M. tuberculosis: Widespread pathogen, responsible for 1.3 million deaths annually. Genome size ~4 megabases. Illumina NGS sequencing run ~2 gigabytes (uncompressed). Typical student project (2014): 1. Gather data (on hard disk / over network). 2. Run annotation pipeline (compute time < 1 week, disk used 20 to 40 GB). 3. Examine significance of variation compared to “reference sequence”.
  5. 5. What’s coming down the pipe: In South Africa alone we have access to samples from several thousand strains of TB. Low cost of sequencing means: 1. More depth: capture the population of pathogens in a single patient. 2. More length: study the progression of infection in a patient. 3. More breadth: build an in-depth regional or global picture of pathogen sequence.
  6. 6. Mapping a virulent TB strain: “Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage”, Merker et al. (2015). Beijing lineage strains associated with Multi-Drug Resistant (MDR) TB spread worldwide. Studied 4987 isolates, fully sequenced 110 representatives. Mapped 6 clonal complexes and ancestral base sublineage. Paper presents a wealth of different data types: 1. DNA reads 2. Genotyping 3. Phylogeny 4. Geospatial 5. Time series data 6. Metadata on samples and experiments
  7. 7. More data: not more of the same. Existing publishing puts the focus on results, not data. Research data is very seldom FAIR: 1. Findable 2. Accessible 3. Interoperable 4. Reusable (j.mp/fairdata1)
  8. 8. Change data handling, change research results: “In the 21st century, much of the vast volume of scientific data captured by new instruments on a 24/7 basis, along with information generated in the artificial worlds of computer models, is likely to reside forever in a live, substantially publicly accessible, curated state for the purposes of continued analysis. This analysis will result in the development of many new theories!” (Jim Gray) “Big” in “Big Data” is not (only) about data volume. Cheap pathogen sequencing is driving the complexity of questions that can be asked of data ...but only if data is FAIR.
  9. 9. Why we’re not all riding to work on unicorns: “[W]e now have terrible data management tools for most of the science disciplines. . . . When you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful.” (Jim Gray) Who curates your data? How is it managed? Where is it analysed? And who gets access?
  10. 10. Future directions for SANBI (data management) research: Research programme is necessarily modest: 1. Cross-institution authentication, authorisation and movement of data 2. New storage technologies 3. Data repositories in addition to filesystems 4. Storing and querying data on sequence collections, not individual samples
  11. 11. Future directions for SANBI (data management) research: Research programme is necessarily modest: 1. Cross-institution authentication, authorisation and movement of data 2. New storage technologies 3. Data repositories in addition to filesystems 4. Storing and querying data on sequence collections, not individual samples. Individual institutes can only prototype solutions: the scale of the challenge will require much broader collaborative development.
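
As a rough illustration of the scale implied by slides 3 to 5, the sketch below multiplies the quoted ~2 GB (uncompressed) per Illumina run by a number of strains. The strain count of 3000 (standing in for "several thousand"), the one-run-per-strain default, and the function name `raw_storage_tb` are assumptions made for this example, not figures or methods from the presentation.

```python
# Back-of-envelope estimate of raw read storage for a TB strain collection.
# Only RUN_SIZE_BYTES comes from the slides (~2 GB uncompressed per run);
# the strain count and runs-per-strain values below are hypothetical.

RUN_SIZE_BYTES = 2 * 10**9  # ~2 gigabytes per Illumina run (slide 3)


def raw_storage_tb(n_strains, runs_per_strain=1, run_size_bytes=RUN_SIZE_BYTES):
    """Return raw read storage in terabytes (10**12 bytes).

    runs_per_strain > 1 models "more depth" or "more length":
    several sequencing runs for the same patient or strain.
    """
    return n_strains * runs_per_strain * run_size_bytes / 10**12


if __name__ == "__main__":
    n_strains = 3000  # assumed stand-in for "several thousand strains"
    for runs in (1, 5):
        tb = raw_storage_tb(n_strains, runs_per_strain=runs)
        print(f"{n_strains} strains x {runs} run(s): {tb:.1f} TB of raw reads")
```

This covers raw reads only. Intermediate files from an annotation pipeline (the 20 to 40 GB per student project on slide 4) and the other data types listed on slide 6 would add to the total.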