Next generation genomics: Petascale data in the life sciences

Next Generation Genomics: Petascale data in the life sciences ,[object Object]

Wellcome Trust Sanger Institute

The Sanger Institute ,[object Object]

Based in Hinxton Genome Campus, Cambridge, UK. ,[object Object],[object Object]

We have active cancer, malaria, pathogen and genomic variation / human health studies. ,[object Object],[object Object]

Next-generation Sequencing Life sciences is drowning in data from our new sequencing machines. Traditional sequencing: ,[object Object],Next-generation: sequencing. ,[object Object],Machines are cheap(ish) and small. ,[object Object]

Big labs can afford lots of them.

Economic Trends: ,[object Object]

$500 Million. ,[object Object],[object Object]

Large centres are now doing studies with 1000s and 10,000s of genomes. ,[object Object],[object Object]

$500 genome is probable within 5 years.

Output Trends ,[object Object]

The scary graph Instrument upgrades Peak Yearly capillary sequencing

Managing Growth ,[object Object]

Intermediate analysis typically need 10x disk space of the raw data. ,[object Object],[object Object]

Sequencing cost: T d =12 months

DNA Sequencing TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG TGCACTCCAGCTTGGGTGACACAG CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA ATGAAGTAAATCG ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC 250 Million * 75-108 Base fragments Human Genome (3GBases)

Use more sophisticated algorithms that can do fuzzy matching. ,[object Object]

Typical algorithms are maq, bwa, ssaha, blast. ,[object Object],[object Object]

Larger insertions/deletions/mutations. ,[object Object],[object Object],Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC.... Query: CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCT A GGTCATCACCAGCA

Harder than it looks. ,[object Object],[object Object],[object Object]

Starting scaffold which can then be refined. ,[object Object]

Cancer Genomes ,[object Object]

Mutation Details ,[object Object]

Analysing Cancer Genomes ,[object Object]

Initial mutation disrupts the normal DNA repair/replication processes.

Corruption spreads through the rest of the genome. ,[object Object],[object Object],[object Object],[object Object]

Clinicians will use this information to tailor therapies.

International Cancer Genome Project ,[object Object]

Past Collaborations Data Sequencing Centre + DCC Sequencing centre Sequencing centre Sequencing centre Sequencing centre

Future Collaborations Collaborations are short term: 18 months-3 years. Sequencing centre Sequencing centre Sequencing centre Sequencing centre Federated access

Genomics Data Unstructured data (flat files) Data size per Genome Structured data (databases) Clinical Researchers, non-infomaticians Sequencing informatics specialists Intensities / raw data (2TB) Alignments (200 GB) Sequence + quality data (500 GB) Variation data (1GB) Individual features (3MB)

Where can grid technologies help us? ,[object Object]

Making our software resources available.

Bulk Data Structured data (databases) Unstructured data (flat files) Data size per Genome Sequencing informatics specialists Intensities / raw data (2TB) Alignments (200 GB) Sequence + quality data (500 GB) Variation data (1GB) Individual features (3MB)

Bulk Data Management ,[object Object]

Sequencing machines fed into an automated analysis pipeline.

All the data was tracked, analysed and archived appropriately. ,[object Object],[object Object]

Identifiable data -> private/controlled archives.

Some data held back until journal publication.

Compute farm analysis/QC pipeline Alignment/assembly suckers Data pull ... Final Repository (Oracle) 100TB / yr staging area 500 TB Seq 1 Seq 38

It turn out we were looking in the wrong place ,[object Object]

Investigators take the mass of finished sequence data out of the archives, onto our compute farms and “do stuff”. ,[object Object],[object Object]

... Data pull ... ? Compute farm analysis/QC pipeline assembly/alignment suckers Final Repository (Oracle) 100TB / yr staging area 500TB Seq 1 Seq 38 Compute Farm Compute farm disk Collaberators / 3 rd party sequencing Unmanged LIMS managed data

Accidents waiting to happen... From: <User A> (who left 12 months ago) I find the <project> directory is removed . The original directory is "/scratch/ <User B> (who left 6 months ago) " ..where is it ? If this problem cannot be solved ,I am afriaid that <project> cannot be released.

An idea whose time had come ,[object Object]

Problem exacerbated with student turnover (summer students, PhD students on rotation). ,[object Object],[object Object],[object Object],[object Object]

We want an institute wide, standardised system. ,[object Object]

Produced by DICE (Data Intensive Cyber Environments) groups at U. North Carolina, Chapel Hill.

iRODS ICAT Catalogue database Rule Engine Implements policies Irods Server Data on disk User interface WebDAV, icommands,fuse Irods Server Data in database

Basic Features ,[object Object]

Add query-able metadata to files. ,[object Object],[object Object],[object Object],[object Object]

Fast parallel data transfers across local and wide area network links.

Next generation genomics: Petascale data in the life sciences

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Next generation genomics: Petascale data in the life sciences

Similar to Next generation genomics: Petascale data in the life sciences (20)

Recently uploaded

Recently uploaded (20)

Next generation genomics: Petascale data in the life sciences