Managing Genomics Data at the Sanger Institute

In this presentation from the DDN User Meeting at SC13, Tim Cutts from the Sanger Institute describes how the Institute wrangles genomics data with DDN storage.

Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

Published in: Technology, Health & Medicine

Managing Genomics Data at the Sanger Institute

  1. Production and Research: Managing Genomics Data at the Sanger Institute. Dr Tim Cutts, Head of Scientific Computing, tjrc@sanger.ac.uk
  2. Background to the Sanger Institute
  3. Potted history: a timeline from the Centre opening in 1993 to 2013, with milestones including the nematode genome completed (1998), the draft human genome (2000), the Human Genome Project completed (2003), the MRSA genome (2004), next-generation sequencing and the start of the 1000 Genomes Project (2008), the UK10K project (begins 2010, ends 2013), the 2 billionth base pair, the opening of the current datacentre, and joining the International Cancer Genome Consortium. Funded by the Wellcome Trust. Sequencing projects increase in scale by 10x every two years. ~17,000 cores of total compute; 22 PB usable storage (~40 PB raw).
  4. Research Programmes: Bioinformatics; Cellular Genetics; Pathogen Genetics; Mouse and Zebrafish Genetics; Human Genetics
  5. Core Facilities: DNA Pipelines; IT; Cellular Generation and Phenotyping; Model Organisms
  6. Idealised data flow
  7. Example: Variation association
  8. Typical data flow: raw data from the sequencer → staging storage → stage data to Lustre → QC and alignment → research analysis → iRODS archival storage → website
  9. Choosing your tech: pick two… Price, Capacity, Performance
  10. Staging storage. Simple scale-out architecture: a server with ~50 TB of direct-attached block storage, one per sequencer (27 sequencers in total), running Samba so the sequencer can upload its data over CIFS. Maximum data from all sequencers is currently 1.7 TB/day. A 1000-core production sequencing cluster reads data from the staging servers over NFS, runs quality checks and alignment to the reference genome, and stores the aligned BAM and/or CRAM files in iRODS (4 PB). (A rough capacity sketch based on these figures follows the transcript.)
  11. iRODS. An object store with arbitrary metadata, plus rules to automate mirroring and other tasks as required. Vendor-agnostic: mostly DDN SFA 10K, with some other vendors' storage as well. An Oracle RAC cluster holds the metadata (the iCAT). Two active-active iRES resource servers sit in different rooms, with 8 Gb FC to the storage and 10 Gb IP networking, each serving a series of 43 TB LVM volumes from two SFA 10Ks in its room. (A hypothetical deposit-and-tag sketch follows the transcript.)
  12. Downstream analysis: aligned sequences flow from iRODS (4 PB) onto Lustre scratch space (13 filesystems) for research analysis on the analysis clusters (~14,000 cores), with completed work going to NFS storage. (A hypothetical query-and-staging sketch follows the transcript.)
  13. Lustre setup. 11 filesystems of 500 TB to 1 PB each; large projects have their own. Exascaler hardware, but our own Lustre install. Aim to deliver 5 MB/sec per core of compute. InfiniBand connects the OSS servers to their OSTs (on SFA10K/12K arrays); the MGS/MDS run on 1U/2U servers with MDTs on an EF3015; clients attach over 10 G Ethernet (10G/40G network). (A rough aggregate-bandwidth calculation follows the transcript.)
  14. Future challenges and directions. iRODS: object storage instead of filesystems (WOS?); filesystems take a long time to fsck; integration with WOS. Clinical use and personalised medicine: security implications; how can we do this in a small laboratory in Africa with terrible power and minimal IT skills? Lustre: upgrade to 2.5 (HSM features); Exascaler needs to be more current. Sequencing technology: nanopore sequencing; use outside the datacentre. Vendor support: integrated support platforms for production systems.
  15. Thank you. The team: Phil Butcher, IT Director; Tim Cutts, Acting Head of Scientific Computing; Guy Coates, Informatics Systems Group Team Leader; Peter Clapham; James Beal; Helen Brimmer; Jon Nicholson, Network Team Leader; Shanthi Sivadasan, DBA Team Leader; numerous bioinformaticians.
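
A rough sizing sketch for the staging tier on slide 10, using only the figures quoted there (~50 TB per staging server, one server per sequencer, 27 sequencers, 1.7 TB/day maximum across all sequencers). Assuming ingest is spread evenly across sequencers is a simplification; real runs will be burstier.

```python
# Back-of-the-envelope staging-tier sizing using the figures on slide 10.
# Assumes ingest is spread evenly across sequencers (a simplification).

TOTAL_INGEST_TB_PER_DAY = 1.7   # maximum data from all sequencers per day
NUM_SEQUENCERS = 27             # one staging server per sequencer
STAGING_TB_PER_SERVER = 50      # ~50 TB direct-attached storage per server

per_sequencer_tb_per_day = TOTAL_INGEST_TB_PER_DAY / NUM_SEQUENCERS
days_of_headroom = STAGING_TB_PER_SERVER / per_sequencer_tb_per_day

print(f"Average ingest per sequencer: {per_sequencer_tb_per_day:.3f} TB/day")
print(f"Headroom per staging server:  ~{days_of_headroom:.0f} days at that rate")
```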
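A minimal, hypothetical sketch of how an aligned file might be deposited into iRODS with searchable metadata, as slides 10 and 11 describe (an object store with arbitrary metadata). The icommands `iput` and `imeta` are the standard iRODS client tools; the zone, paths, and attribute names here are invented for illustration, and an authenticated session (via `iinit`) is assumed.

```python
# Hypothetical deposit of an aligned CRAM into iRODS with key/value metadata.
# iput/imeta are the standard icommands; paths and attributes are invented.
import subprocess

local_file = "sample123.cram"                        # hypothetical aligned output
irods_path = "/examplezone/seq/run42/sample123.cram" # hypothetical data object

# Upload the object; -K asks iRODS to verify the checksum after transfer.
subprocess.run(["iput", "-K", local_file, irods_path], check=True)

# Attach arbitrary metadata to the data object (-d = data object).
for attr, value in [("study", "example_study"),
                    ("sample", "sample123"),
                    ("reference", "GRCh37")]:
    subprocess.run(["imeta", "add", "-d", irods_path, attr, value], check=True)
```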
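A companion sketch for slide 12: locate aligned sequences by metadata and stage them onto Lustre scratch for analysis. Again, the attribute names and directories are hypothetical; `imeta qu` and `iget` are standard icommands, and the output parsing assumes imeta's usual "collection:"/"dataObj:" listing format.

```python
# Hypothetical retrieval for downstream analysis: query iRODS by metadata,
# then stage the matching objects onto Lustre scratch with iget.
import subprocess

result = subprocess.run(
    ["imeta", "qu", "-d", "study", "=", "example_study"],
    capture_output=True, text=True, check=True)

# imeta typically lists each hit as a "collection:" line then a "dataObj:" line.
objects, collection = [], None
for line in result.stdout.splitlines():
    if line.startswith("collection:"):
        collection = line.split(":", 1)[1].strip()
    elif line.startswith("dataObj:") and collection:
        objects.append(f"{collection}/{line.split(':', 1)[1].strip()}")

scratch_dir = "/lustre/scratch_example/project/"     # hypothetical scratch area
for obj in objects:
    subprocess.run(["iget", "-K", obj, scratch_dir], check=True)
```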
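A back-of-the-envelope reading of slide 13's "5 MB/sec per core of compute" target, combined with the core counts quoted on earlier slides (~14,000 analysis cores, ~17,000 total). Treating every core as simultaneously active is of course a simplification.

```python
# Aggregate bandwidth implied by the "5 MB/sec per core of compute" target,
# using core counts quoted on earlier slides; assumes all cores are busy at once.

MB_PER_SEC_PER_CORE = 5
CORE_COUNTS = {"analysis clusters": 14_000, "total compute": 17_000}

for label, cores in CORE_COUNTS.items():
    aggregate_gb_per_sec = cores * MB_PER_SEC_PER_CORE / 1000
    print(f"{label}: ~{aggregate_gb_per_sec:.0f} GB/s aggregate from Lustre")
```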