Managing Genomics Data at the Sanger Institute

In this presentation from the DDN User Meeting at SC13, Tim Cutts from the Sanger Institute describes how the institute wrangles genomics data with DDN storage.

Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

Transcript

  • 1. Production and Research: Managing Genomics Data at the Sanger Institute
    Dr Tim Cutts, Head of Scientific Computing, tjrc@sanger.ac.uk
  • 2. Background to the Sanger Institute
  • 3. Potted history
    Timeline:
    – 1993: Centre opens
    – 1998: Nematode genome completed
    – 2000: Draft human genome
    – 2003: Human Genome Project completed; 2 billionth base pair
    – 2004: MRSA genome
    – 2005: Current datacentre opens
    – 2008: Next-generation sequencing; 1000 Genomes Project begins
    – 2009: Joins International Cancer Genome Consortium
    – 2010: UK10K project begins
    – 2013: UK10K project ends
    • Funded by the Wellcome Trust
    • Sequencing projects increase in scale by 10x every two years
    • ~17000 cores of total compute
    • 22PB usable storage (~40PB raw)
  • 4. Research Programmes
    Bioinformatics, Cellular Genetics, Pathogen Genetics, Mouse and Zebrafish Genetics, Human Genetics
  • 5. Core Facilities
    DNA Pipelines, IT, Cellular Generation and Phenotyping, Model Organisms
  • 6. Idealised data flow
  • 7. Example: Variation association
  • 8. Typical data flow
    [Diagram: raw data from the sequencer is staged to staging storage; QC and alignment write aligned data into iRODS archival storage; research analysis runs against Lustre scratch; results are published to the website.]
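    The flow on this slide reduces to a handful of named stages, each reading from one storage tier and writing to the next. A minimal Python sketch of that mapping (the stage and tier names come from the slide; the Stage structure itself is only an illustration, not Sanger's actual orchestration code):

      from dataclasses import dataclass

      @dataclass
      class Stage:
          name: str        # processing step from the slide
          reads_from: str  # storage tier the step consumes
          writes_to: str   # storage tier the step produces into

      # Typical data flow as described on the slide: sequencer -> staging ->
      # QC/alignment -> iRODS archive -> Lustre scratch for research -> website.
      PIPELINE = [
          Stage("stage raw data",    "sequencer",        "staging storage"),
          Stage("QC and alignment",  "staging storage",  "iRODS (archive)"),
          Stage("research analysis", "iRODS (archive)",  "Lustre scratch"),
          Stage("publish results",   "Lustre scratch",   "website"),
      ]

      if __name__ == "__main__":
          for step in PIPELINE:
              print(f"{step.name}: {step.reads_from} -> {step.writes_to}")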
  • 9. Choosing your tech: Pick two…
    Price, Capacity, Performance
  • 10. Staging storage
    Simple scale-out architecture:
    – Server with ~50TB direct-attached block storage
    – One per sequencer
    – Running SAMBA for upload from the sequencer
    Maximum data from all sequencers is currently 1.7 TB/day.
    A 1000-core cluster reads data from the staging servers over NFS:
    – Quality checks
    – Alignment to the reference genome
    – Store aligned BAM and/or CRAM files in iRODS
    [Diagram: each of the 27 next-generation sequencers sends sequence data over CIFS to its own CIFS/NFS staging server (50TB); the production sequencing cluster (1000 cores) reads over NFS for QC and alignment and writes aligned BAM files into iRODS (4PB).]
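    The headline numbers on this slide translate into fairly modest per-server rates, which is why simple direct-attached staging servers are enough. A quick back-of-the-envelope check in Python (the 1.7 TB/day and 27-sequencer figures come from the slide; the even split across sequencers is an assumption for illustration):

      # Figures from the slide
      TOTAL_TB_PER_DAY = 1.7      # maximum data from all sequencers combined
      NUM_SEQUENCERS   = 27       # one ~50TB staging server per sequencer
      SECONDS_PER_DAY  = 86_400

      # Aggregate ingest rate across all staging servers
      total_mb_per_s = TOTAL_TB_PER_DAY * 1e6 / SECONDS_PER_DAY
      print(f"aggregate ingest: ~{total_mb_per_s:.0f} MB/s")      # ~20 MB/s

      # Assuming (for illustration only) an even split across sequencers
      per_seq_mb_per_s = total_mb_per_s / NUM_SEQUENCERS
      print(f"per sequencer:    ~{per_seq_mb_per_s:.1f} MB/s")    # <1 MB/s

      # Days of headroom a 50TB staging server gives one sequencer at that rate
      print(f"headroom per server: ~{50 / (TOTAL_TB_PER_DAY / NUM_SEQUENCERS):.0f} days")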
  • 11. iRODS
    – Object store with arbitrary metadata
    – Rules to automate mirroring and other tasks as required
    – Vendor-agnostic: mostly DDN SFA10K, with some other vendors' storage as well
    – Oracle RAC cluster holds the metadata (iCAT)
    – Two active-active iRES resource servers in different rooms: 8Gb FC to storage, 10Gb IP
    – Series of 43TB LVM volumes from 2x SFA10K in each room
    [Diagram: iCAT (Oracle RAC) and the iRODS server front two iRES resource servers, each serving 43TB volumes from SFA10K arrays plus other vendors' storage.]
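    In practice the production pipeline drops aligned files into iRODS and tags them with metadata so they can be found later, with replication handled by iRODS rules. A hedged sketch of what that registration step could look like using the standard icommands (iput, imeta, irepl); the paths, collection, metadata attributes, and resource name are invented for illustration and are not Sanger's actual schema:

      import subprocess

      def irods_archive(local_bam, collection, metadata):
          """Upload a BAM/CRAM into an iRODS collection and attach metadata.

          Wraps the standard icommands; assumes `iinit` has been run and the
          target collection already exists (e.g. created with `imkdir`).
          """
          obj = f"{collection}/{local_bam.rsplit('/', 1)[-1]}"

          # Upload the file into the archive collection (-K: verify checksum)
          subprocess.run(["iput", "-K", local_bam, obj], check=True)

          # Attach arbitrary key/value metadata to the data object
          for attr, value in metadata.items():
              subprocess.run(["imeta", "add", "-d", obj, attr, str(value)], check=True)

          # Ask iRODS to replicate onto a second resource; mirroring can also be
          # driven automatically by server-side rules, as the slide describes.
          subprocess.run(["irepl", "-R", "secondRescRoom2", obj], check=True)

      # Hypothetical usage -- all names are placeholders
      irods_archive(
          "/staging/run123/sample01.bam",
          "/Sanger/archive/run123",
          {"sample": "sample01", "run": "run123", "aligner": "bwa"},
      )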
  • 12. Downstream analysis
    [Diagram: aligned sequences are pulled from iRODS (4PB) by the analysis clusters (~14000 cores), which use Lustre scratch space (13 filesystems) for research analysis; completed work goes to NFS storage.]
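    The downstream side is the mirror image: analysis jobs look up the data objects they need by metadata and pull them out of iRODS onto Lustre scratch before running. A minimal sketch using `imeta qu` and `iget`; the query attribute, scratch path, and parsing of imeta's usual "collection:/dataObj:" output are assumptions for illustration:

      import subprocess

      def fetch_sample_to_scratch(sample_id, scratch_dir="/lustre/scratch01/project"):
          """Find data objects tagged with a sample id and copy them to Lustre scratch."""
          # Query the iCAT for data objects carrying the given metadata value
          result = subprocess.run(
              ["imeta", "qu", "-d", "sample", "=", sample_id],
              check=True, capture_output=True, text=True,
          )

          # `imeta qu -d` prints pairs of "collection:" and "dataObj:" lines;
          # stitch them back together into full object paths.
          collection, objects = None, []
          for line in result.stdout.splitlines():
              if line.startswith("collection:"):
                  collection = line.split(":", 1)[1].strip()
              elif line.startswith("dataObj:") and collection:
                  objects.append(f"{collection}/{line.split(':', 1)[1].strip()}")

          # Stage each object onto Lustre scratch for the analysis job (-K: verify checksum)
          for obj in objects:
              subprocess.run(["iget", "-K", obj, scratch_dir], check=True)
          return objects

      # Hypothetical usage
      fetch_sample_to_scratch("sample01")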
  • 13. Lustre setup
    – 11 filesystems, 500TB/1PB each; large projects have their own
    – Exascaler hardware… but our own Lustre install
    – Aim to deliver 5MB/sec per core of compute
    – IB-connected OSS-OST
    – 10G Ethernet to clients
    [Diagram: EF3015 holding MGS/MDS and MDTs on 1/2U servers; OSSes with OSTs on SFA10K/12K over InfiniBand; clients attached via the 10G/40G network.]
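    The 5 MB/sec-per-core target on this slide, combined with the cluster size from the previous slide, gives a feel for the aggregate bandwidth the Lustre filesystems have to sustain. A quick worked calculation (the even spread across filesystems is an assumption for illustration; in reality large projects get their own filesystem):

      # Figures from the slides
      CORES             = 14_000   # analysis cluster size (slide 12)
      MB_PER_S_PER_CORE = 5        # delivery target per core (this slide)
      NUM_FILESYSTEMS   = 11       # Lustre filesystems (this slide)

      aggregate_gb_per_s = CORES * MB_PER_S_PER_CORE / 1000
      print(f"aggregate target: ~{aggregate_gb_per_s:.0f} GB/s")   # ~70 GB/s

      # If load were spread evenly (an assumption for illustration only)
      per_fs_gb_per_s = aggregate_gb_per_s / NUM_FILESYSTEMS
      print(f"per filesystem:   ~{per_fs_gb_per_s:.1f} GB/s")      # ~6.4 GB/s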
  • 14. Future challenges and directions
    iRODS
    • Object storage instead of filesystems (WOS?)
    • File systems take a long time to fsck
    • Integration with WOS
    Clinical use and personalised medicine
    • Security implications
    • How can we do this in a small laboratory in Africa with terrible power and minimal IT skills?
    Lustre
    • Upgrade to 2.5 (HSM features)
    • Exascaler needs to be more current
    Sequencing technology
    • Nanopore sequencing
    • Use outside the datacentre
    Vendor support
    • Integrated support platforms for production systems
  • 15. Thank you
    The team:
    – Phil Butcher, IT Director
    – Tim Cutts, Acting Head of Scientific Computing
    – Guy Coates, Informatics Systems Group Team Leader
    – Peter Clapham
    – James Beal
    – Helen Brimmer
    – Jon Nicholson, Network Team Leader
    – Shanthi Sivadasan, DBA Team Leader
    – Numerous bioinformaticians