Managing Genomics Data at the Sanger Institute

In this presentation from the DDN User Meeting at SC13, Tim Cutts from the Sanger Institute describes how the institute wrangles genomics data with DDN storage.

Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

Transcript

  • 1. Production and Research: Managing Genomics Data at the Sanger Institute
    Dr Tim Cutts, Head of Scientific Computing, tjrc@sanger.ac.uk
  • 2. Background to the Sanger Institute
  • 3. Potted history
    Timeline:
    – 1993: Centre opens
    – 1998: Nematode genome completed
    – 2000: Draft human genome
    – 2003: Human Genome Project completed; 2 billionth base pair
    – 2004: MRSA genome
    – 2005: Current datacentre opens
    – 2008: Next-generation sequencing; 1000 Genomes Project begins
    – 2009: Joins International Cancer Genome Consortium
    – 2010: UK10K project begins
    – 2013: UK10K project ends
    • Funded by the Wellcome Trust
    • Sequencing projects increase in scale by 10x every two years
    • ~17000 cores of total compute
    • 22PB usable storage (~40PB raw)
  • 4. Research Programmes
    Bioinformatics, Cellular Genetics, Pathogen Genetics, Mouse and Zebrafish Genetics, Human Genetics
  • 5. Core Facilities
    DNA Pipelines, IT, Cellular Generation and Phenotyping, Model Organisms
  • 6. Idealised data flow
  • 7. Example: Variation association
  • 8. Typical data flow
    [Diagram: raw data from the sequencer is staged to staging storage; QC and alignment write aligned data into iRODS archival storage; research analysis runs against Lustre scratch; results are published to the website.]
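    The flow on this slide reduces to a handful of named stages, each reading from one storage tier and writing to the next. A minimal Python sketch of that mapping (the stage and tier names come from the slide; the Stage structure itself is only an illustration, not Sanger's actual orchestration code):

      from dataclasses import dataclass

      @dataclass
      class Stage:
          name: str        # processing step from the slide
          reads_from: str  # storage tier the step consumes
          writes_to: str   # storage tier the step produces into

      # Typical data flow as described on the slide: sequencer -> staging ->
      # QC/alignment -> iRODS archive -> Lustre scratch for research -> website.
      PIPELINE = [
          Stage("stage raw data",    "sequencer",        "staging storage"),
          Stage("QC and alignment",  "staging storage",  "iRODS (archive)"),
          Stage("research analysis", "iRODS (archive)",  "Lustre scratch"),
          Stage("publish results",   "Lustre scratch",   "website"),
      ]

      if __name__ == "__main__":
          for step in PIPELINE:
              print(f"{step.name}: {step.reads_from} -> {step.writes_to}")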
  • 9. Choosing your tech: Pick two…
    Price, Capacity, Performance
  • 10. Staging storage
    Simple scale-out architecture:
    – Server with ~50TB direct-attached block storage
    – One per sequencer
    – Running SAMBA for upload from the sequencer
    Maximum data from all sequencers is currently 1.7 TB/day.
    A 1000-core cluster reads data from the staging servers over NFS:
    – Quality checks
    – Alignment to the reference genome
    – Store aligned BAM and/or CRAM files in iRODS
    [Diagram: each of the 27 next-generation sequencers sends sequence data over CIFS to its own CIFS/NFS staging server (50TB); the production sequencing cluster (1000 cores) reads over NFS for QC and alignment and writes aligned BAM files into iRODS (4PB).]
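    The headline numbers on this slide translate into fairly modest per-server rates, which is why simple direct-attached staging servers are enough. A quick back-of-the-envelope check in Python (the 1.7 TB/day and 27-sequencer figures come from the slide; the even split across sequencers is an assumption for illustration):

      # Figures from the slide
      TOTAL_TB_PER_DAY = 1.7      # maximum data from all sequencers combined
      NUM_SEQUENCERS   = 27       # one ~50TB staging server per sequencer
      SECONDS_PER_DAY  = 86_400

      # Aggregate ingest rate across all staging servers
      total_mb_per_s = TOTAL_TB_PER_DAY * 1e6 / SECONDS_PER_DAY
      print(f"aggregate ingest: ~{total_mb_per_s:.0f} MB/s")      # ~20 MB/s

      # Assuming (for illustration only) an even split across sequencers
      per_seq_mb_per_s = total_mb_per_s / NUM_SEQUENCERS
      print(f"per sequencer:    ~{per_seq_mb_per_s:.1f} MB/s")    # <1 MB/s

      # Days of headroom a 50TB staging server gives one sequencer at that rate
      print(f"headroom per server: ~{50 / (TOTAL_TB_PER_DAY / NUM_SEQUENCERS):.0f} days")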
  • 11. iRODS
    – Object store with arbitrary metadata
    – Rules to automate mirroring and other tasks as required
    – Vendor-agnostic: mostly DDN SFA10K, with some other vendors' storage as well
    – Oracle RAC cluster holds the metadata (iCAT)
    – Two active-active iRES resource servers in different rooms: 8Gb FC to storage, 10Gb IP
    – Series of 43TB LVM volumes from 2x SFA10K in each room
    [Diagram: iCAT (Oracle RAC) and the iRODS server front two iRES resource servers, each serving 43TB volumes from SFA10K arrays plus other vendors' storage.]
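    In practice the production pipeline drops aligned files into iRODS and tags them with metadata so they can be found later, with replication handled by iRODS rules. A hedged sketch of what that registration step could look like using the standard icommands (iput, imeta, irepl); the paths, collection, metadata attributes, and resource name are invented for illustration and are not Sanger's actual schema:

      import subprocess

      def irods_archive(local_bam, collection, metadata):
          """Upload a BAM/CRAM into an iRODS collection and attach metadata.

          Wraps the standard icommands; assumes `iinit` has been run and the
          target collection already exists (e.g. created with `imkdir`).
          """
          obj = f"{collection}/{local_bam.rsplit('/', 1)[-1]}"

          # Upload the file into the archive collection (-K: verify checksum)
          subprocess.run(["iput", "-K", local_bam, obj], check=True)

          # Attach arbitrary key/value metadata to the data object
          for attr, value in metadata.items():
              subprocess.run(["imeta", "add", "-d", obj, attr, str(value)], check=True)

          # Ask iRODS to replicate onto a second resource; mirroring can also be
          # driven automatically by server-side rules, as the slide describes.
          subprocess.run(["irepl", "-R", "secondRescRoom2", obj], check=True)

      # Hypothetical usage -- all names are placeholders
      irods_archive(
          "/staging/run123/sample01.bam",
          "/Sanger/archive/run123",
          {"sample": "sample01", "run": "run123", "aligner": "bwa"},
      )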
  • 12. Downstream analysis
    [Diagram: aligned sequences are pulled from iRODS (4PB) by the analysis clusters (~14000 cores), which use Lustre scratch space (13 filesystems) for research analysis; completed work goes to NFS storage.]
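    The downstream side is the mirror image: analysis jobs look up the data objects they need by metadata and pull them out of iRODS onto Lustre scratch before running. A minimal sketch using `imeta qu` and `iget`; the query attribute, scratch path, and parsing of imeta's usual "collection:/dataObj:" output are assumptions for illustration:

      import subprocess

      def fetch_sample_to_scratch(sample_id, scratch_dir="/lustre/scratch01/project"):
          """Find data objects tagged with a sample id and copy them to Lustre scratch."""
          # Query the iCAT for data objects carrying the given metadata value
          result = subprocess.run(
              ["imeta", "qu", "-d", "sample", "=", sample_id],
              check=True, capture_output=True, text=True,
          )

          # `imeta qu -d` prints pairs of "collection:" and "dataObj:" lines;
          # stitch them back together into full object paths.
          collection, objects = None, []
          for line in result.stdout.splitlines():
              if line.startswith("collection:"):
                  collection = line.split(":", 1)[1].strip()
              elif line.startswith("dataObj:") and collection:
                  objects.append(f"{collection}/{line.split(':', 1)[1].strip()}")

          # Stage each object onto Lustre scratch for the analysis job (-K: verify checksum)
          for obj in objects:
              subprocess.run(["iget", "-K", obj, scratch_dir], check=True)
          return objects

      # Hypothetical usage
      fetch_sample_to_scratch("sample01")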
  • 13. Lustre setup
    – 11 filesystems, 500TB/1PB each; large projects have their own
    – Exascaler hardware… but our own Lustre install
    – Aim to deliver 5MB/sec per core of compute
    – IB-connected OSS-OST
    – 10G Ethernet to clients
    [Diagram: EF3015 holding MGS/MDS and MDTs on 1/2U servers; OSSes with OSTs on SFA10K/12K over InfiniBand; clients attached via the 10G/40G network.]
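    The 5 MB/sec-per-core target on this slide, combined with the cluster size from the previous slide, gives a feel for the aggregate bandwidth the Lustre filesystems have to sustain. A quick worked calculation (the even spread across filesystems is an assumption for illustration; in reality large projects get their own filesystem):

      # Figures from the slides
      CORES             = 14_000   # analysis cluster size (slide 12)
      MB_PER_S_PER_CORE = 5        # delivery target per core (this slide)
      NUM_FILESYSTEMS   = 11       # Lustre filesystems (this slide)

      aggregate_gb_per_s = CORES * MB_PER_S_PER_CORE / 1000
      print(f"aggregate target: ~{aggregate_gb_per_s:.0f} GB/s")   # ~70 GB/s

      # If load were spread evenly (an assumption for illustration only)
      per_fs_gb_per_s = aggregate_gb_per_s / NUM_FILESYSTEMS
      print(f"per filesystem:   ~{per_fs_gb_per_s:.1f} GB/s")      # ~6.4 GB/s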
  • 14. Future challenges and directions
    iRODS
    • Object storage instead of filesystems (WOS?)
    • File systems take a long time to fsck
    • Integration with WOS
    Clinical use and personalised medicine
    • Security implications
    • How can we do this in a small laboratory in Africa with terrible power and minimal IT skills?
    Lustre
    • Upgrade to 2.5 (HSM features)
    • Exascaler needs to be more current
    Sequencing technology
    • Nanopore sequencing
    • Use outside the datacentre
    Vendor support
    • Integrated support platforms for production systems
  • 15. Thank you
    The team:
    – Phil Butcher, IT Director
    – Tim Cutts, Acting Head of Scientific Computing
    – Guy Coates, Informatics Systems Group Team Leader
    – Peter Clapham
    – James Beal
    – Helen Brimmer
    – Jon Nicholson, Network Team Leader
    – Shanthi Sivadasan, DBA Team Leader
    – Numerous bioinformaticians