Challenges and Opportunities of Big Data Genomics


Published in: Technology

  1. Challenges and Opportunities of Big Data Genomics
     Yasin Memari, Wellcome Trust Sanger Institute, January 2014
  2. Outline
     • Big data genomics: hype or reality?
     • Limitations of big data analysis
     • Hardware and software solutions
     • Bioinformatics using MapReduce
     • Hadoop Distributed File System
     • Cloud computing for genomics
     • Configuring VRPipe in the cloud
     • Lessons from cloud computing
     • A unified bioinformatics platform
  3. Big Data Genomics: Hype or reality?
     • The bottleneck in sequencing has moved from data generation to data handling.
     • The world's sequencing capacity stood at 15 PB in 2013 and is expected to double every year.
     • 10 petabytes of storage are required for 100,000 human genomes (50X, ~100 GB each).
     • Storing each genome costs $100 per year.
     • A data deluge is inevitable in the interim as sequencing becomes cheaper.
     • In the long term, DNA itself is a better storage medium!
     • Throughput from metagenomic and single-cell sequencing will rapidly outpace hard gains in compression.
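
The storage figures on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 PB = 1,000,000 GB) and a constant yearly doubling, and the function names are hypothetical.

```python
# Figures quoted on the slide. Decimal units are an assumption.
GENOME_SIZE_GB = 100             # one 50X human genome
COST_PER_GENOME_PER_YEAR = 100   # USD
GB_PER_PB = 1_000_000


def storage_pb(n_genomes, genome_gb=GENOME_SIZE_GB):
    """Raw storage needed for n_genomes, in petabytes."""
    return n_genomes * genome_gb / GB_PER_PB


def annual_storage_cost(n_genomes, cost=COST_PER_GENOME_PER_YEAR):
    """Yearly storage cost in USD at a flat per-genome rate."""
    return n_genomes * cost


def projected_capacity_pb(base_pb, years, growth=2.0):
    """Capacity after `years` if it multiplies by `growth` every year."""
    return base_pb * growth ** years


if __name__ == "__main__":
    print(storage_pb(100_000))            # petabytes for 100,000 genomes
    print(annual_storage_cost(100_000))   # USD per year
    print(projected_capacity_pb(15, 3))   # PB, three years after 2013
```

With the slide's numbers, 100,000 genomes at 100 GB each come to 10 PB, which matches the figure quoted above.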
  4. Use case scenario: Run lobSTR on the above datasets to understand variation at Short Tandem Repeats on a genome-wide and population-wide scale and how they contribute to phenotypic variation.
  5. "D and A" Model
     Download/transfer and Analyze:
     • I/O-intensive jobs can overload NAS fileservers.
     • High-performance file systems provide fast access to data for multiple clients.
     • Network performance is the limiting factor for big data.
     [Figures: Filesystem load; Sanger's farm data flow]
  6. Compress the data!
     High coverage equals high redundancy!
     • Images/TIFF files are no longer in use.
     • No intermediate fastq files.
     • Bcl and locs/clocs -> bam (directly)
     • BAM is being replaced by CRAM (30% reduction in size).
     • Discard the read data every 5-10 years!
     • More compression? Smooth out sequencing errors, normalize the coverage, down-sample, etc.
  7. How can we improve storage performance?
     • Scale-out architectures are still costly and impractical, e.g. scale-out NAS ($1000/TB) or SAN over Fibre Channel.
     • Solid-state drives (SSDs) are being used to enhance cache memory and IOPS performance.
     • Hybrid storage systems integrate SSDs into traditional HDD-based storage arrays as a first tier of storage.
     • Avere FXT and Nexsan NST store warm data in SSDs for storage acceleration and migrate cold data to powered-down drives.
     • Fast random access can be achieved by storing metadata in flash SSDs. Limited gain for sequential access!
     • Alternatively, archive the data in cheap object stores in the cloud, but invest in bandwidth!
  8. What can be done about network latency?
     • Use high-performance network protocols (e.g. UDP-based UDT) to achieve higher speeds than can be achieved with TCP.
     • Aspera's fasp accelerates transfers in high-latency, high-loss networks where the transport protocol is a bottleneck.
     • Transmission rates can be enhanced using multiple concurrent transfers (multi-part downloads):
     • GeneTorrent is a file transfer client application based on BitTorrent technology (up to 200 MB/s over the internet).
     • GridFTP (implemented in the Globus toolkit) enables reliable, high-speed transmission of very large files (up to ~800 MB/s when scp is 17 MB/s).
     • High-speed internet connection via StarLight/Internet2? Firewall and network security problems.
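
The multi-part idea above can be sketched as follows: split a file into byte ranges and fetch the parts concurrently. This is a simulation over an in-memory byte string, not a description of how GeneTorrent or GridFTP work internally. In a real transfer each part would be a separate network request, and `fetch_part` and `parallel_copy` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_ranges(file_size, n_parts):
    """Return inclusive (start, end) byte ranges that cover file_size bytes."""
    if file_size <= 0 or n_parts <= 0:
        raise ValueError("file_size and n_parts must be positive")
    n_parts = min(n_parts, file_size)
    base, extra = divmod(file_size, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def fetch_part(source, byte_range):
    """Stand-in for one ranged transfer; here it just slices local bytes."""
    start, end = byte_range
    return source[start:end + 1]


def parallel_copy(source, n_parts):
    """Fetch all parts concurrently and reassemble them in order."""
    ranges = split_into_ranges(len(source), n_parts)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        parts = list(pool.map(lambda r: fetch_part(source, r), ranges))
    return b"".join(parts)


if __name__ == "__main__":
    data = b"ACGT" * 1000
    print(split_into_ranges(10, 3))
    print(parallel_copy(data, 8) == data)
```

`pool.map` returns results in the order of the input ranges, so the parts can be joined without sorting.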
  9. Alternative Models
     What types of analyses do we run in genomics?
     • Embarrassingly parallel algorithms: most sequence analysis software has distributed solutions, e.g. alignment, imputation, etc. Use genome chunking and run in batches!
     • Tightly coupled algorithms: some require message passing or shared memory, e.g. genome assembly, pathway analysis.
     Forms of parallelism:
     • Task parallelism: distribute the execution threads across different nodes.
     • Data parallelism: distribute the data across different execution nodes.
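
Genome chunking, as mentioned above, amounts to cutting each chromosome into fixed-size regions that can be handed to independent jobs. A minimal sketch, assuming 1-based inclusive coordinates and using toy chromosome lengths that are purely illustrative:

```python
def chunk_genome(chrom_lengths, chunk_size):
    """Yield (chrom, start, end) regions, 1-based and inclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            yield chrom, start, min(start + chunk_size - 1, length)


def region_string(chrom, start, end):
    """Format a region in the common chrom:start-end style."""
    return f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    # Toy lengths for illustration only, not real chromosome sizes.
    toy_lengths = {"chr1": 2500, "chr2": 1000}
    for region in chunk_genome(toy_lengths, 1000):
        print(region_string(*region))
```

Each region string could then be passed to a separate batch job, which is what makes the workload embarrassingly parallel.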
  10. Healthcare data need to be stored and analyzed centrally!
  11. MapReduce Framework
      A distributed solution to a data-centric problem:
      • Map: divide the problem into smaller chunks and send each compute task to where the data resides.
      • Reduce: collect the answers to each sub-problem and combine the results.
  12. Example: K-mer Counting (slide credit: Michael Schatz)
      • Application developers focus on 2 (+1 internal) functions:
        - Map: input -> key:value pairs
        - Shuffle: group together pairs with the same key
        - Reduce: key, value-lists -> output
      • Map, Shuffle and Reduce all run in parallel.
      • Worked example (k = 3):
        Map:
          ATGAACCTTA -> (ATG:1) (TGA:1) (GAA:1) (AAC:1) (ACC:1) (CCT:1) (CTT:1) (TTA:1)
          GAACAACTTA -> (GAA:1) (AAC:1) (ACA:1) (CAA:1) (AAC:1) (ACT:1) (CTT:1) (TTA:1)
          TTTAGGCAAC -> (TTT:1) (TTA:1) (TAG:1) (AGG:1) (GGC:1) (GCA:1) (CAA:1) (AAC:1)
        Shuffle -> Reduce (first group):
          ACA -> 1 -> ACA:1;  ATG -> 1 -> ATG:1;  CAA -> 1,1 -> CAA:2
          GCA -> 1 -> GCA:1;  TGA -> 1 -> TGA:1;  TTA -> 1,1,1 -> TTA:3
        Shuffle -> Reduce (second group):
          ACT -> 1 -> ACT:1;  AGG -> 1 -> AGG:1;  CCT -> 1 -> CCT:1
          GGC -> 1 -> GGC:1;  TTT -> 1 -> TTT:1
        Shuffle -> Reduce (third group):
          AAC -> 1,1,1,1 -> AAC:4;  ACC -> 1 -> ACC:1;  CTT -> 1,1 -> CTT:2
          GAA -> 1,1 -> GAA:2;  TAG -> 1 -> TAG:1
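
The three phases of the k-mer example can be mimicked in a single process. The sketch below is not Hadoop code and runs nothing in parallel. It only shows the data flow, using the three reads from the slide, and the function names are illustrative.

```python
from collections import defaultdict


def map_kmers(read, k):
    """Map: emit a (k-mer, 1) pair for every position in the read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce: sum the value list of each key."""
    return {kmer: sum(values) for kmer, values in groups.items()}


def count_kmers(reads, k):
    """Chain map, shuffle and reduce over a list of reads."""
    pairs = [pair for read in reads for pair in map_kmers(read, k)]
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
    for kmer, count in sorted(count_kmers(reads, 3).items()):
        print(f"{kmer}:{count}")
```

In a real MapReduce engine the map calls would run on the nodes holding each read, and the shuffle would route every key to a single reducer.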
  13. Hadoop Distributed File System (HDFS)
      • Apache Hadoop is an open-source implementation of Google's MapReduce and Google File System (GFS).
      • A highly reliable and scalable solution for storing and processing massive data using cheap commodity hardware.
      • Optimised for high-throughput access to data. Data is replicated for fault tolerance.
  14. HDFS vs Lustre
      Hadoop:
      • Data is local: data nodes act as compute nodes.
      • I/O is not very relevant here, although it can be improved by concurrency.
      • Optimised for batch processing.
      • Single-node bottlenecks or name node failures.
      Lustre:
      • Data is shared: compute clients talk to object store servers.
      • High aggregate I/O can be achieved with striping.
      • Optimised for HPC. Used in Top500!
      • The bottleneck is getting the data onto Lustre!
      [Diagram: Hadoop clients, each a CPU with direct-attached storage (DAS), linked by the network; Lustre clients linked by the network to object storage servers (OSS), each backed by object storage targets (OST)]
  15. Bioinformatic Tools for Hadoop
      Suites of tools actively under development:
      • SeqPig: a library which utilizes Apache Pig to translate sequence data analysis into a sequence of MapReduce jobs.
      • Seal: a collection of distributed applications for alignment and manipulation of short-read sequence data.
      • SeqWare: a toolkit for building high-throughput sequencing data analysis workflows in cloud-based environments.
      • And many algorithms for sequence mapping (CloudAligner), SNP calling (Crossbow), de novo assembly (Contrail), peak calling (PeakRanger) and RNA-Seq data analysis (Eoulsan, FX and Myrna).
  16. Hardware Virtualization
      Virtualization increases utilization of costly hardware:
      • Entire workflows are packaged as virtual machines (VMs) residing in the SAN. VMs are sent to hypervisors for execution.
      [Diagram: hardware (CPU, memory, etc.) hosting a hypervisor (Xen, Hyper-V, etc.) that runs VMs, each with its own OS and apps, alongside a management console; VMs are stored on a storage-area network (SAN) reached over Fibre Channel]
  17. Cloud Computing
      What does the AWS cloud have to offer?
      • Networking: Direct Connect, Virtual Private Cloud (VPC), Route 53
      • Compute: Elastic Compute Cloud (EC2), Elastic MapReduce
      • Storage: Simple Storage Service (S3), Glacier, Storage Gateway, CloudFront
      • Database: Relational Database Service (RDS), DynamoDB, ElastiCache, Redshift
      • Management: Identity and Access Management (IAM), CloudWatch, CloudFormation, Elastic Beanstalk
  18. Network Performance within Amazon
      • Bandwidth within AWS is far too low for moving big genome data.
      • Experiments achieved at most 70-80 MB/s between two EC2 instances and 10-20 MB/s between EC2 and S3.
      • Download from S3 to EC2 is unreliable and constrained because data ingestion runs over HTTP.
      • Gigabit Ethernet in EC2 is only available with cluster instances.
      • Enhanced networking using network virtualization may provide higher I/O performance.
      • CloudFront, Amazon's content delivery service, provides streaming at HD rates only.
      • AWS Data Pipeline is not up to the task of big data workflows.
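
To put these rates in perspective, the sketch below estimates how long a single ~100 GB genome (the size quoted earlier in the deck) would take to move at the measured speeds. It assumes decimal units (1 GB = 1000 MB) and a sustained rate with no protocol overhead, so real transfers would likely be slower. `transfer_hours` is a hypothetical helper.

```python
MB_PER_GB = 1000  # decimal units, an assumption


def transfer_hours(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb at a sustained rate in MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_gb * MB_PER_GB / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    genome_gb = 100
    # Rates quoted on the slide, in MB/s.
    for label, rate in [("EC2 <-> S3, low", 10), ("EC2 <-> S3, high", 20),
                        ("EC2 <-> EC2, low", 70), ("EC2 <-> EC2, high", 80)]:
        print(f"{label}: {transfer_hours(genome_gb, rate):.2f} h")
```

At 10 MB/s one genome takes close to three hours, which illustrates why the deck treats data movement as the limiting step.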
  19. VRPipe in the Cloud
      To deploy VRPipe in the cloud, the following requirements must be satisfied (Sendu Bala):
      • Set up a database management system (DBMS) for VRPipe in AWS RDS (or use SQLite or a locally installed MySQL database).
      • Create a distributed file system to provide shared access to software and data (adjust for speed or redundancy).
      • Configure VRPipe and provide the required permissions and security credentials.
      • Install and configure a job scheduling system supported by VRPipe, e.g. SGE or LSF.
      https://-pipe/wiki
  20. Testing VRPipe in AWS Cloud
      Alignment and calling of 110 Phase 3 YRI exomes (~1.1 TB):
      (sequence.index) ->
        2. 1000genomes_illumina_mapping_with_improvement
        27. bam_merge_lanes_and_fix_rgs
        61. snp_calling_mpileup
        59. snp_calling_gatk_unified_genotyper_and_annotate
        89. vcf_gatk_filter
        90. vcf_merge
        93. vcf_vep_annotate
      • Set up a GlusterFS volume using EBS blocks attached to EC2 instances.
      • Enable Elastic Load Balancing within the VPC and grant r/w privileges to the DBMS.
      • Optionally use SGE job scheduling in conjunction with EC2 load balancing.
  21. Lessons from AWS Cloud
      • The bulk of the cloud is made of general-purpose hardware suitable for enterprise computing.
      • Scientific applications require compute-optimised HPC platforms and high-speed I/O and storage.
      • On-demand services are expensive, but large organizations may benefit from economies of scale!?
      • As this is a self-service environment, the user has to handle sysadmin tasks, including provisioning and configuration.
      • EC2 cannot compute directly against S3 (high-I/O tasks), which recalls the same "D and A" problem!
      • Elastic MapReduce (EMR) runs on EC2 instances, with ephemeral disks used to build HDFS, so data need to be streamed in and out of S3.
      • Virtualization imposes performance penalties because the available physical resources are shared among VMs.
  22. Bio-cloud Prototypes
      • The EBI has developed an in-house cloud for public sequence repositories such as the European Genome-Phenome Archive (EGA).
      • The National Center for Biotechnology Information is working on cloud implementations for storing genomic data such as dbGaP.
      • The Beijing Genomics Institute has developed five bio-cloud computing centers in different locations that store and process genomes.
      • The US National Cancer Institute maintains the Cancer Genome Hub (CGHub), a system for storing large genome data.
      • The Broad Institute has instantiated its analysis pipeline for germline and cancer somatic data on commercial cloud environments.
      • The AMP Lab at UC Berkeley has developed and is deploying its genome analysis pipeline on commercial cloud environments.
      • Illumina uploads data directly to the cloud, where it has created a platform for sequence analysis called BaseSpace.
      Source: Global Alliance White Paper, 3 June 2013
  23. Data/Pipeline Sharing
      • Grid computing in the cloud enables sharing data and resources across virtualized servers.
      • Cloud APIs enable application interoperability and cross-platform compatibility.
      • Applications are able to launch and access distributed data irrespective of underlying IT infrastructures.
      [Diagram: Sanger, BGI, Broad, EBI and NCBI private clouds connected to a public cloud]
  24. A Unified Platform?
      An open-source platform for storing, organizing, processing and sharing very large genomic and biomedical data on premises or in the cloud:
      • Data management: file and metadata storage, structured/unstructured data, provenance tracking, security and access control.
      • Content-addressable distributed file system: scalability and fault tolerance, block storage of data, high performance over low latency.
      • Computation and pipeline processing: pipeline creation tools, revision control system, MapReduce engine, etc.
      • APIs and SDKs: REST and native APIs, web-based user interface, command-line interface, programming languages and tools, etc.
      • Cloud OS and virtualization: networking, self-service provisioning, administration, block storage, user management, etc.
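
Content-addressable block storage, one of the components listed above, means that each block is stored under a key derived from its own contents. The toy in-memory sketch below illustrates the idea only. It is not the design of any particular platform, the class name is hypothetical, and SHA-256 with a tiny block size is an arbitrary choice for the demonstration.

```python
import hashlib


class ContentAddressableStore:
    """Toy block store that keys every block by the SHA-256 of its bytes."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes

    def put(self, data):
        """Split data into blocks, store each once, return the digest list."""
        manifest = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks[digest] = block  # identical blocks share one entry
            manifest.append(digest)
        return manifest

    def get(self, manifest):
        """Reassemble the original bytes from a digest list."""
        return b"".join(self.blocks[digest] for digest in manifest)


if __name__ == "__main__":
    store = ContentAddressableStore(block_size=4)
    manifest = store.put(b"ACGTACGTTTTT")
    print(len(manifest), "blocks referenced,", len(store.blocks), "stored")
    print(store.get(manifest))
```

Because identical blocks hash to the same key, repeated content is stored only once, and a block's integrity can be checked by rehashing it.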
  25. Discussion
      • Compute is much cheaper. Algorithms run faster and more efficiently.
      • Transmission of big data will be a bottleneck. Network latency and storage I/O are the limiting factors.
      • Minimize the data flow!
      • Distributed file systems have reduced costs; routine analytics of big data has been made possible using cheap commodity hardware.
      • We should feel lucky that sequence analysis is mainly embarrassingly parallel!
      • MapReduce engines may be deployed in genome data centres?
      • Cloud computing enables data and application sharing across consolidated IT infrastructures.