Open Data in Bioinformatics and Required Infrastructure towards achieving the SDGs/Samar Kassim


Presented as part of the session on “How open data can contribute to achieving the UN SDGs”, during the BioVision2018 conference in Alexandria, Egypt.

Published in: Data & Analytics

  1. Open Data in Bioinformatics and Required Infrastructure towards achieving the SDGs
     9th BioVisionAlexandria Conference, Alexandria, Egypt, 2018
     Prof. Samar Kassim
     9th BioVisionAlexandria Conference, Egypt
  2. Introduction
     • A major technological advance in molecular biology is the sophistication, diversity, scale and decreasing cost of the data being generated, e.g. by high-throughput platforms
     • First human genome sequence:
       – Throughput: 2.8 million bases per 24 hours on AB3730xl sequencers
       – 13 years to sequence 3 billion bases at x10 coverage
       – Cost: ~500 million USD (lower-bound estimate)
     • Next (now) generation sequencing:
       – Throughput: 1 million bases per second
       – ~10 hours to sequence 3 billion bases at x10 coverage
       – Cost: ~4,000 USD per genome
     https:// http:// Author = Ben Moore
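The sequencing times on this slide can be checked from its own throughput figures. The following is a minimal arithmetic sketch, assuming a single instrument running continuously at the stated rate; the function and variable names are illustrative and do not come from the talk.

```python
GENOME_BASES = 3_000_000_000   # 3 billion bases, as stated on the slide
COVERAGE = 10                  # x10 coverage, as stated on the slide
SECONDS_PER_DAY = 86_400


def sequencing_seconds(bases_per_second: float) -> float:
    """Seconds one instrument needs to read the genome at the given coverage."""
    return GENOME_BASES * COVERAGE / bases_per_second


# Next-generation sequencing: 1 million bases per second.
ngs_hours = sequencing_seconds(1_000_000) / 3600

# One AB3730xl: 2.8 million bases per 24 hours.
sanger_rate = 2_800_000 / SECONDS_PER_DAY
sanger_years = sequencing_seconds(sanger_rate) / (SECONDS_PER_DAY * 365)

print(f"NGS: {ngs_hours:.1f} hours")
print(f"Single AB3730xl: {sanger_years:.1f} years")
```

Under these assumptions the NGS figure comes to roughly 8.3 hours, in line with the slide's "~10 hours". A single AB3730xl would need about 29 years, so the 13-year figure presumably reflects several instruments running in parallel rather than one machine.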
  3. Data-driven biological science - bioinformatics
     • Decreasing data generation costs have shifted the biological sciences towards data-driven science, with bioinformatics playing a major role
     Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7): e1002195. https:// http://
  4. Genomics and Africa - H3Africa
     • “The Human Heredity and Health in Africa (H3Africa) Initiative aims to facilitate a contemporary research approach to the study of genomics and environmental determinants of common diseases with the goal of improving the health of African populations.” (http://)
     • “The vision of H3Africa is to create and support a pan-continental network of laboratories that will be equipped to apply leading-edge research to the study of the complex interplay between environmental and genetic factors which determines disease susceptibility and drug responses in African populations.” (http://)
  5. H3Africa Phase I overview
     • 25 research projects in Africa
     • > 500 investigators
     • Covers 27 African countries
     • Up to 75,000 research participants
     • > USD 76 million invested in Phase I
     The H3Africa Consortium (diagram): 8 Collaborative Centers, 7 Research Projects, 3 Biorepositories, 6 Ethics Grants, Bioinformatics Network
     http://
  6. H3Africa Bioinformatics Network (H3ABioNet)
     • Pan-African bioinformatics network to develop bioinformatics capacity in Africa and support the H3Africa research projects
     • 28 nodes in 17 African countries
     • PI: Prof. Nicky Mulder, CBIO-UCT
     • Education, infrastructure, research
     • Archive African genomics data
  7. H3Africa data being collected (Phase I)
     • Phenotype data (associated with genotype data)
       – Demographic information
       – Anthropometric data
       – Disease and health related phenotype data
     • Genetic variation data (human and pathogen)
       – Sequence data (whole genome, exome, targeted)
     • Genotyping chip array data
       – ~55,000 samples to be run on an H3Africa African custom chip
     • Microbiome sequence data
       – Patient/sample phenotypes
       – Non-human 16S rRNA sequence data for microbiome
       – Non-human full genome sequence data for microbiome
       – Possible human sequence contamination
     • Biospecimens to be deposited at the H3Africa biorepositories
     Image credits: National Human Genome Research Institute (https://)
  8. Lack of repository for African genomics data
     • 1,759 datasets returned for the query “African” – none in Africa
     https://
  9. H3Africa Data Archive
     • Assists H3Africa projects as a data coordination center:
       Transfer → Validate → Store → Submit to EGA → Obtain EGA accessions for publications
     • 0.5 petabytes storage size, including offsite replication
  10. H3Africa Catalogue
      • Online catalogue with metadata to search and apply for datasets and biospecimens (under development)
  11. Human genetic data privacy
      • H3Africa is a rich source of metadata (phenotypes):
        (1) Age & (2) Sex
        (3) Country of birth
        (4) Current residence
        (5) Native language
        (6) Ethno-linguistic/tribal affiliation
        (7) Country of birth of father and mother
        (8) Native language of father and mother
        (9) Ethno-linguistic/tribal affiliation of mother and father
        (10) Height
        (11) Weight
        (12) Current medications
        (13) Smoking history
        (14) Alcohol history
      • The combination of phenotype and genetic data makes it possible to identify different populations and individuals, hence restricted access
      Image credits: National Human Genome Research Institute (https://)
  12. Sharing of research data and outputs
      • Funders’ data sharing policies
      “The Wellcome Trust is committed to ensuring that the outputs of the research it funds, including research data, are managed and used in ways that maximise public benefit. Making research data widely available to the research community in a timely and responsible manner ensures that these data can be verified, built upon and used to advance knowledge and its application to generate improvements in health.”
      https://-grant/policy-data-management-and-sharing
      “The National Institutes of Health (NIH) Genomic Data Sharing Policy expects that genomic research data from NIH-supported studies involving human specimens as well as non-human and model organisms will be submitted to an NIH-designated data repository. The list below provides examples of relevant databases.”
      https://
  13. Limits to sharing human genetic data
      • Ethics:
        – Digital data (genomes) can be stored indefinitely; biobank specimens can be stored for up to 20 years (secondary use)
        – Rapid innovation with ‘omics technologies
      • H3Africa: “Seven projects used broad consent, five projects used tiered consent and one used specific consent” §
      • History of vulnerable populations, low education levels and exploitation
      • Blood sample collection and visits to clinics are associated with disease and treatment, even for a healthy control
      • “All but one of the consent forms that we reviewed included a statement about data sharing.” §
      Ethical considerations (diagram): informed consent, participant identification, stigmatisation, benefit sharing
      § Munung NS, Marshall P, Campbell M, et al. Obtaining informed consent for genomics research in Africa: analysis of H3Africa consent documents. Journal of Medical Ethics 2016;42:132-137
  14. Limits to sharing human genetic data
      • Non-harmonized national / regional laws and policies for ethics and genome data sharing within Africa
      Image credits: https://
  15. H3Africa data sharing and access policy
      • Balance between ensuring adequate safeguards to protect participants and not being a barrier for scientists to advance research:
        - Maximizing the availability of research data, in a timely and responsible manner.
        - Protecting the rights and privacy of human subjects who participated in research studies.
        - Recognizing the scientific contribution of researchers who generated the data.
        - Considering the nature and ethics of the research proposed in establishing the timely release of data, and mechanisms of data sharing.
        - Promoting deposition of genomic data in existing community data repositories whenever possible.
      http:// %20Access%20%20Release%20Policy%20Aug%202014.pdf
  16. Challenges in sharing data – metadata standards
      • Metadata (phenotype data) is collected via case report forms (CRFs)

        Project 1 CRF    Project 2 CRF    Project 3 CRF
        Female           Woman            1
        Daily units      Weekly units     User defined time period

      • Same question – data coded in different ways
      • Similar measure – collected in different ways
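The two rows of the CRF comparison above can be harmonized with a small lookup step. This is a minimal sketch, assuming hypothetical project keys ("project1" to "project3") and a shared target code "F"; none of these names come from the H3Africa CRFs themselves.

```python
from typing import Optional

# Hypothetical per-project codings for the same answer to the same question.
SEX_CODES = {
    ("project1", "female"): "F",
    ("project2", "woman"): "F",
    ("project3", "1"): "F",
}

# Days covered by each fixed reporting period (a week is taken as 7 days).
PERIOD_DAYS = {"daily": 1.0, "weekly": 7.0}


def harmonize_sex(project: str, raw: object) -> Optional[str]:
    """Map a project-specific raw value to the shared code, or None if unknown."""
    return SEX_CODES.get((project, str(raw).strip().lower()))


def units_per_day(units: float, period: str,
                  custom_days: Optional[float] = None) -> float:
    """Convert units reported over a period into units per day.

    For a user-defined period, pass its length in days as custom_days.
    """
    if custom_days is not None:
        days = custom_days
    else:
        days = PERIOD_DAYS[period.strip().lower()]
    return units / days


print(harmonize_sex("project1", "Female"),
      harmonize_sex("project2", "Woman"),
      harmonize_sex("project3", 1))
print(units_per_day(14, "weekly"))
```

Keying the lookup by project matters because a bare code such as "1" means nothing outside the form that defined it. Unknown values come back as None rather than being guessed.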
  17. Use established standards - Ontologies
      • “An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.”*
      *http://-noy-mcguinness.html
  18. Options to aid data sharing
      • Make data Findable, Accessible, Interoperable and Reusable (FAIR compliant)
      • Allow yes/no queries: “Do you see a genetic variant at a specific position within your dataset?”, as in the case of the South African Human Genome Program (SAHGP)
      Global Alliance for Genomics and Health: http://
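A yes/no variant query of this kind can be answered without releasing any individual-level data. The sketch below is a minimal in-memory illustration of the idea only: the variants are invented, the class and method names are hypothetical, and it is not the Global Alliance for Genomics and Health API.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int      # 1-based genomic position
    reference: str     # reference allele
    alternate: str     # alternate allele


class YesNoIndex:
    """Answers only whether a variant is present in the dataset."""

    def __init__(self, variants: Iterable[Variant]) -> None:
        self._variants = frozenset(variants)

    def has_variant(self, chromosome: str, position: int,
                    reference: str, alternate: str) -> bool:
        query = Variant(chromosome, position,
                        reference.upper(), alternate.upper())
        return query in self._variants


# Made-up example data, for illustration only.
index = YesNoIndex([
    Variant("1", 12345, "A", "G"),
    Variant("7", 67890, "C", "T"),
])

print(index.has_variant("1", 12345, "a", "g"))
print(index.has_variant("1", 12345, "A", "T"))
```

The caller learns only whether the allele was observed, not in whom or how often, which is what lets a restricted-access dataset remain discoverable.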
  19. H3Africa genotyping chip
      • Current genotyping technologies are designed for European populations
      • African populations are under-represented, although they have the most diversity
      Image credits: National Human Genome Research Institute (https://)
  20. Designing the H3Africa genotyping chip
      • Collaboration between H3ABioNet and the National Center for Supercomputing Applications (NCSA, US-based) via a US partner at the University of Illinois
      • Utilized the Blue Waters supercomputer facilities and CHPC facilities
        – 212,000 node computing hours used at Blue Waters
        – 600 TB of storage needed
      • The chip has undergone assessment and is in use with positive results
      https://twitter
      Image credits: National Human Genome Research Institute (https://)
  21. Connectivity for data transfers

      GO endpoints              Transfer speeds in Mbps (min, max)
      Baylor <-> Blue Waters    340, 1900
      Blue Waters -> UCT        204, 322
      CHPC <-> Blue Waters      81, 243
      UCT <-> CHPC              34, 406
      Sanger <-> UCT            38, 76

      GO source and destination   Files to transfer / size per sample   Total size for 350 samples   Min transfer speed   Time to transfer
      Baylor to Blue Waters       Baylor FASTQ.gzs / 100GB              75TB                         340Mbps              21 days
      Blue Waters to UCT          Baylor FASTQ.gzs / 100GB              75TB                         200Mbps              35 days
      Blue Waters to UCT          BW BAMs / 100GB                       40TB                         200Mbps              19 days
      UCT to CHPC                 BW BAMs / 100GB                       40TB                         34Mbps               109 days
      CHPC to UCT                 Union set / VCFs                      1TB                          34Mbps               3 days
      UCT to Sanger               Union set / VCFs                      1TB                          34Mbps               3 days

      Globus Online (GO) installed at nodes
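The "Time to transfer" column follows from total size and minimum speed. This is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second), a sustained rate with no protocol overhead, and rounding up to whole days; the function name is illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def transfer_days(size_tb: float, speed_mbps: float) -> int:
    """Whole days needed to move size_tb terabytes at a sustained speed_mbps."""
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (speed_mbps * 1e6)    # bits / (bits per second)
    return math.ceil(seconds / SECONDS_PER_DAY)


# Rows taken from the transfer table on the slide.
routes = [
    ("Baylor to Blue Waters", 75, 340),
    ("Blue Waters to UCT (FASTQ)", 75, 200),
    ("Blue Waters to UCT (BAM)", 40, 200),
    ("UCT to CHPC", 40, 34),
    ("CHPC to UCT", 1, 34),
]

for name, size_tb, mbps in routes:
    print(f"{name}: {transfer_days(size_tb, mbps)} days")
```

With these assumptions the function gives 21, 35, 19, 109 and 3 days, matching the slide. The 34 Mbps link between UCT and CHPC dominates the schedule, which is the infrastructure point the table makes.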
  22. Challenge of unequal infrastructures
      • Diverse levels of expertise and infrastructure between different countries
      • Software and hardware sanctions exacerbate existing inequalities, e.g. the Sudan node
      www.project-
      http://-01-14-17-startling-facts-about-the-state-of-science-and-research-in-africa
  23. Bioinformatics education
      Aim:
      • Basic bioinformatics training for interested H3Africa members (bioinformatics users – Introduction to Bioinformatics Training)
      • Web-based bioinformatics tools and resources and how to use them
      Course logistics:
      • 3 months, 2 days contact time per week (3 hours per session)
      • Distance learning model – physical classrooms connected to a virtual classroom
      • Mconf – video conferencing
      • Vula – course management
  24. IBT 2017 classrooms
      • IBT_2017 classroom sites: 27 in total (vs. 20 classrooms in 2016)
      • Countries that have joined IBT in 2017: Ethiopia, Burkina Faso
      • Some participants from the first course are going to be TAs
      • Over 580 enrolled participants and over 130 volunteer staff
      • Paper published on course design
      Map legend: virtual classroom; classroom site 2016; new classroom site 2017; classroom site 2016 and 2017
  25. Conclusion
      • Bioinformatics = big data, and needs computational power, storage, and fast read and write for processing
      • Well-defined metadata standards are vital for interoperability and sharing of data
      • Cyber infrastructure for moving and sharing large datasets is needed to foster open data and open science
      • Education and skills development are essential for African citizens to take advantage of the data revolution
      • Perceptions and attitudes – no amount of infrastructure will drive open data and open science if the sentiment is absent
  26. Acknowledgements
      • Prof. Nicky Mulder and H3ABioNet members
      • Ina Smith and the Academy of Science of South Africa
      • BioVisionAlexandria 2018 organizers
      H3ABioNet Consortium Members 2017
  27. Conclusions
      • Provide a data archiving solution for H3Africa projects to ensure that a local copy of the data remains on the continent
  28. Communication – H3Africa
      • H3Africa working groups meet every fortnight
      • Regular meetings are challenging due to the diversity of time zones (most funders are in the US) and daylight saving hours
      Image credit: https://
  29. Communication – H3Africa
      • H3Africa funders and project members meet face to face every six months to provide reports, and for working groups to wrap up deliverables
  30. Communication – H3ABioNet
      • Within H3ABioNet the nodes are located in Africa, so time differences are not a hindrance
      • Working groups meet once a month and the network meets annually for SAB review and network business
      • Only some countries have toll-free access to a booked conference call, so calls are costly
      • Challenges: communication platforms
      http://
  31. Biomedical science becoming “data rich”
      OECD – WDS Workshop, Brussels 2017
      https://
  32. Mapping internet connectivity
  33. Bioinformatics SOPs - Reproducible
      • Developed SOPs and practice datasets for:
        – NGS variant calling
        – Genome Wide Association Studies (GWAS)
        – 16S rRNA diversity analysis
      • SOPs and practice datasets under development:
        – RNA-Seq
        – Variant prioritization and annotation
      • Guidelines on compute and storage
  34. Archive dashboard
  35. Ontologies work
      • Adapting OMIABIS ontology to H3Africa data
      • Mapping CRFs to ontologies, e.g. phenotype or disease ontology
      • Mapping genomics data to Experimental Factor Ontology
      • Developing Sickle Cell Disease Ontology
  36. Beacons in Africa
      • First Beacon in Africa “lit” in October 2016 for the SAHGP
      https://beacon-