How we built a global search engine for genetic data

Slides for How we built a global search engine for genetic data talk at Velocity San Jose 2018.

Published in: Software

  1. How we built a global search engine for genetic data
     Miro Cupak (@mirocupak), VP Engineering, DNAstack, 13/06/2018
  2. What and why?
     • Beacon Network
     • https://beacon-network.org/
     • largest search and discovery engine of human genetic mutations
     • from the Global Alliance for Genomics & Health (GA4GH)
     • case study: problem, standard, architecture, technologies, fun with stats
  3. Background
  4. https://beacon-network.org
  5. https://beacon-network.org
  6. Trends
     • sequencing cost decreasing exponentially (3M times since 2000)
     • https://www.nature.com/news/technology-the-1-000-genome-1.14901
  7. Trends
     • genomic data volume increasing exponentially (1M times since 2000)
     • http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  8. Trends
     • [Chart: data volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube, and genomics]
     • up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
     • http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  9. Problem
     • no single institution will have sufficient resources
     • still, institutions don’t have enough data
       • common diseases
       • rare diseases
     • challenge
       • discovering data
     • solution
       • traditional approach of data aggregation in a single centralized site not working
       • federated system capable of executing cross-dataset and cross-institution queries is needed
  10. GA4GH & Beacon Project
      • GA4GH
        • nonprofit standards organization
        • a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
        • goal: enable responsible sharing of genomic and clinical data
        • established in 2013
      • Beacon Project
        • experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
        • initiative requiring collaboration of many different GA4GH groups
        • started in 2014 and quickly gained traction
      • http://ga4gh.org/
      • https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
      • https://beacon-project.io/
  11. Beacon
  12. Beacon
      • simple web service allowing users to query an institution’s databases to determine whether they contain a genetic variant of interest
      • receives questions of the form “Do you have information about this mutation?”
      • responds with yes or no, optionally with additional information about the mutation
      • design principles
        • A beacon has to be technically simple.
        • A beacon has to minimize risks associated with genomic data sharing.
        • It has to be possible to make a beacon publicly available.
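The question-and-answer contract described on this slide can be illustrated with a few lines of Python. This is a minimal in-memory sketch, not the GA4GH Beacon API: the `Variant` and `Beacon` names, the field names, and the stored sample are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single-base variant. Field names are illustrative, not from any spec."""
    assembly: str      # reference assembly, e.g. "GRCh37"
    chromosome: str    # e.g. "13"
    position: int      # coordinate on the chromosome
    allele: str        # observed base, e.g. "C"


class Beacon:
    """Toy beacon answering 'Do you have information about this mutation?'."""

    def __init__(self, variants):
        self._variants = frozenset(variants)

    def query(self, variant: Variant) -> bool:
        # Only presence or absence is revealed, nothing about individual records.
        return variant in self._variants


# Hypothetical data set containing a single variant.
beacon = Beacon([Variant("GRCh37", "13", 32936732, "C")])
hit = beacon.query(Variant("GRCh37", "13", 32936732, "C"))
miss = beacon.query(Variant("GRCh37", "13", 32936732, "T"))
print(hit, miss)
```

Keeping the answer to a bare boolean is what makes such a service simple to implement and limits how much it discloses, which matches the design principles listed above.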
  13. Standard: Before Beacon Network
      • no formal specification
      • receives questions of the form “Do you have information about this mutation?”
      • responds with yes or no
      • 4 public beacons, each API different
        • request method
        • supported parameters
        • parameter names
        • chromosome identifiers
        • positional base
        • assembly notation
        • supported alleles
        • dataset support
        • response format
        • data included in the response
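To make these differences concrete, the sketch below translates one canonical query into the request parameters of two beacons. Both "dialects" are invented for illustration: the parameter names (`chrom`, `pos`, `ref`, `alternateBases`, and so on), the choice of a 0-based canonical position, and the `Query` type are assumptions, not the APIs of any real beacon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Canonical query: bare chromosome name and 0-based position (an assumption)."""
    chromosome: str
    position: int
    allele: str
    assembly: str


def to_dialect_a(q: Query) -> dict:
    # Hypothetical beacon A: "chr" prefix, 1-based positions, hg-style assembly names.
    assembly_names = {"GRCh37": "hg19", "GRCh38": "hg38"}
    return {
        "chrom": f"chr{q.chromosome}",
        "pos": q.position + 1,
        "allele": q.allele,
        "ref": assembly_names[q.assembly],
    }


def to_dialect_b(q: Query) -> dict:
    # Hypothetical beacon B: bare chromosome, 0-based positions, GRC assembly names.
    return {
        "chromosome": q.chromosome,
        "position": q.position,
        "alternateBases": q.allele,
        "assembly": q.assembly,
    }


q = Query(chromosome="13", position=32936731, allele="C", assembly="GRCh37")
print(to_dialect_a(q))
print(to_dialect_b(q))
```

Even in this small case, three of the listed differences show up at once: parameter names, chromosome identifiers plus positional base, and assembly notation.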
  14. Standard: Before Beacon Network
  15. Standard: 0.1
      • 2014
      • really simple (2 records)
      • true/false response
      • format: Avro
      • not enough traction
      • too vague
      • issues partially addressed by the Beacon Network
  16. Standard: 0.2
      • 2015
      • true/false/overlap/null response
      • datasets
      • simple data use conditions
      • self description
      • format: Avro
      • complex (9 records)
      • not well adopted
      • not polished enough
  17. Standard: 0.3
      • 2016
      • simplified 0.2
      • based on real needs, successful
      • true/false/null response
      • data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
      • modular and extensible
      • tooling
      • format: Avro → Proto3
  18. Standard: 0.4
      • 2018
      • stable and more flexible
      • support for more complex mutations
      • improved error handling
      • improved data use conditions
      • various minor improvements
      • developer experience
      • format: Proto3 → OpenAPI
  19. Beacon Network
  20. Architecture
  21. Data
      • access data stored in a relational database
  22. Service
      • communication with other subsystems
      • query normalization
      • aggregators
      • participant resolution
      • query distribution
      • audit trail
      • L1 parallelization
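Query distribution with per-beacon parallelism might look roughly like the following. This is a sketch under stated assumptions rather than the Beacon Network's implementation: beacons are modelled as plain functions, a thread pool stands in for whatever "L1 parallelization" refers to in the real system, and the `aggregate` rule (any "yes" wins) is one plausible reading of "aggregators". All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# A beacon client is modelled as a function returning True, False, or None (no answer).
BeaconClient = Callable[[str], Optional[bool]]


def safe_call(client: BeaconClient, query: str) -> Optional[bool]:
    """Treat a failing beacon as 'no answer' so one outage cannot fail the search."""
    try:
        return client(query)
    except Exception:
        return None


def distribute(query: str, beacons: Dict[str, BeaconClient]) -> Dict[str, Optional[bool]]:
    """Send the same query to every beacon in parallel and collect answers by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(beacons))) as pool:
        futures = {name: pool.submit(safe_call, client, query)
                   for name, client in beacons.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(responses: Dict[str, Optional[bool]]) -> Optional[bool]:
    """True if any beacon said yes, False if some answered and none said yes, else None."""
    answers = [r for r in responses.values() if r is not None]
    if not answers:
        return None
    return any(answers)


def broken(_query: str) -> Optional[bool]:
    # Stand-in for a beacon that cannot be reached.
    raise TimeoutError("beacon unreachable")


beacons = {"a": lambda q: True, "b": lambda q: False, "c": broken}
responses = distribute("13:32936732:C", beacons)
print(responses, aggregate(responses))
```

Mapping failures to `None` mirrors the true/false/null responses of the later standard versions and keeps a slow or broken participant from affecting the others.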
  23. Processor
      • executing a query against a beacon and processing its response
      • management of a flexible, dynamic and easily extensible query execution pipeline
      • pipeline stages resolution (CDI and EJB)
      • L2 parallelization
      • cross-assembly query handling
  24. Converter
      • first stage in the query execution pipeline
      • translating query parameters
  25. Requester
      • second stage in the query execution pipeline
      • constructing beacon requests based on their URIs and parameters produced by the converters
  26. Fetcher
      • third stage in the query execution pipeline
      • unit actually talking to the API of beacons
      • submitting requests over the network and obtaining the raw response
  27. Parser
      • last stage in the pipeline
      • extracting information of interest from the raw response obtained by a fetcher
      • dealing with various formats
      • handling metadata, multiple responses, errors
      • response normalization
      • parallelized
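The four stages described on the preceding slides (converter, requester, fetcher, parser) can be chained as in the sketch below. It is an illustration only: the real system resolves its stages with CDI and EJB in Java, whereas here each stage is a plain Python function, the fetcher is a stub that never touches the network, and the URL, parameter names, and `exists` response field are hypothetical.

```python
import json
from typing import Callable, Dict, Optional
from urllib.parse import urlencode


def convert(query: Dict[str, str]) -> Dict[str, str]:
    """Converter: translate canonical parameters into this beacon's parameter names."""
    return {"chrom": query["chromosome"], "pos": query["position"], "allele": query["allele"]}


def build_request(base_url: str, params: Dict[str, str]) -> str:
    """Requester: construct the request from the beacon URI and converted parameters."""
    return f"{base_url}?{urlencode(params)}"


def fake_fetch(url: str) -> str:
    """Fetcher stand-in: returns a canned raw response instead of calling a beacon."""
    return json.dumps({"exists": "pos=32936732" in url})


def parse(raw: str) -> Optional[bool]:
    """Parser: normalize the raw response to True, False, or None on malformed input."""
    try:
        return bool(json.loads(raw)["exists"])
    except (ValueError, KeyError, TypeError):
        return None


def run_pipeline(query: Dict[str, str], base_url: str,
                 fetch: Callable[[str], str] = fake_fetch) -> Optional[bool]:
    """Run converter, requester, fetcher, and parser in order for one beacon."""
    params = convert(query)
    request = build_request(base_url, params)
    raw = fetch(request)
    return parse(raw)


query = {"chromosome": "13", "position": "32936732", "allele": "C"}
result = run_pipeline(query, "https://beacon.example.org/query")
print(result)
```

Because each stage only depends on the output of the previous one, a beacon with a different API needs only its own converter or parser, and the fetcher can be swapped for a real HTTP client without changing the rest.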
  28. Mapper
      • translation between different representations of objects
  29. REST
      • handling client requests
      • data serialization
  30. Search execution
  31. Stats
  32. Size
      • 100 installations
      • 40 institutions
      • 18 countries
      • 6 continents
  33. Users
      • 13k users
      • 136 countries
  34. Searches
  35. Assemblies
      • GRCh37: 83%
      • GRCh38: 6%
      • Others: 11%
  36. Chromosomes
      • Chr. 2: 18%
      • Chr. 17: 14%
      • Chr. 1: 11%
      • Chr. 13: 11%
      • Chr. 7: 7%
      • Others: 39%
  37. Variants
      • 84k distinct mutations
      • 2 : 38938 C (FAM110C): 6%
      • 13 : 32936732 C (BRCA2): 6%
      • 22 : 46546565 A (PPARA): 3%
      • 2 : 45895 G (FAM110C): 3%
      • 7 : 140453136 C (BRAF): 2%
      • 1 : 43815163 C (MPL): 2%
      • 1 : 115258747 A (NRAS): 1%
      • 14 : 23894969 A (MYH7): 1%
      • 2 : 29432776 C (ALK): 1%
      • 2 : 212289100 C (ERBB4): 1%
      • Others: 74%
  38. Deleteriousness
      • [Histogram: number of variants by SIFT (Sorting Intolerant From Tolerant) score]
      • SIFT: 69% damaging, 31% tolerated
      • [Histogram: number of variants by PolyPhen-2 HDIV (Polymorphism Phenotyping v2) score]
      • PolyPhen-2 HDIV: 55% probably damaging, 22% possibly damaging, 23% benign
  39. Rarity
      • 25% rare variants (1,000 Genomes Project)
      • [Histogram: number of variants by allele frequency]
  40. Genes
      Rank  Symbol   Name                                              Share
      1     FAM110C  Family With Sequence Similarity 110 Member C      11%
      2     BRCA1    BRCA1, DNA Repair Associated                      10%
      3     BRCA2    BRCA2, DNA Repair Associated                      9%
      4     PPARA    Peroxisome Proliferator Activated Receptor Alpha  4%
      5     ERBB4    Erb-B2 Receptor Tyrosine Kinase 4                 3%
      6     BRAF     B-Raf Proto-Oncogene, Serine/Threonine Kinase     3%
      7     MPL      MPL Proto-Oncogene, Thrombopoietin Receptor       2%
      8     MYH7     Myosin Heavy Chain 7                              2%
      9     KIT      KIT Proto-Oncogene Receptor Tyrosine Kinase       1%
      10    RET      Ret Proto-Oncogene                                1%
            Others                                                     53%
  41. Disorders & clinical abnormalities
      Rank  OMIM                                      HPO
      1     Pancreatic cancer, susceptibility to, 4   Autosomal dominant inheritance
      2     Breast-ovarian cancer, familial, 1        Autosomal recessive inheritance
      3     Fanconi anemia, complementation group D1  Scoliosis
      4     Prostate cancer                           Short stature
      5     Pancreatic cancer 2                       Cognitive impairment
      6     Medulloblastoma                           Constipation
      7     Glioblastoma 3                            Somatic mutation
      8     Breast-ovarian cancer, familial, 2        Cafe-au-lait spot
      9     Breast cancer, male, susceptibility to    Failure to thrive
      10    Wilms tumor                               Nausea and vomiting
  42. Questions?
      • https://mirocupak.com