How we've made a global search engine for genetic data

Slides for the talk "How we've made a global search engine for genetic data" at Strange Loop 2018.

Published in: Software
  1. How we've made a global search engine for genetic data. Miro Cupak (@mirocupak), VP Engineering, DNAstack. 28/09/2018.
  2. https://www.ga4gh.org/
  3. https://beacon-network.org
  4. https://beacon-network.org
  5. Background: sequencing cost is decreasing exponentially (3M times since 2000). Source: https://www.nature.com/news/technology-the-1-000-genome-1.14901
  6. Background: genomic data volume is increasing exponentially (1M times since 2000). Source: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  7. Background: up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube). [Chart: data volumes by 2025 (GB), lower and upper bounds for Twitter, YouTube, and genomics.] Source: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  8. What does it mean?
     • ❓ Problem: Too much data for any single institution. Not enough data to make new discoveries.
     • ❗ Key obstacle: Discovering data.
     • 💡 Solution: Federated system capable of executing cross-dataset and cross-institution queries.
  9. Beacon Project (https://beacon-project.io/)
     • initiative started in 2014 across many groups within GA4GH
     • experiment to test the willingness to share in the simplest of all technical contexts
     • simple web service
       • receives questions of the form "Do you have information about this mutation?"
       • responds with yes or no (optionally additional metadata)
     • design principles
       • A beacon has to be technically simple.
       • A beacon has to minimize risks associated with genomic data sharing.
       • It has to be possible to make a beacon publicly available.
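The yes/no exchange described on the Beacon Project slide can be sketched in a few lines. This is a minimal illustration, not the Beacon specification or the API of any real beacon: `BeaconQuery`, `ToyBeacon`, the field names, and the in-memory variant set are hypothetical, and the choice of GRCh37 and 1-based positions is an assumption made only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconQuery:
    """Hypothetical query shape: 'do you have this allele at this position?'"""
    chromosome: str  # e.g. "13"
    position: int    # treated as 1-based in this sketch (an assumption)
    allele: str      # e.g. "C"
    assembly: str    # e.g. "GRCh37"


class ToyBeacon:
    """In-memory stand-in for a beacon; a real one sits in front of a datastore."""

    def __init__(self, variants):
        self._variants = set(variants)

    def query(self, q):
        # The answer reveals only presence or absence, never the records themselves.
        return {"exists": q in self._variants}


beacon = ToyBeacon([BeaconQuery("13", 32936732, "C", "GRCh37")])
hit = beacon.query(BeaconQuery("13", 32936732, "C", "GRCh37"))
miss = beacon.query(BeaconQuery("13", 32936732, "T", "GRCh37"))
```

Keeping the response to a single boolean is what makes such a service simple to build and low-risk to expose publicly, which matches the design principles listed on the slide.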
  10. Standard: Before Beacon Network
     • no formal specification
     • receives questions of the form "Do you have information about this mutation?"
     • responds with yes or no
     • 4 public beacons, each with a different API, varying in:
       • request method
       • supported parameters
       • parameter names
       • chromosome identifiers
       • positional base
       • assembly notation
       • supported alleles
       • dataset support
       • response format
       • data included in the response
  11. Standard: Before Beacon Network
  12. Standard: 0.1
     • 2014
     • really simple (2 records)
     • true/false response
     • format: Avro
     • too vague
     • not enough traction
  13. Standard: 0.2
     • 2015
     • true/false/overlap/null response
     • datasets
     • data use conditions
     • self description
     • complex (9 records)
     • format: Avro
     • not well adopted
     • not polished enough
  14. Standard: 0.3
     • 2016
     • simplified 0.2
     • true/false/null response
     • data model improvements, extended metadata and response, improved support for datasets and cross-dataset queries, data versioning
     • modular and extensible
     • tooling
     • format: Avro → Proto3
     • based on real needs, successful
  15. Standard: 0.4
     • 2018
     • stable and more flexible
     • support for more complex mutations
     • improved error handling
     • improved data use conditions
     • developer experience
     • format: Proto3 → OpenAPI
  16. Standard: 1.0
     • promoted 0.4
     • extended documentation and best practices
     • Finally ready! 🎉
  17. Beacon Network
  18. Data
     • access data stored in a relational database
  19. Service
     • communication with other subsystems
     • query normalization
     • aggregators
     • participant resolution
     • query distribution
     • audit trail
     • L1 parallelization
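Query distribution and aggregation in the Service layer might look roughly like the following. This is a sketch under assumptions: it reads "L1 parallelization" as fanning one query out to all participating beacons concurrently, it reads "aggregators" as combining per-beacon answers into one, and it uses a Python thread pool for illustration only. The function names `distribute` and `aggregate` and the beacon callables are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(query, beacons):
    """Send one query to every beacon in parallel and collect per-beacon answers.

    `beacons` maps a beacon name to a callable returning True, False, or None.
    A beacon that raises is recorded as None (no answer) so that one failing
    participant does not fail the whole search.
    """
    def ask(item):
        name, call = item
        try:
            return name, call(query)
        except Exception:
            return name, None

    with ThreadPoolExecutor(max_workers=max(1, len(beacons))) as pool:
        return dict(pool.map(ask, beacons.items()))


def aggregate(results):
    """Answer True if any participating beacon answered True."""
    return any(answer is True for answer in results.values())


def broken(_query):
    raise ConnectionError("beacon unreachable")


beacons = {
    "a": lambda q: q == "13:32936732:C",
    "b": lambda q: False,
    "c": broken,
}
results = distribute("13:32936732:C", beacons)
found = aggregate(results)
```

Recording a failed beacon as `None` rather than `False` keeps "no answer" distinct from "variant not present", in the spirit of the true/false/null responses of the later standard versions.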
  20. Processor
     • executing a query against a beacon and processing its response
     • management of a flexible, dynamic and easily extensible query execution pipeline
     • pipeline stages resolution (CDI and EJB)
     • L2 parallelization
     • cross-assembly query handling
  21. Converter
     • first stage in the query execution pipeline
     • translating query parameters
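Parameter translation of the kind a converter performs can be illustrated with three small functions covering chromosome identifiers, positional base, and assembly notation, which are among the differences between beacon APIs listed earlier in the deck. This is a sketch, not the Beacon Network's converter code: the function names and the `style` values are hypothetical, and only the GRCh37/hg19 and GRCh38/hg38 name pairs are handled.

```python
def convert_chromosome(chromosome, style):
    """Render a chromosome in the notation a beacon expects.

    style "chr" gives "chr7"; any other style gives the bare form "7".
    """
    bare = chromosome[3:] if chromosome.lower().startswith("chr") else chromosome
    if bare.upper() in {"X", "Y", "MT"}:
        bare = bare.upper()
    return "chr" + bare if style == "chr" else bare


def convert_position(position, from_base, to_base):
    """Shift a coordinate between 0-based and 1-based numbering."""
    return position - from_base + to_base


def convert_assembly(assembly, style):
    """Translate between GRCh-style and UCSC-style assembly names."""
    grch_to_ucsc = {"GRCh37": "hg19", "GRCh38": "hg38"}
    ucsc_to_grch = {ucsc: grch for grch, ucsc in grch_to_ucsc.items()}
    canonical = ucsc_to_grch.get(assembly, assembly)
    # Unknown assemblies pass through unchanged rather than raising.
    return grch_to_ucsc.get(canonical, canonical) if style == "ucsc" else canonical
```

For example, `convert_position(140453136, 1, 0)` turns a 1-based coordinate into the 0-based one that a beacon with a different positional base would expect.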
  22. Requester
     • second stage in the query execution pipeline
     • constructing beacon requests based on their URIs and parameters produced by the converters
  23. Fetcher
     • third stage in the query execution pipeline
     • unit actually talking to the API of beacons
     • submitting requests over the network and obtaining the raw response
  24. Parser
     • last stage in the pipeline
     • extracting information of interest from the raw response obtained by a fetcher
     • dealing with various formats
     • handling metadata, multiple responses, errors
     • response normalization
     • parallelized
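The four pipeline stages (converter, requester, fetcher, parser) can be chained as in the sketch below. It is an illustration under assumptions rather than the actual implementation: the fetcher is stubbed with a canned response instead of making a network call, the URL `https://beacon.example.org/query` and the parameter names are hypothetical, and the response shapes the parser accepts are invented examples of "various formats".

```python
import json
from urllib.parse import urlencode


def converter(query):
    """Stage 1: translate parameters (here: 1-based position to 0-based)."""
    return {**query, "position": query["position"] - 1}


def requester(base_uri, params):
    """Stage 2: build the request URL from the beacon URI and converted parameters."""
    return base_uri + "?" + urlencode(params)


def fetcher(url):
    """Stage 3: would perform the HTTP call; stubbed with a canned body here."""
    return '{"response": {"exists": true}}'


def parser(raw):
    """Stage 4: normalize differing response formats to True, False, or None."""
    text = raw.strip()
    try:
        body = json.loads(text)
    except ValueError:
        body = text  # not JSON, treat as plain text such as "YES"
    if isinstance(body, dict):
        body = body.get("response", body)
        if isinstance(body, dict):
            body = body.get("exists")
    if isinstance(body, bool):
        return body
    if isinstance(body, str):
        return {"yes": True, "true": True, "no": False, "false": False}.get(body.lower())
    return None


def run_pipeline(base_uri, query, fetch=fetcher):
    """Run all four stages for a single beacon."""
    url = requester(base_uri, converter(query))
    return url, parser(fetch(url))


url, answer = run_pipeline(
    "https://beacon.example.org/query",
    {"chromosome": "7", "position": 140453136, "allele": "C", "assembly": "GRCh37"},
)
```

Passing the fetcher in as a parameter keeps the network-facing stage replaceable, so the other stages can be exercised without contacting a beacon.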
  25. Mapper
     • translation between different representations of objects
  26. REST
     • handling client requests
     • data serialization
  27. Size: 100+ installations, 40+ institutions, 18 countries, 6 continents
  28. Users: 16k users, 141 countries
  29. Searches
  30. Assemblies: GRCh37 83%, GRCh38 6%, others 11%
  31. Chromosomes: Chr. 2 18%, Chr. 17 14%, Chr. 1 11%, Chr. 13 11%, Chr. 7 7%, others 39%
  32. Variants: 85k distinct mutations
     • 2 : 38938 C (FAM110C) 6%
     • 13 : 32936732 C (BRCA2) 6%
     • 22 : 46546565 A (PPARA) 3%
     • 2 : 45895 G (FAM110C) 3%
     • 7 : 140453136 C (BRAF) 2%
     • 1 : 43815163 C (MPL) 2%
     • 1 : 115258747 A (NRAS) 1%
     • 14 : 23894969 A (MYH7) 1%
     • 2 : 29432776 C (ALK) 1%
     • 2 : 212289100 C (ERBB4) 1%
     • Others 74%
  33. Deleteriousness [two histograms: number of variants (log scale) by score, 0.00 to 0.98]
     • SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
     • PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
  34. Rarity [histogram: number of variants (log scale) by allele frequency, 0.00 to 0.99]
     • 25% rare variants (1,000 Genomes Project)
  35. Genes (rank, symbol, name, share of searches)
     • 1. FAM110C, Family With Sequence Similarity 110 Member C: 11%
     • 2. BRCA1, BRCA1, DNA Repair Associated: 10%
     • 3. BRCA2, BRCA2, DNA Repair Associated: 9%
     • 4. PPARA, Peroxisome Proliferator Activated Receptor Alpha: 4%
     • 5. ERBB4, Erb-B2 Receptor Tyrosine Kinase 4: 3%
     • 6. BRAF, B-Raf Proto-Oncogene, Serine/Threonine Kinase: 3%
     • 7. MPL, MPL Proto-Oncogene, Thrombopoietin Receptor: 2%
     • 8. MYH7, Myosin Heavy Chain 7: 2%
     • 9. KIT, KIT Proto-Oncogene Receptor Tyrosine Kinase: 1%
     • 10. RET, Ret Proto-Oncogene: 1%
     • Others: 53%
  36. Disorders & clinical abnormalities (rank, OMIM disorder, HPO term)
     • 1. OMIM: Pancreatic cancer, susceptibility to, 4. HPO: Autosomal dominant inheritance
     • 2. OMIM: Breast-ovarian cancer, familial, 1. HPO: Autosomal recessive inheritance
     • 3. OMIM: Fanconi anemia, complementation group D1. HPO: Scoliosis
     • 4. OMIM: Prostate cancer. HPO: Short stature
     • 5. OMIM: Pancreatic cancer 2. HPO: Cognitive impairment
     • 6. OMIM: Medulloblastoma. HPO: Constipation
     • 7. OMIM: Glioblastoma 3. HPO: Somatic mutation
     • 8. OMIM: Breast-ovarian cancer, familial, 2. HPO: Cafe-au-lait spot
     • 9. OMIM: Breast cancer, male, susceptibility to. HPO: Failure to thrive
     • 10. OMIM: Wilms tumor. HPO: Nausea and vomiting
  37. Questions? @mirocupak