How we built a global search engine for genetic data

Slides from the talk "How we built a global search engine for genetic data" at Index San Francisco 2018.

Published in: Software

  1. How we built a global search engine for genetic data
     Miro Cupak (@mirocupak), Senior Software Engineer, DNAstack
     Index San Francisco, 22/02/2018
  2. What and why?
     • Beacon Network (https://beacon-network.org/)
       • from the Global Alliance for Genomics and Health (GA4GH)
       • largest search and discovery engine of human genomic variation
     • case study talk
       • domain background
       • standard, architecture and technologies
       • fun with stats
  3. Background
  4. https://beacon-network.org
  5. https://beacon-network.org
  6. [Image-only slide]
  7. Trends
     • sequencing cost decreasing exponentially (3M times since 2000)
     Source: https://www.nature.com/news/technology-the-1-000-genome-1.14901
  8. Trends
     • genomic data volumes increasing exponentially (1M times since 2000)
     Source: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  9. Trends
     • up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
     [Chart: Expected Data Volumes by 2025. Data volumes (GB) for Twitter, YouTube and Genomics, with lower and upper bounds.]
     Source: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  10. Problem
      • no single institution will have sufficient resources
      • still, institutions don't have enough data
        • common diseases
        • rare diseases
      • challenge
        • discovering data
      • solution
        • traditional approach of data aggregation in a single centralized site not working
        • federated system capable of executing cross-dataset and cross-institution queries is needed
        • Beacon Network
  11. Global Alliance for Genomics & Health
      • nonprofit standards alliance
      • a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
      • goal: enable responsible sharing of genomic and clinical data
      • since 2013
      Sources: http://ga4gh.org/ and https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf
  12. Beacon Project
      • experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
      • named after the SETI project (Search for Extra-Terrestrial Intelligence, http://history.nasa.gov/seti.html)
      • initiative requiring collaboration of many different GA4GH groups
      • started in March 2014, quickly gained traction
  13. Beacon
  14. Beacon
      • simple web service allowing users to query an institution's databases to determine whether they contain a genetic variant of interest
      • receives questions of the form "Do you have information about this mutation?"
      • responds with yes or no, optionally with additional information about the mutation
      • design principles
        • A beacon has to be technically simple.
        • A beacon has to minimize risks associated with genomic data sharing.
        • It has to be possible to make a beacon publicly available.
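
The question-and-answer contract described on this slide can be illustrated with a minimal Python sketch. It is not the GA4GH Beacon API: the `BeaconQuery` and `BeaconResponse` types, the `ask_beacon` function and the in-memory `KNOWN_VARIANTS` set are all hypothetical, and the coordinates are illustrative values only.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for an institution's variant database, keyed by
# (assembly, chromosome, position, allele). Values are illustrative only.
KNOWN_VARIANTS = {
    ("GRCh37", "13", 32936732, "C"),
    ("GRCh37", "7", 140453136, "C"),
}


@dataclass(frozen=True)
class BeaconQuery:
    """'Do you have information about this mutation?'"""
    assembly: str
    chromosome: str
    position: int
    allele: str


@dataclass(frozen=True)
class BeaconResponse:
    """Yes or no, optionally with additional information."""
    exists: bool
    info: Optional[str] = None


def ask_beacon(query: BeaconQuery) -> BeaconResponse:
    # The beacon only reveals whether the variant is present; it never
    # returns the underlying genomic records.
    key = (query.assembly, query.chromosome, query.position, query.allele)
    return BeaconResponse(exists=key in KNOWN_VARIANTS)


if __name__ == "__main__":
    print(ask_beacon(BeaconQuery("GRCh37", "13", 32936732, "C")))
    print(ask_beacon(BeaconQuery("GRCh37", "1", 12345, "A")))
```

Returning only a presence flag is what keeps such a service technically simple and limits what a caller can learn about the underlying data.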
  15. Standard: Before Beacon Network
      • no formal specification
      • Receives questions of the form "Do you have information about this mutation?" Responds with yes or no.
      • 4 public beacons, each with a different API
        • request method
        • supported parameters
        • parameter names
        • chromosome identifiers
        • positional base
        • assembly notation
        • supported alleles
        • dataset support
        • response format
        • data included in the response
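
To make a few of these differences concrete (parameter names, chromosome identifiers, positional base), here is a hedged Python sketch of query normalization. The canonical form chosen here (bare chromosome name, 0-based position, upper-case allele), the alias table and all function names are assumptions for illustration, not the Beacon Network's actual rules.

```python
def normalize_chromosome(raw: str) -> str:
    """Map identifiers such as 'chr7', 'CHR7' or '7' to '7'."""
    value = raw.strip()
    if value.lower().startswith("chr"):
        value = value[3:]
    return value.upper()  # 'x' -> 'X', 'mt' -> 'MT'


def to_zero_based(position: int, positional_base: int) -> int:
    """Convert a 0-based or 1-based coordinate to 0-based."""
    if positional_base not in (0, 1):
        raise ValueError("positional base must be 0 or 1")
    return position - positional_base


def normalize_query(params: dict, aliases: dict, positional_base: int) -> dict:
    """Rename beacon-specific parameters and normalize their values.

    `aliases` maps a beacon's parameter names to the canonical names
    'chromosome', 'position' and 'allele' (hypothetical canonical names).
    """
    canonical = {aliases.get(name, name): value for name, value in params.items()}
    return {
        "chromosome": normalize_chromosome(str(canonical["chromosome"])),
        "position": to_zero_based(int(canonical["position"]), positional_base),
        "allele": str(canonical["allele"]).upper(),
    }


if __name__ == "__main__":
    aliases = {"chrom": "chromosome", "pos": "position", "alt": "allele"}
    print(normalize_query({"chrom": "chr7", "pos": "140453136", "alt": "c"}, aliases, 1))
```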
  16. Standard: Before Beacon Network
  17. Standard: 0.1
      • 2014
      • really simple (2 records)
      • true/false response
      • format: Avro
      • not enough traction
      • too vague
      • issues partially addressed by the Beacon Network
  18. Standard: 0.2
      • 2015
      • complex (9 records)
      • true/false/overlap/null response
      • datasets
      • simple data use conditions
      • self description
      • format: Avro
      • not well adopted
      • not polished enough
  19. Standard: 0.3
      • 2016
      • simplified 0.2
      • based on real needs, successful
      • true/false/null response
      • improved support for datasets and cross-dataset queries
      • modular and extensible
      • data versioning
      • various improvements to the data model, more metadata, extended response
      • tooling
      • format: Avro to Proto3
  20. Standard: 0.4
      • 2018
      • stable and more flexible
      • support for complex variants
      • improved error handling
      • improved data use conditions
      • various minor improvements
      • developer experience
      • format: Proto3 to OpenAPI
  21. Beacon Network
  22. Requirements
      • federation of queries across beacons
      • integration of publicly available beacons
      • aggregation of data from multiple sources
      • online distribution of queries without the need to store genomic data
      • registry of public beacons
      • programmatically accessible
      • easily accessible
      • unified beacon API
      • push for standardization of the standard
      • performance
      • scalability
      • modularity and extensibility
      • logging and audit trail
      • beacon monitoring
      • lower barrier of entry for beacon developers
      • development under the umbrella of GA4GH
  23. Architecture
  24. Data
      • access data stored in a relational database
  25. Service
      • communication with other subsystems
      • query normalization
      • aggregators
      • participant resolution
      • query distribution
      • audit trail
      • L1 parallelization
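
As a rough illustration of query distribution and aggregation, the Python sketch below fans one query out to several beacons in parallel and combines the true/false/null answers. It assumes that "L1 parallelization" means parallelism across beacons and that any positive answer makes the aggregate positive; both are assumptions of this sketch, as are the `distribute` and `aggregate` names. The real system is not written in Python.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# A beacon is modeled as a callable returning True, False, or None (no answer).
Beacon = Callable[[dict], Optional[bool]]


def distribute(query: dict, beacons: Dict[str, Beacon]) -> Dict[str, Optional[bool]]:
    """Send the same query to every beacon in parallel."""

    def ask(beacon: Beacon) -> Optional[bool]:
        try:
            return beacon(query)
        except Exception:
            # A failing beacon must not break the whole search.
            return None

    with ThreadPoolExecutor(max_workers=max(1, len(beacons))) as pool:
        futures = {name: pool.submit(ask, beacon) for name, beacon in beacons.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(responses: Dict[str, Optional[bool]]) -> Optional[bool]:
    """Combine per-beacon answers into a single true/false/null answer."""
    values = list(responses.values())
    if any(value is True for value in values):
        return True
    if any(value is False for value in values):
        return False
    return None


if __name__ == "__main__":
    def broken(_query: dict) -> Optional[bool]:
        raise RuntimeError("beacon unavailable")

    beacons = {"a": lambda q: True, "b": lambda q: False, "c": broken}
    answers = distribute({"chromosome": "7"}, beacons)
    print(answers, aggregate(answers))
```

Because no genomic data is stored centrally, each search is answered purely from the live responses collected this way.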
  26. Processor
      • executing a query against a beacon and processing its response
      • management of a flexible, dynamic and easily extensible query execution pipeline
      • pipeline stages resolution (CDI and EJB)
      • L2 parallelization
      • cross-assembly query handling
  27. Converter
      • first stage in the query execution pipeline
      • translating query parameters
  28. Requester
      • second stage in the query execution pipeline
      • constructing beacon requests based on their URIs and parameters produced by the converters
  29. Fetcher
      • third stage in the query execution pipeline
      • unit actually talking to the API of beacons
      • submitting requests over the network and obtaining the raw response
  30. Parser
      • last stage in the pipeline
      • extracting information of interest from the raw response obtained by a fetcher
      • dealing with various formats
      • handling metadata, multiple responses, errors
      • response normalization
      • parallelized
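
The four pipeline stages (converter, requester, fetcher, parser) can be sketched as plain functions chained together. This is a simplified Python illustration only: the actual pipeline resolves its stages through CDI and EJB, and the parameter names, the example URL and the accepted response shapes below are hypothetical. The fetcher is injected so the sketch runs without network access.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode

# Hypothetical mapping from canonical parameter names to one beacon's names.
PARAM_NAMES = {"assembly": "ref", "chromosome": "chrom", "position": "pos", "allele": "allele"}


def convert(query: dict) -> dict:
    """Stage 1 (converter): translate query parameters for a given beacon."""
    return {PARAM_NAMES[name]: value for name, value in query.items()}


def build_request(base_uri: str, params: dict) -> str:
    """Stage 2 (requester): construct the request from the URI and parameters."""
    return f"{base_uri}?{urlencode(params)}"


def parse(raw: str) -> Optional[bool]:
    """Stage 4 (parser): normalize differently shaped raw responses."""
    try:
        body = json.loads(raw)
    except ValueError:
        body = raw  # not JSON; treat as plain text
    if isinstance(body, dict):
        body = body.get("exists", body.get("response"))
    if isinstance(body, bool):
        return body
    if isinstance(body, str):
        answers = {"yes": True, "true": True, "no": False, "false": False}
        return answers.get(body.strip().lower())
    return None  # unrecognized format or error


def run_pipeline(query: dict, base_uri: str, fetcher: Callable[[str], str]) -> Optional[bool]:
    """Run all stages; `fetcher` is stage 3 and performs the actual request."""
    url = build_request(base_uri, convert(query))
    return parse(fetcher(url))


def fake_fetcher(url: str) -> str:
    """Stand-in for a network call, used only to keep the sketch offline."""
    return '{"exists": true}' if "pos=140453136" in url else "no"


if __name__ == "__main__":
    query = {"assembly": "GRCh37", "chromosome": "7", "position": 140453136, "allele": "C"}
    print(run_pipeline(query, "https://beacon.example.org/query", fake_fetcher))
```

Keeping the stages separate means supporting a beacon with a different API only requires swapping the stage that differs, for example a parser for another response format.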
  31. Mapper
      • translation between different representations of objects
  32. REST
      • handling client requests
      • data serialization
  33. Search execution
  34. Stats
  35. Size
      • ~100 installations, 40 institutions, 18 countries, 6 continents
  36. Users
      • 13k users, 136 countries
  37. Searches
  38. Assemblies
      [Pie chart]
      • GRCh37: 83%
      • GRCh38: 6%
      • Others: 11%
  39. Chromosomes
      [Pie chart]
      • Chr. 2: 18%
      • Chr. 17: 14%
      • Chr. 1: 11%
      • Chr. 13: 11%
      • Chr. 7: 7%
      • Others: 39%
  40. Variants
      • 84k distinct variants
      [Pie chart; entries are chromosome : position allele (gene)]
      • 2 : 38938 C (FAM110C): 6%
      • 13 : 32936732 C (BRCA2): 6%
      • 22 : 46546565 A (PPARA): 3%
      • 2 : 45895 G (FAM110C): 3%
      • 7 : 140453136 C (BRAF): 2%
      • 1 : 43815163 C (MPL): 2%
      • 1 : 115258747 A (NRAS): 1%
      • 14 : 23894969 A (MYH7): 1%
      • 2 : 29432776 C (ALK): 1%
      • 2 : 212289100 C (ERBB4): 1%
      • Others: 74%
  41. Deleteriousness
      [Two histograms: number of variants (log scale) by score, 0.00 to 0.99]
      • SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
      • PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign
  42. Rarity
      • 25% rare variants (in the 1,000 Genomes Project, August 2015 release)
      [Histogram: number of variants (log scale) by allele frequency, 0.00 to 1.00]
  43. Genes
      [Ranked table (symbol, name) with pie chart shares]
      • 1 FAM110C: Family With Sequence Similarity 110 Member C (11%)
      • 2 BRCA1: BRCA1, DNA Repair Associated (10%)
      • 3 BRCA2: BRCA2, DNA Repair Associated (9%)
      • 4 PPARA: Peroxisome Proliferator Activated Receptor Alpha (4%)
      • 5 ERBB4: Erb-B2 Receptor Tyrosine Kinase 4 (3%)
      • 6 BRAF: B-Raf Proto-Oncogene, Serine/Threonine Kinase (3%)
      • 7 MPL: MPL Proto-Oncogene, Thrombopoietin Receptor (2%)
      • 8 MYH7: Myosin Heavy Chain 7 (2%)
      • 9 KIT: KIT Proto-Oncogene Receptor Tyrosine Kinase (1%)
      • 10 RET: Ret Proto-Oncogene (1%)
      • Others: 53%
  44. Disorders & clinical abnormalities
      • OMIM
        • 1 Pancreatic cancer, susceptibility to, 4
        • 2 Breast-ovarian cancer, familial, 1
        • 3 Fanconi anemia, complementation group D1
        • 4 Prostate cancer
        • 5 Pancreatic cancer 2
        • 6 Medulloblastoma
        • 7 Glioblastoma 3
        • 8 Breast-ovarian cancer, familial, 2
        • 9 Breast cancer, male, susceptibility to
        • 10 Wilms tumor
      • HPO
        • 1 Autosomal dominant inheritance
        • 2 Autosomal recessive inheritance
        • 3 Scoliosis
        • 4 Short stature
        • 5 Cognitive impairment
        • 6 Constipation
        • 7 Somatic mutation
        • 8 Cafe-au-lait spot
        • 9 Failure to thrive
        • 10 Nausea and vomiting
  45. Getting involved
      • Contribute on GitHub
        • https://github.com/ga4gh/beacon-team/
      • Google Summer of Code
        • https://summerofcode.withgoogle.com/organizations/5727014175113216/
      • DNAstack
        • https://dnastack.com/#/team/careers
  46. Questions?
      https://mirocupak.com