Enabling Biobank-Scale Genomic Processing with Spark SQL


With genomic data volumes doubling every seven months, existing tools designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while remaining flexible enough for ad-hoc analysis, Databricks and the Regeneron Genetics Center have partnered to launch an open-source project.

The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
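In code, that model looks roughly like the following sketch (hedged: the file paths and the cat placeholder command are illustrative, and the pipe transformer options mirror those shown on the slides below):

    import json
    import glow

    # 1. Ingest flat files into a DataFrame.
    df = spark.read.format("vcf").load("genotypes.vcf")

    # 2. Prepare the data with common Spark SQL primitives.
    prepared = df.where("contigName = '22'")

    # 3. Process each partition with an existing single-node tool via Glow's pipe
    #    transformer ("cat" stands in for a real genomic analysis command; cmd is
    #    passed as a JSON-encoded argument list).
    results = glow.transform(
        "pipe", prepared,
        cmd=json.dumps(["cat"]),
        input_formatter="vcf",
        in_vcf_header="infer",
        output_formatter="text")

    # 4. Save the results to Delta (or back to flat files).
    results.write.format("delta").save("/tmp/results.delta")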


  1. 1. WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Karen Feng, Databricks Enabling Biobank-Scale Genomic Processing with Spark SQL #UnifiedDataAnalytics #SparkAISummit
  3. 3. Agenda • Genomics overview – Big data problem – Real-world applications – Pain points at biobank scale • Glow – Datasources – Built-in functions – Extensibility 3
  4. 4. Agenda (now: Genomics overview – Big data problem)
  5. 5. Genomics is a big data problem 5 40,000 petabytes/year by 2025. Sequencing cost: from $2.7B to <$1,000 per genome. https://www.genome.gov/27541954/dna-sequencing-costs-data/ https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  6. 6. Agenda (now: Genomics overview – Real-world applications)
  7. 7. The power of big genomic data 7 Accelerate Target Discovery. Goal: identify a biological target (e.g. a protein) that can be modulated with a drug. Approach: large-scale regressions to correlate DNA variants with the trait. Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/ [figure: orthosteric inhibition]
  10. 10. The power of big genomic data 10 Accelerate Target Discovery • Reduce Costs via Precision Prevention • Improve Survival with Optimized Treatment
  11. 11. Agenda (now: Genomics overview – Pain points at biobank scale)
  12. 12. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate 12
  17. 17. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate 17
      for i in chr2L chr2R chr3L chr3R chr4 chrX; do GenomeAnalysisTK -R ${ref_seq} -T SelectVariants -V my_flies.vcf -L $i -o my_flies.${i}.vcf; done
      for i in {1..16}; do vcftools --vcf VCF_FILE --chr $i --recode --recode-INFO-all --out VCF_$i; done
      bgzip -c myvcf.vcf > myvcf.vcf.gz
      tabix -p vcf myvcf.vcf.gz
      tabix myvcf.vcf.gz chr1 > chr1.vcf
      java -jar SnpSift.jar split file.vcf
  19. 19. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate 19 838 results
  20. 20. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate 20 Data management --make-bed --recode --output-chr --zero-cluster --split-x/--merge-x --set-me-missing --fill-missing-a2 --set-missing-var-ids --update-map... --update-ids... --flip --flip-scan --keep-allele-order... --indiv-sort --write-covar... --{,b}merge... Merge failures VCF reference merge --merge-list --write-snplist --list-duplicate-vars Basic statistics --freq{,x} --missing --test-mishap --hardy --mendel --het/--ibc --check-sex/--impute-sex --fst Linkage disequilibrium --indep... --r/--r2 --show-tags --blocks Distance matrices Identity-by-state/Hamming (--distance...) Relationship/covariance (--make-grm-bin...) --rel-cutoff Distance-pheno. analysis (--ibs-test...) Identity-by-descent --genome --homozyg... Population stratification --cluster --pca --mds-plot --neighbour Association analysis Basic case/control (--assoc, --model) Stratified case/control (--mh, --mh2, --homog) Quantitative trait (--assoc, --gxe) Regression w/ covariates (--linear, --logistic) --dosage --lasso --test-missing Monte Carlo permutation Set-based tests REML additive heritability Family-based association --tdt --dfam --qfam... --tucc Report postprocessing --annotate --clump --gene-report --meta-analysis Epistasis --fast-epistasis --epistasis --twolocus Allelic scoring (--score) R plugins (--R)
  26. 26. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate 26 1. Converting one file format to another file format. 2. Converting one file format to another file format. 3. Converting one file format to another file format.
  27. 27. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate 27 “Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...”
  29. 29. Agenda (now: Glow)
  31. 31. Glow • Open-source toolkit for large-scale genomic analysis • Built on Spark for biobank scale • Query and use built-in commands with familiar languages using Spark SQL • Compatible with existing genomic tools and formats, as well as big data and ML tools 31
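A minimal setup sketch (the package coordinates and version string are assumptions; check projectglow.io for current ones):

    import glow
    from pyspark.sql import SparkSession

    # Start a Spark session with the Glow JAR on the classpath
    # (coordinates are an assumption; see projectglow.io).
    spark = (SparkSession.builder
             .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.2.1")
             .getOrCreate())

    # Register Glow's SQL functions (e.g. linear_regression_gwas) on the session.
    spark = glow.register(spark)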
  32. 32. Agenda (now: Glow – Datasources)
  33. 33. Genomic variant data 33
  35. 35. Genomic variant data 35 Static fields: always present
  36. 36. Genomic variant data 36 Chromosome: StringType
  37. 37. Genomic variant data 37 Variant information: depends on metadata
  38. 38. Genomic variant data 38 MapType(StringType, StringType): {"DP" -> "14", "AF" -> "0.5"}
  40. 40. Genomic variant data 40 MapType(StringType, StringType) ##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency"> ##INFO=<ID=AF, Number=?, Type=?, Description=?>
  41. 41. Genomic variant data 41 MapType(StringType, StringType): loses metadata and slows querying
  42. 42. Genomic variant data 42 Dynamic schema: preserves metadata and enables fast querying
  43. 43. Genomic variant data 43
      StructField(
        name = "INFO_AF",
        dataType = DoubleType,
        nullable = true,
        metadata = Map(
          "vcf_header_count" -> "A",
          "vcf_header_description" -> "Allele Frequency"))
      ##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency">
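From PySpark, the header metadata carried on the schema can be inspected directly (a sketch; assumes df was loaded with the VCF reader and the file declares INFO_AF):

    # StructField metadata round-trips the VCF header line.
    af_field = df.schema["INFO_AF"]
    print(af_field.dataType)   # numeric type derived from Type=Float
    print(af_field.metadata)   # {'vcf_header_count': 'A', 'vcf_header_description': 'Allele Frequency'}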
  45. 45. Genomic variant data 45 Genotype information: depends on metadata
  46. 46. Genomic variant data 46 Genotype information: width depends on number of samples
  49. 49. Genomic variant data 49
      Sample             NA00001  NA00002  ...
      Genotype           0|0      0|0
      Genotype quality   48       49
      Depth              1        3
      Haplotype quality  51,51    58,50
      UK Biobank has 500,000 participants!
  50. 50. Genomic variant data 50
      Sample   Genotype  Genotype quality  Depth  Haplotype quality
      NA00001  0|0       48                1      51,51
      NA00002  0|0       49                3      58,50
      ...
  51. 51. Genomic variant data • Static fields – e.g. Chromosome • Dynamic fields – Variant information – Genotype information • Preserves metadata • Fast querying • Limited number of columns 51
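Concretely, the schema combines the static fields, one column per INFO field, and a single genotypes array, so the column count stays fixed as the number of samples grows (a sketch; exact field names depend on the Glow version and the VCF header):

    df = spark.read.format("vcf").load("genotypes.vcf")
    df.printSchema()
    # root
    #  |-- contigName: string            <- static field (chromosome)
    #  |-- start: long
    #  |-- referenceAllele: string
    #  |-- alternateAlleles: array<string>
    #  |-- INFO_AF: array<double>        <- dynamic: one column per INFO field
    #  |-- genotypes: array<struct>      <- one element per sample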
  52. 52. Genomic variant data 52 VCF → VCF rows: spark.read.format("vcf").load("genotypes.vcf")
  53. 53. Genomic variant data 53 VCF rows → VCF: df.write.format("vcf").save("genotypes.vcf")
  54. 54. Genomic variant data 54 VCF rows → Delta: df.write.format("delta").save("genotypes.delta")
  55. 55. Delta Lake 55 • Genomic data – VCF, BGEN, BED • Medical images • Electronic health records • Waveform data • Real world evidence • ...
  56. 56. Agenda (now: Glow – Built-in functions)
  57. 57. Built-in functions • Convert genotype probabilities to hard calls • Normalize variants • Liftover between reference assemblies • Annotate variants • Genome-wide association studies • ... 57
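For example, two of these functions as they might be invoked from Spark SQL (a sketch: the function names follow Glow's documentation, but treat the exact signatures, paths, and thresholds as assumptions):

    # Convert genotype probabilities (e.g. from BGEN) to hard calls;
    # arguments: probabilities, number of alt alleles, phased flag.
    calls_df = bgen_df.selectExpr(
        "*", "hard_calls(genotypes[0].posteriorProbabilities, 1, false) as hardCalls")

    # Liftover variant coordinates between reference assemblies using a UCSC chain
    # file (the chain file path is a placeholder).
    lifted_df = variant_df.selectExpr(
        "*", "lift_over_coordinates(contigName, start, end, '/path/to/b37ToHg38.chain', 0.95) as lifted")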
  59. 59. GWAS • linear_regression_gwas • logistic_regression_gwas • Single-node bioinformatics tools 59
  60. 60. Single-node bioinformatics tools • SAIGE – R library – VCF → CSV 60 http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
  61. 61. Single-node bioinformatics tools • Require flat file splicing and combination 61
  62. 62. Single-node bioinformatics tools 62 [diagram: flat text files spliced and recombined across many command-line tool invocations]
  63. 63. rdd.pipe() 63 [diagram: text RDD → stdin → command-line tool on each worker → stdout → text RDD]
  64. 64. rdd.pipe() • Input and output RDDs have single text column – Input: set header as pipe context – Output: mixed header and text data • Convert between genomic file formats – Changing specs 64
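A minimal rdd.pipe() example for comparison (standard Spark API; cat stands in for a real tool):

    # Each element of the RDD is written as a line to the subprocess's stdin;
    # each line of its stdout becomes an element of the result RDD.
    rdd = spark.sparkContext.parallelize(["#CHROM\tPOS", "22\t35292447"])
    piped = rdd.pipe("cat")
    piped.collect()  # ['#CHROM\tPOS', '22\t35292447']

The header handling described above is exactly what makes this awkward for VCF: the header is just another text element, so it must be threaded through as pipe context on the way in and separated from the data again on the way out.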
  65. 65. glow.transform('pipe') 65 [diagram: DataFrame (VCF, CSV, text) → stdin → command-line tool (SAIGE) on each worker → stdout → DataFrame (VCF, CSV, text)]
  66. 66. glow.transform('pipe') 66
      glow.transform(
          "pipe", input_df,
          cmd=cmd,
          input_formatter='vcf',
          in_vcf_header='infer',
          output_formatter='csv',
          out_header='true',
          out_delimiter=' ')
  67. 67. glow.transform('pipe') (input: DataFrame → VCF) 67 glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header='infer', output_formatter='csv', out_header='true', out_delimiter=' ')
  68. 68. glow.transform('pipe') • VCF input formatter – Set header based on schema – Convert Spark Rows to Java objects – Third-party library writes header and variant rows 68
      StructField(
        name = "INFO_AF",
        dataType = DoubleType,
        nullable = true,
        metadata = Map(
          "vcf_header_count" -> "A",
          "vcf_header_description" -> "Allele Frequency"))
      ##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency">
  69. 69. glow.transform('pipe') (cmd: Rscript step2_SPAtests.R) 69 glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header='infer', output_formatter='csv', out_header='true', out_delimiter=' ')
  70. 70. glow.transform('pipe') • For each partition – Input formatter writes to the command’s stdin – Output formatter reads from the command’s stdout – If running the command triggers an exception, the error is propagated to the driver 70
  71. 71. glow.transform('pipe') (output: CSV → DataFrame) 71 glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header='infer', output_formatter='csv', out_header='true', out_delimiter=' ')
  72. 72. glow.transform('pipe') • CSV output formatter – Write schema to first element in iterator – Write remaining rows to iterator 72
      CHR  POS       BETA   SE     p.value
      22   35292447  1.206  3.285  0.714
      22   35292456  1.358  2.534  0.592
      StructType(Seq("CHR", "POS", "BETA", "SE", "p.value").map(StructField(_, StringType)))
      InternalRow("22", "35292447", "1.206", "3.285", "0.714")
      InternalRow("22", "35292456", "1.358", "2.534", "0.592")
  73. 73. glow.transform('pipe') • Input and output DataFrames – Input: infer header from schema – Output: infer schema from header • Convert genomic data under the hood – Spark Row ↔ Java object ↔ text 73
  74. 74. Agenda (now: Glow – Extensibility)
  75. 75. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 75
  76. 76. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 76 spark.read.format("vcf").load("genotypes.vcf")
  77. 77. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 77 variant_df.selectExpr("*", "expand_struct(call_summary_stats(genotypes))", "expand_struct(hardy_weinberg(genotypes))") .where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) & (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) & (col("pValueHwe") >= hwe_cutoff))
  78. 78. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 78 qc_df.write.format("delta").save(delta_path)
  79. 79. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 79 matrix.computeSVD(num_pcs)
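The ancestry step uses Spark MLlib's distributed SVD; a sketch of the surrounding code (RowMatrix and computeSVD are standard MLlib APIs, but building gt_rows from genotype states is an assumption):

    from pyspark.mllib.linalg.distributed import RowMatrix

    # gt_rows: an RDD of per-sample numeric vectors of alt-allele counts (0/1/2).
    matrix = RowMatrix(gt_rows)
    svd = matrix.computeSVD(num_pcs, computeU=True)
    pcs = svd.U  # left singular vectors: per-sample principal components,
                 # used as covariates in the regression step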
  80. 80. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 80 genotypes.crossJoin(phenotypeAndCovariates).selectExpr("expand_struct(linear_regression_gwas(genotype_states(genotypes), phenotype_values, covariates))")
  81. 81. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 81
      gwas_results_rdf <- as.data.frame(gwas_results)
      install.packages("qqman", repos="http://cran.us.r-project.org")
      library(qqman)
      png('/databricks/driver/manhattan.png')
      manhattan(gwas_results_rdf)
      dev.off()
  82. 82. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 82 http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
  83. 83. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot 83 mlflow.log_artifact( '/databricks/driver/manhattan.png')
  84. 84. GWAS pipeline 84 [diagram: VCF → DataFrame → QC’d DataFrame → (+ Phenotypes, Ancestry) → GWAS hits]
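Condensing the preceding slides into one end-to-end sketch (cutoffs, paths, and the phenotype DataFrame are placeholders; the function names are the ones used on the slides above):

    from pyspark.sql.functions import col

    # 1. Load variants.
    variant_df = spark.read.format("vcf").load("genotypes.vcf")

    # 2. Quality control (allele frequency and Hardy-Weinberg cutoffs are placeholders).
    qc_df = (variant_df
        .selectExpr("*",
                    "expand_struct(call_summary_stats(genotypes))",
                    "expand_struct(hardy_weinberg(genotypes))")
        .where((col("alleleFrequencies")[0] >= 0.01) &
               (col("alleleFrequencies")[0] <= 0.99) &
               (col("pValueHwe") >= 1e-6)))
    qc_df.write.format("delta").save("/tmp/genotypes.delta")

    # 3-4. Regression with ancestry PCs as covariates; phenotype_and_covariates is a
    #      hypothetical one-row DataFrame carrying phenotype_values and covariates.
    gwas_df = (qc_df.crossJoin(phenotype_and_covariates)
        .selectExpr("expand_struct(linear_regression_gwas("
                    "genotype_states(genotypes), phenotype_values, covariates))"))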
  85. 85. 85 projectglow.io
