All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study

Molecular profiling provides precise and individualized cancer treatment options and decision points by assessing DNA, RNA, proteins, etc.


  1. All In: Migrating a Genomics Pipeline from BASH/Hive to Spark and Azure Databricks—A Real World Case Study. Victoria Morris, Unicorn Health Bridge Consulting, working for Atrium Health
  2. Agenda (Victoria Morris) ▪ Overview of LInK ▪ Issues – why change? ▪ Next Moves ▪ Migration – Starting Small: Pharmacogenomics Pipeline ▪ Clinical Trials Matching Pipeline ▪ The Great Migration: Hive → Databricks ▪ Things We Learned ▪ Business Impact
  3. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
  4. Overview: LInK
  5. Original Problem Statement(s) ▪ Genomic reports are hard to find in the Electronic Medical Record (EMR) ▪ The reports are difficult to read (many pages long), are different from each lab, may not have relevant recommendations, and require manual effort to summarize ▪ Presenting relevant Clinical Trials to providers when making treatment decisions will increase Clinical Trial participation ▪ As a Center of Excellence (COE) for the American Society of Clinical Oncology (ASCO)’s Targeted Agent and Profiling Utilization Registry (TAPUR) clinical trial, clinical outcomes and treatment data must be reported back to the COE for patients enrolled in the studies ▪ The current process is complicated, time-consuming, and manual
  6. Overview ▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide interoperability of data between different LCI data sources ▪ Specifically, to address the multiple data silos that contain related data, which is a consistent challenge across the System ▪ Data meaning must be transferred, not just values ▪ Apple: fruit vs. computer ▪ Originally we had 4 people, and we all had day jobs
  7. (Data flow diagram) Specialized external testing: test results, PDFs, and raw sequence data in; clinical decision support out (external – SFTP/Data Factory). Specialized internal testing: test results and raw sequence data in, PDF out (internal). Clinical Trials Management software (on-premise, soon to be cloud). EMR clinical data (Cerner reporting database/EDW). EAPathways embedded in Cerner via SMART/FHIR. Genomic results and PDF reports via Tier 1 SharePoint for molecular tumor board review. Converting raw reads to genotype → phenotype and generating a report for the provider. LCI encounter data (EDW). Unstructured notes (e.g., Cerner reporting database). EAPathways database (on-premise DB). Integration with Office 365 (external API). POC areas: Clinical Decision Support, Clinical Trials Matching, Pharmacogenomics.
  8. LInK Data Connections – High Level (architecture diagram). Cloud (Azure): Frd1Storage, Azure Storage, genomic pipelines auto-generated by web apps (MS Web Apps, MS SharePoint Designer). On-premise databases: EDW (Enterprise Data Warehouse), EAPathways (Clinical Decision Support), OnCore (Clinical Trials Management), Netezza, ARIA (Radiation Treatments), CoPath (Pathology). On-premise labs: Cerner, EPIC, CRSTAR, Genomics Lab. External labs: Caris, Inivata, FMI.
  9. LInK Data Connections – High Level (updated architecture diagram). Cloud (Azure): Frd1Storage, Azure Storage, genomic pipelines, PharmacoGenomics. On-premise databases: EDW (Enterprise Data Warehouse), EAPathways (Clinical Decision Support), OnCore (Clinical Trials Management), Netezza, ARIA (Radiation Treatments), CoPath (Pathology). On-premise labs: Cerner, EPIC, CRSTAR, Genomics Lab. External labs: Tempus, Caris, Inivata, FMI, via external vendors’ containers.
  10. Issues
  11. Issues ▪ We run 365 days a year ▪ The data is used in real time by providers to make clinical treatment decisions for patients with cancer; any breakdown in the pipeline is a Priority 1 issue that needs to be fixed as soon as possible ▪ We were early adopters of HDI – this server has been up since 2016 – it is old technology, and HDI was not built for servers to live this long.
  12. Issues cont’d ▪ Randomly the cluster would freeze and go into SAFE mode with no warning; this happened on a weekly basis, often several days in a row, during the overnight batch. ▪ We were past the default allocation of 10,000 Tez counters and had to change the runs to always request additional ones, back when we were at around 3,000 lines of Hive code. ▪ Although we tried using matrix manipulation in Hive, at some point you just need a loop.
  13. Issues cont’d ▪ The cost of keeping the HDI cluster up 24×365 was very high, so we scaled it up and down to help reduce costs. ▪ The cluster was not stable because we were scaling up and down every day; at one point there were so many logs from the daily scaling that it took the entire HDI cluster down.
  14. Issues cont’d ▪ Twice the cluster went down so badly that MS Support’s response was to destroy it and start again, which we did the first time… ▪ Our HDI server choice tied us to Hive v2 and forced us to disable vectorized execution – we had to constantly set hive.vectorized.execution.enabled=false; throughout the script because it would “forget”, which slowed down processing.
  15. Next Moves
  16. Search ▪ We wanted something that was cheaper ▪ We wanted to keep our old WASB storage – not have to migrate the data lake ▪ We wanted flexibility in language options for ongoing operations and continuity of care; we did not want to get boxed into just one ▪ We wanted something less agnostic, more fully integrated into the Microsoft ecosystem
  17. Search cont’d ▪ We needed it to be HIPAA compliant because we were working with patient data. ▪ We needed something that was self-sufficient in cluster management so we could concentrate on the programming instead of the infrastructure. ▪ We really liked the notebook concept – and had started experimenting with Jupyter notebooks inside HDI
  18. LInK Data Connections – High Level (same architecture diagram as slide 9).
  19. Migration
  20. Migration – starting small ▪ There is a steep learning curve to get into Databricks ▪ We had a new project – a second pipeline that had to be built – and it seemed easier to start with something smaller than the 8,000 lines of Hive code that would be involved if we started by transitioning the original pipeline.
  21. Pharmacogenomics (in progress)
  22. Pharmacogenomics: we receive raw genomic test results from our internal lab
  23. Pharmacogenomics: single notebook
  24. Overview: Genomic Clinical Trials Pipeline
  26. Clinical Trial Match Criteria: age (today’s), gender, first-line eligible (no previous anti-neoplastics ordered), genomic results (over 1,290 genes), diagnosis, tumor site, secondary gene results (must have/must not have a specific protein change/mutation), previous lab results, previous medications
  27. Opening Screen
  28. LInK Data Connections – High Level (same architecture diagram as slide 9).
  29. The Great Migration
  30. Pipeline overview: (1) process the Tempus, Caris, FMI, and Inivata files – preprocess each lab into a similar data format; (2) Main Match – create the clinical trial matches; (3) Create Summary – create the genomic summary, combine it with the matches, and save to the database.
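     One way to chain stages like these is to keep each stage in its own Databricks notebook and drive them from a small controller notebook. The sketch below assumes that layout; the notebook paths and the "lab" parameter are hypothetical placeholders, not the pipeline's actual names.

        # Hypothetical controller notebook for the three stages above (Python).
        # (1) Preprocess each external lab's files into a common format.
        for lab in ["tempus", "caris", "fmi", "inivata"]:
            dbutils.notebook.run("/Pipelines/preprocess_lab", 3600, {"lab": lab})

        # (2) Main Match: match the normalized genomic results against open trials.
        dbutils.notebook.run("/Pipelines/main_match", 7200)

        # (3) Create Summary: build the genomic summary, combine with the matches,
        #     and save the result to the database.
        dbutils.notebook.run("/Pipelines/create_summary", 3600)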
  31. Hive Conversion
  32. Initial Definitions (side-by-side Hive vs. Databricks code screenshots)
  33. Reading the File – not a separate step in Hive, it is part of the next step (side-by-side Hive vs. Databricks code screenshots)
  34. Creating a Clean View of the Data (side-by-side Hive vs. Databricks code screenshots)
  35. Databricks by the numbers ▪ We work in a Premium workspace, using our internal IP addresses inside a secured subnet inside the Atrium Health Azure subscription ▪ Databricks is fully HIPAA compliant ▪ Clusters are created with predefined tags, and the costs associated with each tagged cluster’s runs can be separated out ▪ Our data lake is ~110 terabytes ▪ We have 2.3+ million gene results × 240+ CTC to match against 10 criteria ▪ Yes, even during COVID-19 we are still seeing an average of 1 new report a day – we still run 365 days a year
  36. Things we learned
  37. Azure Key Vaults and Back-up ▪ Azure Key Vaults are tricky to implement, and you only need to set up the connection once on a new workspace – so save those instructions! ▪ They are a very secure way to save all your connection info without having it in plain text in the notebook itself. ▪ Do not forget to save a copy of everything periodically offline – if your workspace goes, you lose all the notebooks and any manually uploaded data tables… ▪ Yes, we have had to replace the workspace twice in this project
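     A minimal sketch of pulling connection info from a Key Vault-backed secret scope at run time instead of keeping it in plain text in the notebook; the scope and key names below are hypothetical.

        # Read credentials from a Key Vault-backed secret scope at run time.
        # "link-keyvault", "sql-user", and "sql-password" are hypothetical names.
        jdbc_user = dbutils.secrets.get(scope="link-keyvault", key="sql-user")
        jdbc_password = dbutils.secrets.get(scope="link-keyvault", key="sql-password")

        # The values are redacted in notebook output, so nothing sensitive is
        # displayed or stored in the notebook itself.
        connection_properties = {"user": jdbc_user, "password": jdbc_password}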
  38. Working with complex nested JSON and XML sucks ▪ It sounds so simple and works great in the one-level examples – in the real world, when something is nested and duplicated, or missing entirely from a record several levels deep, and usually in structs, it sucks ▪ Structs versus arrays – we ended up having to convert structs to arrays all the time ▪ Use the cardinality function a lot to determine whether there is anything in an array ▪ Use the concat_ws trick in SQL if you are not sure whether you ended up with an array or a string in your data
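     A small sketch of the cardinality and concat_ws tricks mentioned above, written as Spark SQL from a Python cell; the table and column names (lab_results, variants, gene) are hypothetical.

        # cardinality() returns the number of elements in an array column, which
        # is a simple way to test whether a deeply nested array actually has data.
        non_empty = spark.sql("""
            SELECT report_id
            FROM lab_results
            WHERE cardinality(variants) > 0
        """)

        # concat_ws() accepts either a string or an array of strings, so it is a
        # safe way to flatten a column when you are not sure which type a given
        # record ended up with.
        flattened = spark.sql("""
            SELECT report_id, concat_ws(';', gene) AS gene_list
            FROM lab_results
        """)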
  39. Tips and tricks? ▪ Databricks only reads a blob type of Block Blob. Any other type means that Databricks does not even see the directory – that took a fair bit of time to uncover when one of our vendors uploaded a new set of files with the wrong blob type without realizing it. ▪ We ended up using Data Factory a lot less than we thought – ODBC connections worked well, except for Oracle, which we never could get to work – it is the only thing still Sqooped nightly
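     When a vendor drop suddenly "disappears", one quick diagnostic is to list the blobs and check their type directly. The sketch below assumes the azure-storage-blob package is installed; the container, path, and secret names are hypothetical.

        # List vendor uploads and print each blob's type: anything that is not
        # BlockBlob will not be visible to Databricks.
        from azure.storage.blob import ContainerClient

        container = ContainerClient.from_connection_string(
            conn_str=dbutils.secrets.get(scope="link-keyvault", key="storage-conn"),
            container_name="vendor-dropbox",   # hypothetical container name
        )
        for blob in container.list_blobs(name_starts_with="incoming/"):
            print(blob.name, blob.blob_type)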
  40. Code Snips I used all the time ▪ %python pythonDF.write.mode("overwrite").saveAsTable("pythonTable") ▪ %scala val ScalaDF = spark.table("pythonTable") ▪ If you need a table from a JDBC source to use in SQL: ▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties) ▪ JDBCTableDF.write.mode("overwrite").saveAsTable("JDBCTableNameTbl") ▪ If you suddenly cannot write out a table: ▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true) I am no expert – but I ended up using these all the time
  41. Code Snips I used all the time ▪ Save tables between notebooks – use REFRESH TABLE at the start of the new notebook to grab the latest version ▪ The null problem – use the cast function to save yourself from Parquet. I am no expert – but I ended up using these all the time
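     A minimal sketch of both tips, with hypothetical table and column names.

        # In the downstream notebook, refresh the shared table first so cached
        # metadata is discarded and the latest version written upstream is read.
        spark.sql("REFRESH TABLE genomic_summary")
        summary = spark.table("genomic_summary")

        # The null problem: a column built from NULLs gets a void type that
        # Parquet-backed tables cannot use downstream unless it is cast first.
        from pyspark.sql.functions import lit
        summary = summary.withColumn("review_notes", lit(None).cast("string"))
        summary.write.mode("overwrite").saveAsTable("genomic_summary_out")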
  42. Business Impact ▪ More stable infrastructure ▪ Lower costs ▪ Results come in faster ▪ Easier to add additional labs ▪ Easier to troubleshoot when there are issues ▪ Increases in volume handled easily ▪ Self-service for end users means no IAS intervention
  43. Thanks! Dr. Derek Ragavan, Carol Farhangfar, Nury Steuerwald, Jai Patel; Chris Danzi, Lance Richey, Scott Blevins; Andrea Bouronich, Stephanie King, Melanie Bamberg, Stacy Harris; Kelly Jones and his team; all the data and system owners who let us access their data; all the Microsoft support folks who helped us push to the edge; and of course Databricks
  44. Questions?
  45. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
