SlideShare a Scribd company logo
All In: Migrating a Genomics
Pipeline from BASH/Hive to Spark
and Azure Databricks—A Real
World Case Study
Victoria Morris
Unicorn Health Bridge Consulting working for Atrium Health
Agenda
Victoria Morris
▪ Overview Link
▪ Issues – why change?
▪ Next Moves
▪ Migration Starting Small Pharmacogenomics
Pipeline
▪ Clinical Trials Matching Pipeline
▪ The Great Migration Hive-> Databricks
▪ Things we Learned
▪ Business Impact
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Overview LInk
Original Problem Statement(s)
▪ Genomic reports are hard to find in the Electronic Medical Record (EMR)
▪ The reports are difficult to read (++ pages) are different from each lab, may not
have relevant recommendations and require manual efforts to summarize
▪ Presenting relevant Clinical Trails to providers when making treatment decisions
will increase Clinical Trial participation
▪ As a Center of Excellence(COE) for the American Society of Clinical Oncology
(ASCO)’s Targeted Agent and Profiling Utilization Registry (TAPUR) Clinical trial,
clinical outcomes and treatment data must be reported back to the COE for
patients enrolled in the studies
▪ Current process is complicated, time consuming and manual
Overview
▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide
interoperability of data between different LCI data sources
▪ Specifically to address the multiple data silo’s, that contain related data, which is a
consistent challenge across the System
▪ Data meaning, must be transferred, not just values
▪ Apple: Fruit vs. Computer
▪ Originally we had 4 people, and we all had day jobs
Specialized External testing
Testing Results
PDF’s, results and
Raw Sequence data in
PDF, Clinical Decision Support Out
(External –sftp/data factory)
Clinical
Trails
Management
Software
(On-Premise-
soon to be
Cloud)
EMR
Clinical Data
(Cerner reporting
Database/EDW)
EAPathways embedded in
Cerner
via SMART/FHIR
Genomic results and
PDF reports
via Tier 1 SharePoint
for molecular tumor
board review
Converting Raw Reads to
Genotype-> Phenotype and
generating report for Provider
LCI
Encounter
Data
(EDW)
LInK
Unstructured Notes
(e.g. Cerner reporting
Database)
EAPathways
Database
(On-premise
DB)
Integration
Office
365
(External-
API)
POC
Clinical
Decisio
n
Support
Clinical
Trials
Matching
Pharmacogenomics
Specialized Internal testing
Testing Results and
Raw Sequence data in PDF
out
(internal)
Frd1Storage
Netezza Cloud
Azure
On-Premise Databases
EDW
EaPathways
Oncore
External Labs
Caris
Inivata
FMI
Azure Storage
• Cerner
• EPIC
• CRSTAR
On-Premise Lab
Genomics Lab
LInK Data connections – High Level
Clinical Trials
Management
Clinical Decision
Supprt
Enterprise Data
Warehouse
ARIA
Genomic Pipelines-
Auto-generate by WebApps
Radiation
Treatments
CoPath
Pathology
MS Web
Apps
MS
SharePoint
Designer
Frd1Storage
Netezza Cloud
Azure
On-Premise Databases
EDW
EaPathways
Oncore
External Labs
Tempus Caris
Inivata
FMI
External Vendor÷s
Containers
Azure Storage
Azure Storage
• Cerner
• EPIC
• CRSTAR
On-Premise Lab
Genomics Lab
LInK Data connections – High Level
Clinical Trials
Management
Clinical Decision
Supprt
Enterprise Data
Warehouse
ARIA
Genomic Pipelines
PharmacoGenomics
Radiation
Treatments
CoPath
Pathology
Issues
Issues
▪ We run 365 days a year
▪ The Data is used in real time by providers to make clinical decisions for
patient treatment for Cancer any breakdown in the pipeline is a
Priority 1 fix that needs to be fixed as soon as possible
▪ We were early adopters of HDI – this server has been up since 2016 – it
is old technology and HDI was not built for servers to live this long.
Issues cont’d
▪ Randomly the cluster would freeze and go into SAFE mode – with no
warning, this happened on a weekly basis often several days, in a row
during the overnight batch.
▪ We were past the default allocated 10,000 tez counters and had to
change the runs to constantly run with additional ones, back at
around 3,000 lines of Hive code.
▪ Although we tried using Matrix manipulation in hive– at some point you
just need a loop.
Issues cont’d
▪ The costs to have the HDI cluster up 24x365 was very expensive, we
scaled it up and down to help reduced costs.
▪ The cluster was not stable, because we were scaling up and scaling
down everyday, at one point there so many logs on the daily scaling it
took the entire HDI cluster down.
Issues cont’d
▪ Twice the cluster went down so bad and so hard MS Support’s
response was destroy it and start again, which we did the first time…
▪ The HDI server choice-dichotomy to HiveV2 had forced us into not
allowing vectorized execution– we had to constantly set
hive.vectorized.execution.enabled=false; through out the script
because it would “forget” and which was slowing down processing.
Next moves
Search
▪ We wanted something that was cheaper
▪ We wanted to keep our old wasbi storage – not have to migrate the
datalake
▪ We wanted flexibility in language options for on-going operations and
continuity of care we did not want to get boxed into just one
▪ We wanted something less agnostic, more fully integrated into the
Microsoft eco-system
Search cont’d
▪ We needed it to be HIPAA compliant because we were working with
Patient data.
▪ We needed something that was self sufficient with the Cluster
management so we could concentrate on the programming aspect
instead of infrastructure.
▪ We really liked the Notebook concept – and had started experimenting
with Jupiter notebooks inside HDI
Frd1Storage
Netezza Cloud
Azure
On-Premise Databases
EDW
EaPathways
Oncore
External Labs
Tempus Caris
Inivata
FMI
External Vendor÷s
Containers
Azure Storage
Azure Storage
• Cerner
• EPIC
• CRSTAR
On-Premise Lab
Genomics Lab
LInK Data connections – High Level
Clinical Trials
Management
Clinical Decision
Supprt
Enterprise Data
Warehouse
ARIA
Genomic Pipelines
PharmacoGenomics
Radiation
Treatments
CoPath
Pathology
Migration
Migration – starting small
▪ There is a large steep learning curve to get into the databricks
▪ We had a new project the second pipeline that had to be built and it
seemed easier to start with something smaller than the 8000 lines of
Hive code that would be required if we started transitioning the
original pipeline.
Pharmacogenomics In progress
Pharmacogenomics
We receive raw Genomic test
results from our internal lab
Pharmacogenomics
Single Notebook
Overview Genomic Clinical Trials Pipeline
--------------------
Clinical Trial Match Criteria
Age (today’s) Gender
First line eligible(no
previous anti-
neoplastics
ordered)
Genomic Results
(over 1290 genes)
Diagnosis Tumor Site
Secondary Gene
results
Must have/not have
a specific protein
change/mutation
Previous Lab
results
Previous
Medications
Opening Screen
Frd1Storage
Netezza Cloud
Azure
On-Premise Databases
EDW
EaPathways
Oncore
External Labs
Tempus Caris
Inivata
FMI
External Vendor÷s
Containers
Azure Storage
Azure Storage
• Cerner
• EPIC
• CRSTAR
On-Premise Lab
Genomics Lab
LInK Data connections – High Level
Clinical Trials
Management
Clinical Decision
Supprt
Enterprise Data
Warehouse
ARIA
Genomic Pipelines
PharmacoGenomics
Radiation
Treatments
CoPath
Pathology
The Great Migration
Process Tempus
files
Process Caris
files Process FMI files
Process Inivata
files
Main Match
Create Summary
Preprocess each
lab into similar
data format
Create Clinical Matches
Create Genomic Summary,
combine with matches an
save to database
1
2
3
Hive
Conversion
Initial Definitions
▪ Bulleted list
▪ Bulleted list
▪ Bulleted list
▪ Bulleted list
DatabricksHive
Reading the file
▪ Not a separate step in Hive part of the next step ▪ Bulleted list
▪ Bulleted list
DatabricksHive
Creating a clean view of the data
▪ Bulleted list
▪ Bulleted list
▪ Bulleted list
▪ Bulleted list
DatabricksHive
Databricks by the numbers
▪ We work in a Premium Workspace, using our internal ip addresses
inside a secured subnet inside the Atrium Health Azure Subscription
▪ Databricks is fully HIPPA compliant
▪ Clusters are created with predefined tags and costs associated to
each tagged cluster’s run can be separated out
▪ Our data lake is ~110 terabytes
▪ We have 2.3+ million gene results x 240+ CTC to match against 10
criteria
▪ Yes even during COVID-19 we are still seeing an average of 1 new
report a day –
We still run 365 a year
Things we learned
Azure Key Vaults and Back-up
▪ Azure Key Vaults are tricky to implement and you only need to do the
connection on a new workspace – so save those instructions!
▪ But these are a very secure way to save all your connection info
without having it in plain text on the notebook itself.
▪ Do not forget to save a copy of everything periodically offline –if your
workspace goes you lose all the notebooks and any manually uploaded
data tables…
▪ Yes we have had to replace the workspace twice in this project
Working with complex nested Json and XML sucks
▪ It sounds so simple in the examples and works great in the simple 1 level
examples – real world when something is nested and duplicated or
missing entirely from that record several levels deep and usually in
structs -it sucks
▪ Struct versus arrays- we ended-up having to convert structs to arrays all
the time
▪ Use the cardinality function a lot to determine if there was anything in an
array
▪ The concat_ws trick if you are not sure if ended up with an array or a
string in a sql in your data
Tips and tricks?
▪ Databricks only reads a Blob Type of Block blob. Any other type means
that databricks does not even see the directory – that took a fair bit to
uncover when one of our vendors uploaded a new set of files in the
wrong block type without realizing it.
▪ We ended up using data factory a lot less than we thought –odbc
connections worked well except for Oracle we never could get that to
work – it is the only thing still sqooped nightly
Code Snips I used all the time
▪ %python pythonDF.write.mode(“overwrite”).saveAsTable(“pythonTable”)
▪ %scala val ScalaDF= spark.read($“pythonTable”)
▪ If you need a table from a JDBC source to use in SQL:
▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties)
▪ JDBCTableDF.write.mode("overwrite").saveAsTable(" JDBCTableNameTbl")
▪ If you suddenly cannot write out a table:
▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true)
I am no expert – but I ended up using these all the time
Code Snips I used all the time
▪ Save tables between notebooks – use REFERSH table at the start of
the new notebook to grab the latest version
▪ The null problem – using the cast function to save yourself from
Parquet
I am no expert – but I ended up using these all the time
Business Impact
▪ More stable infrastructure
▪ Lower costs
▪ Results come in faster
▪ Easier to add additional labs
▪ Easier to troubleshoot when there are issues
▪ Increase in volume handled easily
▪ Self-service for end-users means no IAS intervention
Thanks!
Dr Derek Ragavan,
Carol Farhangfar, Nury Steuerwald, Jai Patel
Chris Danzi, Lance Richey, Scott Blevins
Andrea Bouronich, Stephanie King, Melanie Bamberg,
Stacy Harris
Kelly Jones and his team
All the data and system owners who let us access their data
All the Microsoft support folks who helped us push to the edge
And of course Databricks
Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

More Related Content

What's hot

Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
Vivek Aanand Ganesan
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Spark Summit
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big Problems
TechWell
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
Gwen (Chen) Shapira
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
DataWorks Summit
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma Presentation
Zaloni
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
Edureka!
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedIn
OSCON Byrum
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
RTTS
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
Maulik Thaker
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
Stavros Kontopoulos
 
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Rehgan Avon
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
Databricks
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
Novita Sari
 
Flexible Design
Flexible DesignFlexible Design
Flexible Design
Gwen (Chen) Shapira
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
Humza Naseer
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
DataWorks Summit/Hadoop Summit
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 

What's hot (20)

Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big Problems
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma Presentation
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedIn
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Flexible Design
Flexible DesignFlexible Design
Flexible Design
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 

Similar to All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
Bonnie Hurwitz
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
AWS Chicago
 
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity HardwareBig Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
Laurens De Vocht
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
confluent
 
Ceph - High Performance Without High Costs
Ceph - High Performance Without High CostsCeph - High Performance Without High Costs
Ceph - High Performance Without High Costs
Jonathan Long
 
Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013
Brock Heinz
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
RTTS
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...
Ola Spjuth
 
Leaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldLeaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real World
ArmonDadgar
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
Chris Dwan
 
DockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopDockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability Workshop
Kevin Crawley
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
Philip Cheung
 
Qiagram
QiagramQiagram
Qiagram
jwppz
 
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
OSTHUS
 
The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015
Chip Childers
 
From allotrope to reference master data management
From allotrope to reference master data management From allotrope to reference master data management
From allotrope to reference master data management
OSTHUS
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
DATAVERSITY
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Yahoo Developer Network
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
Precisely
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217
lyarmey
 

Similar to All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study (20)

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity HardwareBig Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
 
Ceph - High Performance Without High Costs
Ceph - High Performance Without High CostsCeph - High Performance Without High Costs
Ceph - High Performance Without High Costs
 
Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...
 
Leaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldLeaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real World
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
DockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopDockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability Workshop
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
 
Qiagram
QiagramQiagram
Qiagram
 
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
 
The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015
 
From allotrope to reference master data management
From allotrope to reference master data management From allotrope to reference master data management
From allotrope to reference master data management
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 

Recently uploaded (20)

一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 

All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study

  • 1.
  • 2. All In: Migrating a Genomics Pipeline from BASH/Hive to Spark and Azure Databricks—A Real World Case Study Victoria Morris Unicorn Health Bridge Consulting working for Atrium Health
  • 3. Agenda Victoria Morris ▪ Overview Link ▪ Issues – why change? ▪ Next Moves ▪ Migration Starting Small Pharmacogenomics Pipeline ▪ Clinical Trials Matching Pipeline ▪ The Great Migration Hive-> Databricks ▪ Things we Learned ▪ Business Impact
  • 4. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 6. Original Problem Statement(s) ▪ Genomic reports are hard to find in the Electronic Medical Record (EMR) ▪ The reports are difficult to read (++ pages) are different from each lab, may not have relevant recommendations and require manual efforts to summarize ▪ Presenting relevant Clinical Trails to providers when making treatment decisions will increase Clinical Trial participation ▪ As a Center of Excellence(COE) for the American Society of Clinical Oncology (ASCO)’s Targeted Agent and Profiling Utilization Registry (TAPUR) Clinical trial, clinical outcomes and treatment data must be reported back to the COE for patients enrolled in the studies ▪ Current process is complicated, time consuming and manual
  • 7. Overview ▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide interoperability of data between different LCI data sources ▪ Specifically to address the multiple data silo’s, that contain related data, which is a consistent challenge across the System ▪ Data meaning, must be transferred, not just values ▪ Apple: Fruit vs. Computer ▪ Originally we had 4 people, and we all had day jobs
  • 8. Specialized External testing Testing Results PDF’s, results and Raw Sequence data in PDF, Clinical Decision Support Out (External –sftp/data factory) Clinical Trails Management Software (On-Premise- soon to be Cloud) EMR Clinical Data (Cerner reporting Database/EDW) EAPathways embedded in Cerner via SMART/FHIR Genomic results and PDF reports via Tier 1 SharePoint for molecular tumor board review Converting Raw Reads to Genotype-> Phenotype and generating report for Provider LCI Encounter Data (EDW) LInK Unstructured Notes (e.g. Cerner reporting Database) EAPathways Database (On-premise DB) Integration Office 365 (External- API) POC Clinical Decisio n Support Clinical Trials Matching Pharmacogenomics Specialized Internal testing Testing Results and Raw Sequence data in PDF out (internal)
  • 9. Frd1Storage Netezza Cloud Azure On-Premise Databases EDW EaPathways Oncore External Labs Caris Inivata FMI Azure Storage • Cerner • EPIC • CRSTAR On-Premise Lab Genomics Lab LInK Data connections – High Level Clinical Trials Management Clinical Decision Supprt Enterprise Data Warehouse ARIA Genomic Pipelines- Auto-generate by WebApps Radiation Treatments CoPath Pathology MS Web Apps MS SharePoint Designer
  • 10. Frd1Storage Netezza Cloud Azure On-Premise Databases EDW EaPathways Oncore External Labs Tempus Caris Inivata FMI External Vendor÷s Containers Azure Storage Azure Storage • Cerner • EPIC • CRSTAR On-Premise Lab Genomics Lab LInK Data connections – High Level Clinical Trials Management Clinical Decision Supprt Enterprise Data Warehouse ARIA Genomic Pipelines PharmacoGenomics Radiation Treatments CoPath Pathology
  • 12. Issues ▪ We run 365 days a year ▪ The Data is used in real time by providers to make clinical decisions for patient treatment for Cancer any breakdown in the pipeline is a Priority 1 fix that needs to be fixed as soon as possible ▪ We were early adopters of HDI – this server has been up since 2016 – it is old technology and HDI was not built for servers to live this long.
  • 13. Issues cont’d ▪ Randomly the cluster would freeze and go into SAFE mode – with no warning, this happened on a weekly basis often several days, in a row during the overnight batch. ▪ We were past the default allocated 10,000 tez counters and had to change the runs to constantly run with additional ones, back at around 3,000 lines of Hive code. ▪ Although we tried using Matrix manipulation in hive– at some point you just need a loop.
  • 14. Issues cont’d ▪ The costs to have the HDI cluster up 24x365 was very expensive, we scaled it up and down to help reduced costs. ▪ The cluster was not stable, because we were scaling up and scaling down everyday, at one point there so many logs on the daily scaling it took the entire HDI cluster down.
  • 15. Issues cont’d ▪ Twice the cluster went down so bad and so hard MS Support’s response was destroy it and start again, which we did the first time… ▪ The HDI server choice-dichotomy to HiveV2 had forced us into not allowing vectorized execution– we had to constantly set hive.vectorized.execution.enabled=false; through out the script because it would “forget” and which was slowing down processing.
  • 17. Search ▪ We wanted something that was cheaper ▪ We wanted to keep our old wasbi storage – not have to migrate the datalake ▪ We wanted flexibility in language options for on-going operations and continuity of care we did not want to get boxed into just one ▪ We wanted something less agnostic, more fully integrated into the Microsoft eco-system
  • 18. Search cont’d ▪ We needed it to be HIPAA compliant because we were working with Patient data. ▪ We needed something that was self sufficient with the Cluster management so we could concentrate on the programming aspect instead of infrastructure. ▪ We really liked the Notebook concept – and had started experimenting with Jupiter notebooks inside HDI
  • 19. Frd1Storage Netezza Cloud Azure On-Premise Databases EDW EaPathways Oncore External Labs Tempus Caris Inivata FMI External Vendor÷s Containers Azure Storage Azure Storage • Cerner • EPIC • CRSTAR On-Premise Lab Genomics Lab LInK Data connections – High Level Clinical Trials Management Clinical Decision Supprt Enterprise Data Warehouse ARIA Genomic Pipelines PharmacoGenomics Radiation Treatments CoPath Pathology
  • 21. Migration – starting small ▪ There is a large steep learning curve to get into the databricks ▪ We had a new project the second pipeline that had to be built and it seemed easier to start with something smaller than the 8000 lines of Hive code that would be required if we started transitioning the original pipeline.
  • 23. Pharmacogenomics We receive raw Genomic test results from our internal lab
  • 24.
  • 25.
  • 26.
  • 28. Overview Genomic Clinical Trials Pipeline
  • 30. Clinical Trial Match Criteria Age (today’s) Gender First line eligible(no previous anti- neoplastics ordered) Genomic Results (over 1290 genes) Diagnosis Tumor Site Secondary Gene results Must have/not have a specific protein change/mutation Previous Lab results Previous Medications
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37. Frd1Storage Netezza Cloud Azure On-Premise Databases EDW EaPathways Oncore External Labs Tempus Caris Inivata FMI External Vendor÷s Containers Azure Storage Azure Storage • Cerner • EPIC • CRSTAR On-Premise Lab Genomics Lab LInK Data connections – High Level Clinical Trials Management Clinical Decision Supprt Enterprise Data Warehouse ARIA Genomic Pipelines PharmacoGenomics Radiation Treatments CoPath Pathology
  • 39. Process Tempus files Process Caris files Process FMI files Process Inivata files Main Match Create Summary Preprocess each lab into similar data format Create Clinical Matches Create Genomic Summary, combine with matches an save to database 1 2 3
  • 41. Initial Definitions ▪ Bulleted list ▪ Bulleted list ▪ Bulleted list ▪ Bulleted list DatabricksHive
  • 42. Reading the file ▪ Not a separate step in Hive part of the next step ▪ Bulleted list ▪ Bulleted list DatabricksHive
  • 43. Creating a clean view of the data ▪ Bulleted list ▪ Bulleted list ▪ Bulleted list ▪ Bulleted list DatabricksHive
  • 44.
  • 45. Databricks by the numbers ▪ We work in a Premium Workspace, using our internal ip addresses inside a secured subnet inside the Atrium Health Azure Subscription ▪ Databricks is fully HIPPA compliant ▪ Clusters are created with predefined tags and costs associated to each tagged cluster’s run can be separated out ▪ Our data lake is ~110 terabytes ▪ We have 2.3+ million gene results x 240+ CTC to match against 10 criteria ▪ Yes even during COVID-19 we are still seeing an average of 1 new report a day – We still run 365 a year
  • 47. Azure Key Vaults and Back-up ▪ Azure Key Vaults are tricky to implement and you only need to do the connection on a new workspace – so save those instructions! ▪ But these are a very secure way to save all your connection info without having it in plain text on the notebook itself. ▪ Do not forget to save a copy of everything periodically offline –if your workspace goes you lose all the notebooks and any manually uploaded data tables… ▪ Yes we have had to replace the workspace twice in this project
  • 48. Working with complex nested Json and XML sucks ▪ It sounds so simple in the examples and works great in the simple 1 level examples – real world when something is nested and duplicated or missing entirely from that record several levels deep and usually in structs -it sucks ▪ Struct versus arrays- we ended-up having to convert structs to arrays all the time ▪ Use the cardinality function a lot to determine if there was anything in an array ▪ The concat_ws trick if you are not sure if ended up with an array or a string in a sql in your data
  • 49. Tips and tricks? ▪ Databricks only reads a Blob Type of Block blob. Any other type means that databricks does not even see the directory – that took a fair bit to uncover when one of our vendors uploaded a new set of files in the wrong block type without realizing it. ▪ We ended up using data factory a lot less than we thought –odbc connections worked well except for Oracle we never could get that to work – it is the only thing still sqooped nightly
  • 50. Code Snips I used all the time ▪ %python pythonDF.write.mode(“overwrite”).saveAsTable(“pythonTable”) ▪ %scala val ScalaDF= spark.read($“pythonTable”) ▪ If you need a table from a JDBC source to use in SQL: ▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties) ▪ JDBCTableDF.write.mode("overwrite").saveAsTable(" JDBCTableNameTbl") ▪ If you suddenly cannot write out a table: ▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true) I am no expert – but I ended up using these all the time
  • 51. Code Snips I used all the time ▪ Save tables between notebooks – use REFERSH table at the start of the new notebook to grab the latest version ▪ The null problem – using the cast function to save yourself from Parquet I am no expert – but I ended up using these all the time
  • 52. Business Impact ▪ More stable infrastructure ▪ Lower costs ▪ Results come in faster ▪ Easier to add additional labs ▪ Easier to troubleshoot when there are issues ▪ Increase in volume handled easily ▪ Self-service for end-users means no IAS intervention
  • 53. Thanks! Dr Derek Ragavan, Carol Farhangfar, Nury Steuerwald, Jai Patel Chris Danzi, Lance Richey, Scott Blevins Andrea Bouronich, Stephanie King, Melanie Bamberg, Stacy Harris Kelly Jones and his team All the data and system owners who let us access their data All the Microsoft support folks who helped us push to the edge And of course Databricks
  • 55. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.