Empowering Real Time Patient Care Through Spark Streaming

Takeda’s Plasma Derived Therapies (PDT) business unit has recently embarked on a project to use Spark Streaming on Databricks to empower how it delivers value to its plasma donation centers. As patients come in and interact with our clinics, we store and track all of the patient interactions in real time and deliver outputs and results based on those interactions. The problem with our existing architecture is that it is very expensive to maintain and has an unsustainable number of failure points. Spark Streaming is essential to this use case because it allows for a more robust ETL pipeline: with it, we are able to replace our existing ETL processes (based on Lambdas, Step Functions, triggered jobs, etc.) with a purely stream-driven architecture.

Data is brought into our S3 raw layer as a large set of CSV files by AWS DMS and Informatica IICS, which move data from on-prem systems into our cloud layer. A running stream picks up these raw files and merges them into Delta tables in the bronze/stage layer, with AWS Glue serving as the metadata provider for all of these operations. From the stage layer, another set of streams uses the stage Delta tables as its source, transforming records and conducting stream-to-stream lookups before writing the enriched records into RDS (the silver/prod layer). Once the data has been merged into RDS, a DMS task lifts it back into S3 as CSV files, and a small intermediary stream merges these CSV files into corresponding Delta tables, from which our gold/analytic streams read. The on-prem systems are able to speak to the silver layer, which provides the near real-time latency that our patient care centers require.
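As a rough illustration of the first two hops described above, here is a minimal PySpark sketch: streaming the raw CSV files into a bronze Delta table, then streaming from bronze, enriching with a lookup, and writing each micro-batch to RDS over JDBC inside foreachBatch. All paths, schemas, table names, and connection settings are hypothetical placeholders, and a stream-static join stands in for the stream-to-stream lookups mentioned above.

```python
# Sketch of the raw -> bronze and bronze -> silver (RDS) hops described above.
# Paths, schemas, table names, and JDBC settings are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hop 1: stream the raw CSV files (landed by DMS / Informatica IICS) into a bronze Delta table.
donation_schema = StructType([
    StructField("donation_id", StringType()),
    StructField("donor_id", StringType()),
    StructField("center_id", StringType()),
    StructField("event_ts", TimestampType()),
])

raw_stream = (
    spark.readStream
    .format("csv")
    .option("header", "true")
    .schema(donation_schema)
    .load("s3://pdt-raw/donations/")           # hypothetical raw-layer path
)

(raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://pdt-bronze/_checkpoints/donations")
    .outputMode("append")
    .start("s3://pdt-bronze/donations"))       # hypothetical bronze Delta path

# Hop 2: read the bronze Delta table as a stream, enrich it with a lookup,
# and write each micro-batch to the RDS (silver/prod) database over JDBC.
centers = spark.read.format("delta").load("s3://pdt-bronze/centers")  # static lookup table

enriched = (
    spark.readStream.format("delta").load("s3://pdt-bronze/donations")
    .join(centers, "center_id")
)

def write_to_rds(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://rds-host:5432/pdt")   # placeholder endpoint
        .option("dbtable", "prod.donations_enriched")
        .option("user", "svc_user")
        .option("password", "***")
        .mode("append")
        .save())

(enriched.writeStream
    .foreachBatch(write_to_rds)
    .option("checkpointLocation", "s3://pdt-silver/_checkpoints/donations_enriched")
    .start())
```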


1. Empowering PDT Analytics through Databricks & Spark Structured Streaming (05/05/2021)
   Presenters: Arnav Chaudhary (he/him), Digital Product Manager, Takeda; Jonathan E. Yee (he/him), Data and Analytics Executive, EY; Jeff Cubeta (they/them), Clinical Intelligence Executive, Algernon Solutions
2. Databricks on Takeda’s Enterprise Data Backbone
   The EDB (Enterprise Data Backbone) is Takeda’s integrated data platform responsible for combining global data assets into a single source of truth and enabling tools to provide insights via analytics.
   • Domains: Data Ingestion, Data Processing, Advanced Analytics
   • Global Regions: US, Europe, Japan
   • Applications: Python, R Studio
   • Specialized Deployment: MIT researchers as an ongoing collaboration
   Databricks on AWS is used heavily by Takeda across the business: 200,000 DBUs of monthly compute, 600+ monthly active users, 50+ validated schemas with 100s of tables, and 15 advanced analytics teams using Databricks.
3. PDT Analytics Program
   What are we solving for?
   • Drive improved plasma yield
   • Increased access to a greater volume of plasma donors
   • Reduce cost per liter
   • Harvest the value of PDT’s data assets
   Expected Outcomes
   • Increase yield by improving retention and the conversion funnel
   • Gain access to a larger share of the donor market to reduce CPL
   • Reduce manual processes and increase automation to improve operations efficiency
   • Improved data, analytics, and process layers for PDT analytics
4. PDT Donor Portal Application & Analytics Foundation
   Previously…
   • Existing 153 disparate center systems
   • Reliance on 3rd party for marketing insights
   • Manual report generation
   • Lack of real-time information for quick decision making
   • Reactive decision-making process
   Going forward…
   • Consolidated data into one operational data store (ODS)
   • Near real-time data transmission
   • Data lake to store years of information
   • Analytics platform allowing data scientists to perform data mining, create predictive models, and generate actionable insights
   • Reduction of manual reports
   PDT/BioLife Data Backbone
   • PDT is the pioneer using the newly developed Takeda Enterprise Data Backbone Platform in the cloud
   • Supporting analytics, operational use, and other products’ data needs (e.g. Donor Engagement, Fuji Innovation Engine)
5. Three Key Pain Points with PDT Data Analytics: Data Isolation, Latency of Analytics, Narrow Audience
   • Daily Batch Jobs, Manual Report Generation, Limited Access to Data
   • Structured Typed SQL Data, API Returned JSON, Scheduled CSV Uploads
   • 4 Enterprise Data Systems, 151 collection centers, 250 SQL Tables, ~1 TB Historic Data, ~0.5 GB/Hr Ongoing CDC
   We designed opportunities to drive value and address the core pain point themes for PDT:
   • Real Time Data: Spark Structured Streams; low latency data processing; standardized event streams to empower downstream apps
   • Unified Data Schema: single presence for Donors; cross-system relationships; business process data entities
   • Lakehouse Model: uniform ingestion process; configuration driven operations; S3 Delta Tables; data served to SQL DB for low latency, high volume querying
6. Unified Data Schema
7. Configuration Driven Process (a configuration-driven ingestion sketch follows the transcript)
8. Lakehouse Model: Key Design Details
   • Uniform ingestion platform
   • Improved accessibility to data
   • Delta Tables backing each layer
   • Structured Streams between layers
   • Support for big data analysis through serving Delta Tables
   • Support for high volume, low latency querying using SQL based tools
   • Extensible design to allow expansion
9. Real Time Data
10. Using foreachBatch to Fork and Serve Streaming CDC Data
    • Using the Delta Table merge construct within _serve
    • Writing the CDC stream
    • Within the foreachBatch function, we target multiple sinks: Delta Table, SQL Database, Event Bridge
    (a fork-and-serve sketch follows the transcript)
11. © 2019 Takeda Pharmaceutical Company Limited. All rights reserved.
    Thank you for attending! We will do our best to answer any questions.
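Slide 7 ("Configuration Driven Process") shows a diagram rather than code, so the following is only a hedged sketch of what a configuration-driven ingestion loop could look like: one generic function launched once per configured table. The config format, paths, and schemas are invented for illustration and are not taken from the deck.

```python
# Hypothetical illustration of configuration-driven stream launching:
# one generic ingestion function, one config entry per source table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ingestion_config = [
    {
        "source_path": "s3://pdt-raw/donors/",
        "target_path": "s3://pdt-bronze/donors",
        "checkpoint":  "s3://pdt-bronze/_checkpoints/donors",
        "schema":      "donor_id STRING, center_id STRING, updated_ts TIMESTAMP",
    },
    {
        "source_path": "s3://pdt-raw/donations/",
        "target_path": "s3://pdt-bronze/donations",
        "checkpoint":  "s3://pdt-bronze/_checkpoints/donations",
        "schema":      "donation_id STRING, donor_id STRING, event_ts TIMESTAMP",
    },
]

def start_ingestion_stream(spark, cfg):
    """Start one CSV -> Delta stream for a single configured table."""
    stream = (
        spark.readStream
        .format("csv")
        .option("header", "true")
        .schema(cfg["schema"])                 # DDL-style schema string from config
        .load(cfg["source_path"])
    )
    return (
        stream.writeStream
        .format("delta")
        .option("checkpointLocation", cfg["checkpoint"])
        .outputMode("append")
        .start(cfg["target_path"])
    )

# Launch one stream per configured table.
queries = [start_ingestion_stream(spark, cfg) for cfg in ingestion_config]
```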
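Slide 10 describes forking CDC micro-batches to multiple sinks inside foreachBatch. The sketch below shows that pattern under assumed names (including the `_serve` handler): a Delta merge into the serving table, a JDBC append to the SQL database, and an EventBridge publish for downstream apps. Table names, keys, endpoints, and the event bus are placeholders, not the actual implementation.

```python
# Sketch of forking a CDC micro-batch to multiple sinks inside foreachBatch.
# Table names, keys, endpoints, and the event bus name are illustrative.
import json
import boto3
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def _serve(batch_df, batch_id):
    # 1. Merge the CDC batch into the serving Delta table (upsert on the key).
    target = DeltaTable.forPath(spark, "s3://pdt-gold/donations")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.donation_id = s.donation_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # 2. Serve the same batch to the SQL database for low-latency querying.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://rds-host:5432/pdt")   # placeholder endpoint
        .option("dbtable", "serve.donations")
        .option("user", "svc_user")
        .option("password", "***")
        .mode("append")
        .save())

    # 3. Publish change events to EventBridge for downstream applications.
    events = boto3.client("events")
    for row in batch_df.limit(10).collect():        # keep the example small
        events.put_events(Entries=[{
            "Source": "pdt.donations",
            "DetailType": "DonationChanged",
            "Detail": json.dumps(row.asDict(), default=str),
            "EventBusName": "pdt-events",            # hypothetical event bus
        }])

cdc_stream = spark.readStream.format("delta").load("s3://pdt-silver/donations_cdc")

(cdc_stream.writeStream
    .foreachBatch(_serve)
    .option("checkpointLocation", "s3://pdt-gold/_checkpoints/donations_serve")
    .start())
```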
