Using Airflow to speed up
development of data intensive tools
Blaine Elliott
Data Engineer @ One Medical
Twitter: @blainee
Airflow Summit
July 10th, 2020
Purpose of this talk?
● To demonstrate how Airflow can help you build
new tools
● Inspire others to do the same
Who am I?
● Data Engineer @ One Medical
● Formerly @ LinkedIn, Chegg, MySpace
Intro...
Proprietary and Confidential, One Medical
● A tool to detect data anomalies
● The architecture of this tool
...also how the tool communicates with Airflow
● How Airflow decreased the cost to develop this tool
What are we going to cover in this talk?
Airflow Summit - Summer 2020
● At One Medical, we consume and create a lot of data
● We want to find bad data before it’s passed on to analysts
● We’re lazy engineers
Setting up the problem...
● Needs to detect abnormal data
● Can scale to thousands of tables and columns
● Cost to develop the tool is minimized
Feature requirements for our Data Anomaly Detector (“DAD Tool”)
● The ability to do statistical analysis
● Storage to persist data & test results
● UI/UX to manage the tool, create tests, & analyze results
● Database interoperability
(authentication, communication)
● The ability to run thousands of tests per day
● Must be secure
(must pass a security audit)
What is needed to make this work?
Airflow steps…
1. Create dynamic DAGs
2. Tell Airflow to run our DAGs
3. Process the DAGs
4. Send results to the DAD Tool
The Data Anomaly Detector (“DAD Tool”)
Airflow Integration (4 steps)
1. Send SQL to Airflow, as a text file on S3
2. Send request to Airflow to process DAGs
3. Process DAGs
4. Send test results to DAD Tool, as a pickled file on S3
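The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the DAD Tool's actual code: an in-memory dict stands in for S3, the key layout (dad/requests and dad/results) and all function names are hypothetical, and step 2 (the request to Airflow) is collapsed into a direct function call.

```python
import pickle


class InMemoryObjectStore:
    """Stand-in for S3 (assumption): maps object keys to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]


def submit_test_sql(store, test_id, sql):
    """Step 1: the tool writes the test's SQL as a text object."""
    key = f"dad/requests/{test_id}.sql"  # hypothetical key layout
    store.put(key, sql.encode("utf-8"))
    return key


def process_request(store, request_key, run_query):
    """Steps 3 and 4: read the SQL, run it, and store the pickled rows."""
    sql = store.get(request_key).decode("utf-8")
    rows = run_query(sql)  # in Airflow this would be a task using a configured connection
    result_key = request_key.replace("requests", "results").replace(".sql", ".pkl")
    store.put(result_key, pickle.dumps(rows))
    return result_key


def load_results(store, result_key):
    """The tool reads the pickled results back for analysis."""
    return pickle.loads(store.get(result_key))


store = InMemoryObjectStore()
request_key = submit_test_sql(
    store, "bp_3sigma", "SELECT systolic_blood_pressure FROM patients"
)
# Step 2 would be a request to Airflow; here a fake query runner is called directly.
result_key = process_request(store, request_key, lambda sql: [118, 121, 125])
rows = load_results(store, result_key)
```

Only object keys need to pass between the two systems, so the payload size is bounded by the object store rather than by the channel used to trigger the run. Pickled files should only be loaded from a bucket that both sides trust, since unpickling can execute arbitrary code.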
1. User defines a test
Ex.: all values in a time series must be within X σ of the mean.
2. User applies the test to a column
Ex.: using our new test, set the threshold to 3σ and use the table patients
with the column systolic_blood_pressure for the most recent 90 days.
3. The DAD Tool + Airflow processes all the things
4. User analyzes results in the DAD Tool UI
Anatomy of a test
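A minimal sketch of the example test above (values within X σ of the mean), using only the Python standard library. The function name, the choice of population standard deviation, and the sample readings are assumptions made for illustration; the slides do not say how the DAD Tool computes it.

```python
from statistics import mean, pstdev


def sigma_outliers(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations from the mean of `values`."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


# Made-up readings: twenty normal values and one suspicious spike.
readings = [120] * 20 + [200]
print(sigma_outliers(readings))  # [(20, 200)]
```

With a single spike among n values, the spike's distance from the mean is at most sqrt(n - 1) population standard deviations, so a 3σ threshold can only fire once the window holds more than ten values. A 90-day window of daily readings is comfortably above that.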
● Needs to detect abnormal data
● Can scale to thousands of tables and columns
● Cost to develop the tool is minimized
Requirements Review
● The complexity of Airflow is hidden from users
● Using Airflow for part of the backend processing of the DAD Tool
significantly decreased development time
● Because Airflow was already actively used at One Medical, desirable
features already available in Airflow could be made available to the
DAD Tool
● Time that would have been spent building features already available in
Airflow was repurposed to improve the DAD Tool
Conclusions
● No need to manage database authentication
● Databases configured in Airflow are immediately available to the
DAD Tool
● Parallelism is managed by Airflow
● Throttling is managed by Airflow
● Since Airflow already passed our security audit, minimal effort was
needed to get approved to leverage Airflow in the DAD Tool
List of Airflow features that enable the DAD Tool
Q. Why not use XCom?
A. Using S3 (or any other object store) is stateful and fault tolerant, and it
avoids any limitations on how much data is being transferred.
Q. Is the DAD Tool open source?
A. Not currently but I am working towards that goal.
Answers to common questions
Thank you
Blaine Elliott
Sr Data Engineer @ One Medical
Twitter: @blainee
Airflow Summit
July 10th, 2020
