
Airflow at Scale


Know all about 'Airflow at Scale'. Gain insights from the webinar led by Sreenath Kamath & Sakshi Bansal, Data Engineers at Qubole.


  1. Building Scalable Robust Data Pipelines with Apache Airflow
  2. Agenda ❖ A brief introduction to Qubole ❖ Apache Airflow ❖ Operational Challenges in managing an ETL ❖ Alerts and Monitoring ❖ Quality Assurance in ETLs
  3. About Qubole Data Service ❖ A self-service platform for big data analytics. ❖ Delivers best-in-class Apache tools such as Hadoop, Hive, Spark, etc., integrated into an enterprise-feature-rich platform optimized to run in the cloud. ❖ Enables users to focus on their data rather than the platform.
  4. Data Team @ Qubole ❖ Data Warehouse for Qubole ❖ Provides Insights and Recommendations to users ❖ Just Another Qubole Account ❖ Enabling data-driven features within QDS
  5. Multi-Tenant Nature of the Data Team [Diagram: two Qubole distributions, Distribution 1 (api.qubole.com) and Distribution 2 (azure.qubole.com), each feeding its own Data Warehouse]
  6. Apache Airflow For ETL ❖ Developer Friendly ❖ A rich collection of Operators, CLI utilities and UI to author and manage your Data Pipeline. ❖ Horizontally Scalable. ❖ Tight Integration With Qubole
  7. DAG creation in Airflow
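The deck shows the DAG-creation code as a screenshot that the transcript does not capture. Below is a minimal sketch of what a DAG with Qubole tasks can look like; the DAG id, table names, and queries are illustrative, and the QuboleOperator shipped in Airflow's contrib package is assumed:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

dag = DAG(
    dag_id="example_etl",                 # hypothetical DAG name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

stage = QuboleOperator(
    task_id="stage_events",
    command_type="hivecmd",               # run a Hive query on Qubole
    query="INSERT OVERWRITE TABLE staging.events SELECT * FROM raw.events",
    cluster_label="default",
    dag=dag,
)

aggregate = QuboleOperator(
    task_id="aggregate_events",
    command_type="hivecmd",
    query="INSERT OVERWRITE TABLE warehouse.daily_counts "
          "SELECT dt, COUNT(*) FROM staging.events GROUP BY dt",
    cluster_label="default",
    dag=dag,
)

stage >> aggregate                        # aggregate runs only after staging succeeds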
  8. Operational Challenges in the ETL World ❖ How do we achieve continuous integration and deployment for ETLs? ❖ How do we effectively manage configuration for ETLs in a multi-tenant environment? ❖ How do we make ETLs aware of Data Warehouse migrations?
  9. Configuration Management
  10. IDEA! Use Airflow Variables for saving ETL configuration
  11. Airflow Variables for ETL Configuration ❖ Store the information as key-value pairs in Airflow. ❖ Extensive support via CLI, UI and API to manage the variables. ❖ Can be used from within the Airflow script as Variable.get("variable_name")
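A short sketch of reading such variables inside a DAG file (the key names are hypothetical); Variable.get can also deserialize a JSON blob, which is convenient for per-tenant configuration in a multi-tenant setup:

from airflow.models import Variable

# A plain string value
warehouse_db = Variable.get("etl_warehouse_db")

# A JSON value holding a whole config dict, e.g. {"region": ..., "schema": ...}
etl_config = Variable.get("etl_config", deserialize_json=True)

# Variables can also be managed outside the script, e.g. from the CLI:
#   airflow variables --set etl_warehouse_db analytics_prod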
  12. Warehouse Management ❖ A leaf out of Ruby on Rails' book: Active Record Migrations. ❖ Each migration is tagged and committed as a single commit to version control along with the ETL changes.
  13. The Process Is Easy: fetch the current migration number from Airflow Variables → check out the target tag from version control → run any new relevant migrations → update the migration number.
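A sketch of that loop in code, with all names hypothetical: the current migration number lives in an Airflow Variable, and only migrations newer than it are applied, in the spirit of Active Record migrations:

from airflow.models import Variable

def run_pending_migrations(migrations):
    """Apply migrations newer than the recorded number.

    `migrations` is a list of (number, callable) pairs sorted ascending,
    checked out from version control at the target tag.
    """
    current = int(Variable.get("warehouse_migration_number", default_var="0"))
    for number, migrate in migrations:
        if number > current:
            migrate()                                            # apply the schema change
            Variable.set("warehouse_migration_number", number)   # record progress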
  14. Deployment ❖ Traditional deployment is too messy when multiple users are handling Airflow. ❖ Data Apps for ETL deployment. ❖ Provides a CLI option like <ETL_NAME> deploy -r <version_tag> -d <start_date> ❖ Steps: check out the Airflow template file from version control → read the config values from Airflow and translate them → copy the final script file to the Airflow directory.
  15. Alerts And Monitoring.
  16. DAG in Qubole ❖ This graph has 90+ operators! ❖ 8-9 different types. ❖ Clearly, error-prone!
  17. DATA QUALITY ISSUES ❖ Missing Data ❖ Data Corruption ❖ Data Duplication ❖ System Issues
  18. IMPORTANCE OF DATA VALIDATION ❖ An application's correctness depends on the correctness of its data. ❖ Increase confidence in data by quantifying data quality. ❖ Correcting existing data can be expensive - prevention is better than cure! ❖ Stop critical downstream tasks if the data is invalid.
  19. TREND MONITORING ❖ Monitor dips, peaks, anomalies. ❖ Hard problem! ❖ Not real time. ❖ One size doesn't fit all - different ETLs manipulate data in different ways. ❖ Difficult to maintain.
  20. IDEA! Use assert queries for data validation!
  21. Using Apache Airflow Check Operators. Approach: extend the open-source Airflow check operator for queries running on the Qubole platform; run data validation queries; fail the operator if the validation fails.
  22. Creating the QuboleCheck operator
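The operator-creation slide is a code screenshot; here is a condensed, hypothetical sketch of the pattern (not the code actually contributed in AIRFLOW-2213): subclass QuboleOperator so that the assert query runs on Qubole, then apply CheckOperator semantics, failing the task if the result is empty or contains a falsy value. Fetching the command output is abstracted behind an injected callable because the real operator handles it internally through the Qubole hook:

from airflow.exceptions import AirflowException
from airflow.contrib.operators.qubole_operator import QuboleOperator


class QuboleCheckSketch(QuboleOperator):          # hypothetical class name
    def __init__(self, fetch_records, *args, **kwargs):
        super(QuboleCheckSketch, self).__init__(*args, **kwargs)
        self.fetch_records = fetch_records        # callable: context -> result record

    def execute(self, context):
        super(QuboleCheckSketch, self).execute(context)   # run the assert query on Qubole
        record = self.fetch_records(context)
        if not record:
            raise AirflowException("The assert query returned no results")
        if not all(bool(value) for value in record):
            raise AirflowException("Data validation failed for record: %s" % str(record))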
  23. Limitations of and Enhancements to the Open-Source Apache Airflow Check Operator
  24. 1. Compare data across engines. Problem: Airflow check operators required pass_value to be defined before the ETL starts. Use case: validating data import logic. Solution: make pass_value an Airflow template field so that it can be configured at run time; once it is a template field, the pass value can be injected through multiple mechanisms.
  25. pass_value as an Airflow template field
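The slide itself is a screenshot; here is a sketch of the idea, assuming the QuboleValueCheckOperator contributed alongside these changes (the task ids, query, and tolerance are illustrative): because pass_value is a template field, Jinja renders it at run time, so the expected value can come from an upstream task's XCom instead of being hard-coded when the DAG is authored:

from airflow.contrib.operators.qubole_check_operator import QuboleValueCheckOperator

count_check = QuboleValueCheckOperator(
    task_id="compare_row_counts",
    command_type="hivecmd",
    query="SELECT COUNT(*) FROM warehouse.events WHERE dt = '{{ ds }}'",
    # Rendered at run time: pulls the expected count pushed by an upstream task.
    pass_value="{{ ti.xcom_pull(task_ids='count_source_rows') }}",
    tolerance=0.01,                      # allow 1% drift (illustrative)
    dag=dag,
)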
  26. 2. Validate multi-row results. Problem: currently, Apache Airflow check operators consider only a single row for comparison. Use case: run GROUP BY queries and compare each of the resulting values against the pass_value. Solution: QuboleCheckOperator adds a `results_parser_callable` parameter; the function it points to holds the logic for returning the list of records on which the checks are performed.
  27. Parser function as a parameter to the Check operator
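Again a screenshot in the deck; a minimal sketch of the idea, with the parser name, output format, and query all assumed: the callable turns the raw command output into a list of records, and the check runs against every record, so with plain check semantics any record containing 0 fails the task:

from airflow.contrib.operators.qubole_check_operator import QuboleCheckOperator

def parse_count_rows(results):
    # Hypothetical parser: one record per output line, values cast to int so
    # that a zero count is falsy and therefore fails the check.
    return [[int(value) for value in line.split("\t")]
            for line in results.splitlines()]

per_account_check = QuboleCheckOperator(
    task_id="per_account_row_counts",
    command_type="hivecmd",
    query="SELECT COUNT(*) FROM warehouse.events GROUP BY account_id",
    results_parser_callable=parse_count_rows,
    dag=dag,
)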
  28. Integration of Apache Airflow Check Operators with Qubole ETLs
  29. ETL #1: Data Ingestion. Imports data from RDS tables into the Data Warehouse for analysis. Historical issues: mismatches with the source data, namely (1) data duplication and (2) data missing for certain durations. Checks employed: count comparison across the two data stores, source and destination. How the checks have helped us: verify and rectify the upsert logic (which is not a plain copy of RDS). PS: runtime fetching of expected values!
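One way to realize that runtime fetching of expected values (entirely illustrative: the connection id, table, and task names are assumptions): an upstream task counts the source rows in RDS and returns the number, which Airflow pushes to XCom, where the templated pass_value from slide 25 can pull it:

from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

def count_source_rows(**context):
    hook = MySqlHook(mysql_conn_id="rds_source")      # hypothetical connection id
    # The return value is pushed to XCom automatically.
    return hook.get_first("SELECT COUNT(*) FROM events WHERE dt = %s",
                          parameters=[context["ds"]])[0]

count_source = PythonOperator(
    task_id="count_source_rows",
    python_callable=count_source_rows,
    provide_context=True,
    dag=dag,
)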
  30. ETL #2: Data Transformation. Repartitions a day's worth of data into hourly partitions. Historical issues: (1) data ending up in a single partition (the default Hive partition); (2) wrong ordering of values in fields. Checks employed: (1) the number of partitions created is 24 (one for every hour); (2) check the value of the critical field "source". How the checks have helped us: verify and rectify the repartitioning logic.
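A sketch of what the 24-partitions check could look like as an assert query (table and column names assumed), using the value-check variant with a fixed pass_value of 24:

partitions_check = QuboleValueCheckOperator(
    task_id="hourly_partitions_check",
    command_type="hivecmd",
    query=("SELECT COUNT(DISTINCT hour) FROM warehouse.events "
           "WHERE dt = '{{ ds }}'"),
    pass_value=24,                       # exactly one partition per hour of the day
    dag=dag,
)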
  31. ETL #3: Cost Computation. Computes Qubole Compute Unit Hours (QCUH). Situation: we are narrowing the granularity of cost computation from daily to hourly. How the checks have helped: monitor the new data and alarm on mismatches between the trends of the old and new data.
  32. ETL #4: Data Transformation. Parses customer queries and outputs table-usage information. Historical issues: (1) data missing for a customer account; (2) data loss due to different syntaxes across engines; (3) data loss due to query-syntax changes across different versions of the data engines. Checks employed: (1) group by account ids; if any of them is 0, raise an alert; (2) group by engine type and account ids; if the error % is high, raise an alert. How the checks have helped us: insight into the amount of data loss, and feedback that helped us make the syntax checking more robust.
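A hypothetical sketch of the second of those checks (schema and threshold assumed): emit 1 per engine when the parse-failure rate is below 5% and 0 otherwise, so that, with the row parser from the earlier sketch, any failing engine fails the task:

error_rate_check = QuboleCheckOperator(
    task_id="parse_error_rate_check",
    command_type="hivecmd",
    query=("SELECT IF(SUM(CAST(parse_failed AS INT)) / COUNT(*) < 0.05, 1, 0) "
           "FROM warehouse.parsed_queries GROUP BY engine"),
    results_parser_callable=parse_count_rows,   # from the earlier sketch
    dag=dag,
)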
  33. FEATURES ❖ Ability to plug in different alerting mechanisms. ❖ Dependency management and failure handling. ❖ Ability to parse the output of the assert query in a user-defined manner. ❖ Run-time fetching of the pass_value against which the comparison is made. ❖ Ability to generate failure/success reports.
  34. LESSONS LEARNT ❖ One size doesn't fit all: estimation of data trends is a difficult problem. ❖ Delegate the validation task to the ETL itself.
  35. Source code has been contributed to Apache Airflow: AIRFLOW-2228 (enhancements to the Check operator) and AIRFLOW-2213 (adding the QuboleCheck operator).
  36. In data we trust! THANKS! Any questions? You can find us at: sakshib@qubole.com and sreenathk@qubole.com
