ChakraView
A 360° approach to data quality
Shankar Manian
Keerthika Thiyagarajan
Background
● ~15 years in Big Data...
● ...as Data Janitors
● Can we do better?
Data Quality - Missing Focus
● Afterthought
● Needle in a haystack
● Huge cost
Detection - Missing Dimensions
● Completeness
● Consistency
● Auditability
Cleansing - The Hidden Cost
● Trace the issue to source
● No SOP on how to fix
● Hard to Automate
Visibility - Or the lack of it
● Impact - Cost of bad data
● Breakdown and Prioritization
● Push quality upstream
State before
● Stakeholder driven
● Reactive process
● Business metrics
● Huge monetary impact
● Iterative Discovery
Validations Framework
● Granular Validations -> Business metrics
● Self serve onboarding
● Trigger on data refresh
● System health dashboard
Bad Records (PaymentGateway * BankStatement * Ledger)

TransactionId  OrderId  Amount  B.Amount  InvoiceId  L.Amount  Issue
TX1            OD1      100     100       I1         10        Amount mismatch
TX2            OD2      50      50        I2         50
TX3            OD3      75      75        I3         75
TX4            OD4      200     200                            Entry missing in Ledger
TX5            OD5      50                I5         50        Entry missing in Bank statement
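The table above comes from cross-comparing the three datasets. A minimal PySpark sketch of that kind of reconciliation, using illustrative DataFrames and column names rather than the framework's actual schema:

# Sketch: reconcile payment gateway and ledger records.
# DataFrame and column names are illustrative, not the framework's schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("amount_recon_sketch").getOrCreate()

pg = spark.createDataFrame(
    [("TX1", "OD1", 100), ("TX2", "OD2", 50), ("TX4", "OD4", 200)],
    ["transaction_id", "order_id", "amount"])
ledger = spark.createDataFrame(
    [("TX1", "I1", 10), ("TX2", "I2", 50)],
    ["transaction_id", "invoice_id", "l_amount"])

# Full outer join so records missing on either side are retained.
joined = pg.join(ledger, "transaction_id", "full_outer")

bad = joined.withColumn(
    "issue",
    F.when(F.col("amount").isNull(), "Entry missing in PaymentGateway")
     .when(F.col("l_amount").isNull(), "Entry missing in Ledger")
     .when(F.col("amount") != F.col("l_amount"), "Amount mismatch")
).filter(F.col("issue").isNotNull())

bad.show()  # TX1 -> Amount mismatch, TX4 -> Entry missing in Ledger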
Salient features
● Abstract templates
○ Null check
○ Datatype compliance
○ Aggregated check
○ Range check
○ Cross comparison check
● Filter and transformation support
○ Exclude few records
○ Case-insensitive conversion
● Construct target dataframe
● Row level results
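As a rough illustration of the abstract-template idea, here are two of the checks above written as parameterised PySpark functions; the names and signatures are assumptions, not the framework's API:

# Sketch of two abstract templates as reusable, parameterised checks.
from pyspark.sql import DataFrame, functions as F

def null_check(df: DataFrame, column: str) -> DataFrame:
    """Return the rows that violate a NOT NULL expectation."""
    return df.filter(F.col(column).isNull())

def range_check(df: DataFrame, column: str, lo, hi) -> DataFrame:
    """Return the rows whose value falls outside [lo, hi]."""
    return df.filter(~F.col(column).between(lo, hi))

# Usage: violations = range_check(ledger, "l_amount", 0, 1_000_000)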
Validations UI
Sample Validation
{
  "fact": [{
    "fact_1": "payment_gateway",
    "fact_2": "ledger",
    "join_type": "full_outer_join",
    "join_columns": [{
      "fact_1_column": "transaction_id",
      "fact_2_column": "transaction_id",
      "operator": "equal"
    }]
  }],
  "group_by_columns": ["transaction_id"],
  "idempotency_columns": ["transaction_id"],
  "validation_configurations": [{
    "name": "amount_recon",
    "operator": "equal",
    "expression_list": [{
      "expression": {
        "operator": "amount",
        "terminal": "pg_amount"
      }
    }, {
      "expression": {
        "operator": "l.amount",
        "terminal": "ledger_amount"
      }
    }]
  }]
}
Data Flow
Fact refresh -> Trigger from Azkaban -> Run Spark job (using the Template Library and Validation Configuration) -> Publish validation failures -> Datastore -> Dashboard
Until now we were blissfully ignorant; now we spend multiple man-hours categorising the bad records
Root Cause Analysis (RCA): Bank Statement * Ledger

TransactionId  OrderId  B.Amount  InvoiceId  L.Amount  Category
TX1            OD1      100       I1         10        Amount wrong in Ledger entry
TX5            OD4      200                            Upstream Failure - Payments
TX6            OD6                I6         50        File upload issue
Combinatorial explosion
● The cycle is longer for big data due to the complexity of the system
● Time consuming
● Error prone
● Humanly impossible
● Real-time systems have ELK-like tooling
● No equivalent tools exist for RCA on big data
How do we make this operation cheap?
Auto-RCA
● Enrich logs and data from main pipeline
Enrichments
{
  "commerce_activity": {
    "activityType": "create_ledger",
    "activityId": "TX12345",
    "payload": "{\"event\":\"create_ledger\",\"entity_id\":\"TX12345\"}",
    "eventStatus": "ERRORED",
    "retryCount": 0
  },
  "error_details": {
    "activityType": "create_ledger",
    "activityId": "TX12345",
    "errorCode": "503",
    "errorDescription": "Error: EnricherException{statusCode=503}",
    "sourceSystem": "IRN",
    "upstreamUriSignature": "/payment/<transaction>",
    "upstreamUrl": "/payment/TX12345",
    "upstreamHttpMethod": "GET",
    "upstreamHeader": null,
    "upstreamPayload": null,
    "errorStatus": "OPEN",
    "failureCount": null
  }
}
Auto-RCA
● Perform 5-Whys RCA
● Hierarchical categorisation
● Leaf category -> Unique issues
● Amount mismatch
● Missing entries
○ Missing entries in Bank statement: Issue in Bank statement; Wrong value in file; File upload issue
○ Missing entries in ledger: Issue in invoice creation; Event processing failure; Event not arrived; Data not pushed to analytical store
● Unclassified
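In code, such a 5-Whys hierarchy can be approximated as an ordered rule list walked top-down, first match wins, with Unclassified as the fallback. A sketch with illustrative predicates; the real rules would come from the pluggable categorisation logic:

# Sketch: hierarchical categorisation as an ordered rule list.
RULES = [
    (lambda r: r.get("l_amount") is None and r.get("event_status") == "ERRORED",
     "Missing entries / Missing entries in ledger / Event processing failure"),
    (lambda r: r.get("l_amount") is None and r.get("event_status") is None,
     "Missing entries / Missing entries in ledger / Event not arrived"),
    (lambda r: r.get("b_amount") is None,
     "Missing entries / Missing entries in Bank statement"),
    (lambda r: r.get("amount") != r.get("l_amount"),
     "Amount mismatch"),
]

def categorise(record: dict) -> str:
    for predicate, category in RULES:
        if predicate(record):
            return category
    return "Unclassified"

assert categorise({"amount": 100, "b_amount": 100, "l_amount": 10}) == "Amount mismatch"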
Fixture
● Can we automate cleaning the data?
Fixture
Event processing failure -> reprocess_event
Event not arrived -> replay_event
Wrong value in file -> reprocess_file
File upload issue -> reprocess_file
Data not pushed to analytical store -> republish_ledger_entry
Fixture
{
  "flowName": "debtor_flow",
  "categoryName": "Event processing failure",
  "recipeName": "reprocess_event"
}
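A config entry like this could drive a small registry mapping recipe names to cleansing functions. A sketch in which the recipe bodies are hypothetical placeholders:

# Sketch: a recipe registry keyed by the "recipeName" in the fixture config.
def reprocess_event(record):
    print(f"re-queueing event for {record['transaction_id']}")

def replay_event(record):
    print(f"requesting upstream replay for {record['transaction_id']}")

RECIPES = {
    "reprocess_event": reprocess_event,
    "replay_event": replay_event,
}

def fix(record, fixture_config):
    """Dispatch a bad record to the recipe configured for its category."""
    if record["category"] == fixture_config["categoryName"]:
        RECIPES[fixture_config["recipeName"]](record)

fix({"transaction_id": "TX12345", "category": "Event processing failure"},
    {"flowName": "debtor_flow",
     "categoryName": "Event processing failure",
     "recipeName": "reprocess_event"})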
Fixture
● Recipes - Library of functions that automate the cleansing
● Leaf Category -> Recipe
● Sample Recipes
○ Reverse
○ Retry
○ Restore
Architecture
● Man-days reduced to a few hours
● Reactive to proactive
● Dev-friendly
● People independent
● Complete visibility
Next Steps
● Open source
● Data observability
● Performance optimisation
Questions?
ChakraView – A 360° Approach to Data Quality
Availability of high-quality data is central to the success of any organization in the current era. As every organization ramps up its collection and storage of data, that data's usefulness largely depends on confidence in its quality. In the Financial Data Engineering team at Flipkart, where the bar for data quality is 100% correctness and completeness, this problem takes on a wholly different dimension. Currently, countless data analysts and engineers hunt for issues in the financial data to keep it that way. We wanted a way that is less manual, more scalable, and more cost-effective.



As we evaluated various solutions available in the public domain, we found quite a few gaps.

● Most frameworks are limited in the kinds of issues they detect. While many detect internal consistency issues at the schema and dataset level, there are none that detect consistency issues across datasets or check for completeness.
● There is no common framework for cleaning and repairing data once an issue has been found.
● Fixing data quality issues requires the right categorization of the issues to drive accountability with the producer systems. Very few frameworks support categorisation of issues and visibility to the producers.

In this presentation, we discuss how we developed a comprehensive data quality framework. The framework was built with the assumption that the people interested in and involved in fixing these issues are not necessarily data engineers, so it is largely config driven, with pluggable logic for categorisation and cleaning. We then talk about how it helped achieve scale in fixing data quality issues and reduced many of the repeated issues.
