Target Corporation




   BI Framework
   Error Processing
   Mohan.Kumar2
Table of Contents

1.     Exception Handling Overview (ref 2.5.2)
     1.1.     Data Reprocessing
     1.2.     Infrastructure Exception Handling
     1.3.     Data Correction in DWH
2.     Error Processing – High Level
     2.1.     Capturing
     2.2.     Error threshold
     2.3.     Purging
        2.3.1.        Landing Area
        2.3.2.        Staging Area
        2.3.3.        EDW
        2.3.4.        Datamart
     2.4.     Purge threshold
     2.5.     Appendix
        2.5.1.        About Target
        2.5.2.        Reference
        2.5.3.        Other Contributors




1. Exception Handling Overview (ref 2.5.2)




Exception Handling deals with any abnormal termination, unacceptable event or incorrect data that
can impact the data flow or accuracy of data in the warehouse/mart.

Exceptions in ETL could be classified as Data Related Exceptions and Infrastructure Related
Exceptions.


Please note: temporary infrastructure glitches are not classified as exceptions, since they are usually
resolved by the time the job(s) are rerun. Their logs are nevertheless tracked and maintained.

The process of recovering or gracefully exiting when an exception occurs is called exception handling.




Data related exceptions are caused by incorrect data formats, incorrect values, or incomplete data
from the source system. These lead to data validation exceptions and data rejects. The process of
handling the data rejects is called Data Reprocessing.



Infrastructure related exceptions are caused by issues in the network, the database or the
operating system. Common infrastructure exceptions are FTP failures, database connectivity failures,
a full file system, etc.

Data related exceptions are usually documented in the requirements; if not, they must be, because
unhandled data related exceptions lead to inaccurate data in the warehouse/mart. We also keep a
threshold on the maximum number of validation or reject failures allowed per load. Any value above
the threshold would mean the data is too inaccurate because of too many rejections.

There is one more exception which is the presence of inaccurate or incorrect data in the warehouse.
This could happen due to

    1.   Incorrect or missed requirements, leading to incorrect ETL.
    2.   Incorrect interpretation of requirements leading to incorrect ETL.
    3.   Uncaught coding defects.
    4.   Incorrect data from source.

The process of correcting data already loaded in the warehouse involves both fixing the data already
loaded and preventing the inaccuracy from recurring in the future.




    1.1.         Data Reprocessing

Reprocessing is an exception handling process which involves the correction of data that could
not be loaded into the warehouse/mart.

There could be many reasons why source data gets rejected from the DWH. The most common are:

         Data Rejection - Source data not matching critical business codes/attributes. This is
called a Lookup Failure in ETL.
         Data Cleansing - Source data containing junk values for business-critical fields, hence
getting rejected during data validation.

There are three options for dealing with the rejected records. One, we could leave the rejected
data out of the DWH. Two, we could correct it, based on whether the rejected field is critical to
the business and worth reprocessing, and then load it into the DWH. The last option is to have the
data corrected at the source system and extracted again. The process of correcting the rejected
data and then loading it into the DWH is called Data Reprocessing.




As depicted in the figure above, data is rejected during the data validation, data cleansing and
data transformation processes. The rejected data is collected in temporary files on the ETL server
while the ETL is running. Once the ETL is complete, the rejected data is moved into the Landing
Area.

The end user and the business analyst are provided interfaces to read the reject data in the landing
area. They take this as input, analyze the cause of rejection and correct the data at the source
itself. Once the data is corrected at the source, it is extracted again (depicted by the brown line
in the figure). The corrected data is not expected to be rejected again unless the correction was
insufficient.
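The reject-collection flow above (temporary files during the run, moved to the landing area once the ETL completes) can be sketched as follows. The directory and file names are hypothetical, not part of the actual framework:

```python
import shutil
from pathlib import Path

def move_rejects_to_landing(tmp_dir: str, landing_dir: str) -> list:
    """After the ETL completes, move reject files from the ETL server's
    temporary area into the landing area for analyst review."""
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    moved = []
    for reject_file in sorted(Path(tmp_dir).glob("*.rej")):
        target = landing / reject_file.name
        shutil.move(str(reject_file), str(target))
        moved.append(target.name)
    return moved

# Demo with two hypothetical reject files written during an ETL run.
tmp = Path("tmp_rejects")
tmp.mkdir(exist_ok=True)
(tmp / "orders.rej").write_text("bad_row_1\n")
(tmp / "items.rej").write_text("bad_row_2\n")
moved = move_rejects_to_landing("tmp_rejects", "landing_area")
```

The analyst-facing interfaces would then read the files under the landing directory.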




Some business-critical data warehouses have very low tolerance for inaccurate data and need a
sophisticated, fast mechanism for handling rejected data in the landing area. Here we consider a
database to land the data. The database schema is the same as that of the source files/tables,
with two additional columns: one to flag whether the record was rejected by the ETL, and the other
to identify the date when the data was sent by the source system. Having a database makes it easy
to create applications that access and update the data in the landing area.

Please note that adding a database to the landing area adds infrastructure and maintenance costs.
It would also increase the number of processes in the extraction phase, thereby affecting the
performance of the ETL.
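A minimal sketch of such a landing table, using SQLite purely for illustration. The table and column names are assumptions, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The landing table mirrors the source schema, plus the two audit columns
# described above: reject_flag marks records rejected by the ETL, and
# source_date records when the source system sent the data.
conn.execute("""
    CREATE TABLE landing_orders (
        order_id    INTEGER,
        amount      REAL,
        reject_flag CHAR(1) DEFAULT 'N',
        source_date TEXT
    )
""")
conn.execute("INSERT INTO landing_orders VALUES (1, 10.5, 'N', '2024-01-01')")
conn.execute("INSERT INTO landing_orders VALUES (2, NULL, 'Y', '2024-01-01')")

# An analyst-facing application can query the rejects directly:
rejects = conn.execute(
    "SELECT order_id FROM landing_orders WHERE reject_flag = 'Y'"
).fetchall()
```

Because the rejects live in ordinary rows rather than flat files, update-and-resubmit applications are straightforward to build on top.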




   1.2.        Infrastructure Exception Handling


Infrastructure related exceptions are caused by issues in network connectivity, database
operations and the operating system.

Common Infrastructure exceptions are




Database errors such as connection failures, referential integrity constraint failures, primary key
constraint failures, incorrect credentials, data type mismatches, and NULLs in NOT NULL fields.
Network connection failures causing FTP failures.
Operating system issues on the ETL server, such as aborts due to insufficient memory, unmounted
file systems, 100% CPU utilization, incorrect file/directory permissions, or a full file system.

The diagram below depicts the exceptions and the process to handle them.

The above exceptions are generally detected by the ETL scheduler, which checks whether the ETL
process returned a non-zero value.

If an exception occurs, we make a log entry, send email or alerts notifying users that the ETL
process has aborted, and exit to the operating system with a non-zero value.

The notification alerts the IS team to take appropriate action so that the ETL process can be
restarted once the infrastructure issue is resolved.
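The scheduler-side detection described above can be sketched as follows. The command being run and the notification hook are illustrative placeholders, not the actual scheduler's interface:

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_scheduler")

notifications = []  # stand-in for the email/alert integration

def notify_support(command, code):
    # Placeholder for the mail/alert hook that pages the IS team.
    notifications.append((tuple(command), code))

def run_etl(command: list) -> int:
    """Run an ETL step; on a non-zero exit code, log the failure and
    notify the support team, then return the code to the caller."""
    result = subprocess.run(command)
    if result.returncode != 0:
        log.error("ETL step %s aborted with code %d", command, result.returncode)
        notify_support(command, result.returncode)
    return result.returncode

# Simulate an ETL process that aborts with a non-zero exit code.
rc = run_etl([sys.executable, "-c", "raise SystemExit(3)"])
```

Once the IS team resolves the infrastructure issue, the same command can be re-submitted to the scheduler.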




1.3.   Data Correction in DWH




The data in the DWH could be incorrect or inaccurate for a variety of reasons, mainly:

    1. Incorrect or missed requirements, leading to incorrect ETL.
    2. Incorrect interpretation of requirements, leading to incorrect ETL.
    3. Uncaught coding defects.
    4. Incorrect data from the source.

Reasons 1, 2 and 3 require us to revisit the ETL code with respect to the incorrect requirements,
missed requirements and uncaught defects.

The figure below depicts the process to be followed to correct data already loaded in the DWH.

Detection

Most important is the detection of inaccurate or incorrect data in the DWH. Incorrect data loaded
in the DWH is usually detected long after it has been loaded, when an end user identifies it in a
report.

Analysis

Once reported, we analyze the report and its metadata. This requires understanding the report
metadata, the calculations and the SQL generated by the report.

If there is no issue in the report definition, we analyze the data in the DWH. Once we have
pinpointed the table, attributes and data where the inaccuracy lies, we perform a root cause
analysis.

The root cause analysis requires us to check the data with respect to the requirements, design and
code. The root cause helps us identify the next course of action.

Missing Requirements - If the root cause is missing requirements, we go back to the users and
gather the complete requirements.

Misinterpretation of Requirements - Here too, we go back to the end user and clarify the
misinterpreted requirement.

Defect in the code - There is a possibility of bugs going undetected during the testing phase. If
undetected, a bug could cause inaccuracy in the data.

Correction Process

In case of missing requirements,

    1.    Get the new requirements from the users.
    2.    Document the new requirements.
    3.    Design the new ETL.
    4.    Code the new ETL.
    5.    Test the new ETL.
    6.    Take the DWH offline.
    7.    Perform the history load for the new requirements. This is possible only when we have
          added new tables or new attributes to the data model.
    8.    Check the report for new requirements.
    9.    If the reports are correct, then implement the new ETL into the regular ETL.
    10.   Perform the catch-up load for the duration the DWH was offline.
    11.   Bring the DWH online.

In case of misinterpreted requirements or undetected bugs,

    1.    Analyze the ETL and identify the changes in it.
    2.    Update the design.
    3.    Correct the code.
    4.    Test the code.
    5.    Create a patch to update the historical data (data already in DWH) to correct it.
    6.    Test the patch.
    7.    Take the DWH offline.
    8.    Run the patch.
    9.    Check the report for correction.
    10.   If the reports are correct, then implement the corrected ETL.
    11.   Perform the catch-up load for the duration the DWH was offline.
    12.   Bring the DWH online.




2. Error Processing – High Level




The error processing framework in Target standardizes how errors are captured, how thresholds are
enforced, and how obsolete data is purged.

   2.1. Capturing
       Data from all the source systems is dumped into the landing area as is. All records
       in the landing area are initially marked as valid during the load.

       On a given schedule, records are processed from the landing area to the staging area,
       and all business validations are executed on them. Once the staging load has finished,
       all records that were not loaded into the staging area are marked as invalid in the
       landing area.

       Information about every rejected record is stored in the error tables along with an
       error code. A separate reference table maps each error code to its description.

       Depending on the table(s), a record may be subject to multiple business validations,
       so a single source record can end up with multiple entries in the error table(s).

       Records marked as invalid are reprocessed on every staging load until they are purged
       or a corrected record is sent from the source.
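The capture step can be sketched as below. The validation rules, error codes and field names are invented for illustration, not taken from the actual framework:

```python
# Hypothetical validation rules; each failure yields one error-table row,
# so a single source record can produce several entries.
ERROR_CODES = {
    "E001": "missing customer id",
    "E002": "negative amount",
}

def validate(record: dict) -> list:
    """Return (row_id, error_code) pairs for every failed business rule."""
    errors = []
    if not record.get("customer_id"):
        errors.append((record["row_id"], "E001"))
    if record.get("amount", 0) < 0:
        errors.append((record["row_id"], "E002"))
    return errors

error_table = []
landing = [
    {"row_id": 1, "customer_id": "C1", "amount": 10},
    {"row_id": 2, "customer_id": None, "amount": -5},  # fails both rules
]
valid = []
for rec in landing:
    errs = validate(rec)
    if errs:
        error_table.extend(errs)  # record stays marked invalid in landing
    else:
        valid.append(rec)
```

Note that row 2 produces two error-table entries, matching the multiple-validations behaviour described above.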

   2.2. Error threshold
       If the number of rejections reaches a given threshold limit, a mail is sent to the
       EAM / Business data quality team reporting the abnormal behavior, and the job is
       aborted.


Based on the team's feedback, the jobs are then rerun/re-triggered manually.
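A minimal sketch of the threshold check, assuming a percentage-based limit and a pluggable notification callback (the real framework's interface is not documented here):

```python
class ThresholdExceeded(Exception):
    """Raised to abort the job when rejections cross the threshold."""

def check_error_threshold(reject_count, total, threshold_pct, notify):
    """Abort the load when the rejection rate exceeds the configured
    threshold, after notifying the data quality team."""
    if total and (reject_count / total) * 100 > threshold_pct:
        notify(f"{reject_count}/{total} records rejected; job aborted")
        raise ThresholdExceeded(reject_count)

alerts = []
# Under the threshold: the load proceeds silently.
check_error_threshold(3, 1000, 5.0, alerts.append)
# Over the threshold: a mail is sent and the job aborts.
try:
    check_error_threshold(90, 1000, 5.0, alerts.append)
    aborted = False
except ThresholdExceeded:
    aborted = True
```

Whether the threshold is a percentage or an absolute count is a business decision; the structure is the same either way.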



2.3. Purging
   Purging deletes previous records that are no longer required by a given business
   process.

   The purging logic is as follows:

   2.3.1. Landing Area
          1. Valid records – valid records that have been loaded into the Staging area
             are retained for the previous 7 days only; the rest are purged.

          2. Invalid records – invalid records that have errored out of the Staging
             area are retained for 30 days; the rest are purged.

   2.3.2. Staging Area
          Truncate and load: an area where we load data and make sure it is good
          before making any changes to the warehouse tables.

   2.3.3. EDW
          Depending on business need, data is maintained in the EDW.

   2.3.4. Datamart
          Depending on business need, data is maintained in the data mart.

2.4. Purge threshold
   During purging, the business can set a threshold limit on the number of records
   being purged. If the threshold limit is crossed while deleting, the purge jobs are
   automatically aborted and a mail is sent to the EAM / Business data quality team
   for confirmation.

   Once the business confirms, the aborted jobs are triggered manually.
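A sketch of the purge-threshold behaviour, assuming a simple record-count limit; the function and parameter names are illustrative:

```python
deleted, mails = [], []

def run_purge(candidate_ids, limit, notify, confirmed=False):
    """Delete purge candidates unless their count crosses the business-set
    limit; in that case abort and mail the team for confirmation."""
    if len(candidate_ids) > limit and not confirmed:
        notify(f"purge of {len(candidate_ids)} records exceeds limit {limit}")
        return "aborted"
    deleted.extend(candidate_ids)
    return "purged"

# First run crosses the limit: the job aborts and a mail goes out.
first = run_purge(list(range(500)), limit=100, notify=mails.append)
# After the business confirms, the job is re-triggered manually.
second = run_purge(list(range(500)), limit=100, notify=mails.append, confirmed=True)
```

The confirmation flag stands in for the manual re-trigger step; in the real framework the rerun would come from the scheduler after the business responds.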



2.5. Appendix

   2.5.1. About Target
          TBU

2.5.2. Reference
     The Exception Handling Overview is an extract from www.dwhinfo.com written
     by Krishan.Vinayak@target.com

2.5.3. Other Contributors


           Krishan.Vinayak – Delivery Manager

           Devanathan.Rajagopalan – Senior Technical Architect

           Asis.Mohanty – BI Manager

           Joseph.Raj – Technical Architect





What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 

Recently uploaded (20)

Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 

BI Error Processing Framework

  • 1. Target Corporation BI Framework Error Processing Mohan.Kumar2
  • 2. Table of Contents
    1. Exception Handling Overview (ref 2.5.2)
      1.1. Data Reprocessing
      1.2. Infrastructure Exception Handling
      1.3. Data Correction in DWH
    2. Error Processing – High Level
      2.1. Capturing
      2.2. Error threshold
      2.3. Purging
        2.3.1. Landing Area
        2.3.2. Staging Area
        2.3.3. EDW
        2.3.4. Datamart
      2.4. Purge threshold
      2.5. Appendix
        2.5.1. About Target
        2.5.2. Reference
        2.5.3. Other Contributors
  • 3. 1. Exception Handling Overview (ref 2.5.2)
    Exception handling deals with any abnormal termination, unacceptable event, or incorrect data that can impact the data flow or the accuracy of data in the warehouse/mart. Exceptions in ETL can be classified as Data Related Exceptions and Infrastructure Related Exceptions.
    Please note: temporary infrastructure glitches are not classified as exceptions, since they are usually resolved by the time the job(s) are rerun; their logs are still tracked and maintained.
    The process of recovering, or gracefully exiting, when an exception occurs is called exception handling.
  • 4. Data related exceptions are caused by incorrect data formats, incorrect values, or incomplete data from the source system. These lead to data validation exceptions and data rejects. The process of handling the data rejects is called Data Reprocessing.
  • 5. Infrastructure related exceptions are caused by issues in the network, the database, and the operating system. Common infrastructure exceptions are FTP failure, database connectivity failure, a full file system, etc.
    Data related exceptions are usually documented in the requirements; if they are not, they must be, because unhandled data related exceptions lead to inaccurate data in the warehouse/mart. We also keep a threshold on the maximum number of validation or reject failures allowed per load: any value above the threshold means the data would be too inaccurate due to too many rejections.
    There is one more class of exception: the presence of inaccurate or incorrect data already in the warehouse. This can happen due to:
    1. Incorrect or missed requirements, leading to incorrect ETL.
    2. Incorrect interpretation of requirements, leading to incorrect ETL.
    3. Uncaught coding defects.
    4. Incorrect data from the source.
    Correcting data already loaded in the warehouse involves both fixing the loaded data and preventing the inaccuracy from persisting in the future.
    1.1. Data Reprocessing
    Reprocessing is an exception handling process that corrects data that could not be loaded into the warehouse/mart. There are many reasons why source data gets rejected from the DWH; the most common are:
    Data Rejection – source data not matching critical business codes/attributes. This is called a Lookup Failure in ETL.
    Data Cleansing – source data containing junk values for business critical fields, and hence getting rejected during data validation.
    There are two main options for dealing with rejected records: we can leave the rejected data out of the DWH, or, when the rejected field is critical to business and worth reprocessing, we can correct it and then load it into the DWH. The process of correcting the rejected data and then loading it into the DWH is called Data Reprocessing.
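    The reject-routing step described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the field names, error codes, and the lookup set are assumptions standing in for per-table business validations.

```python
# Minimal sketch of routing records to rejects during validation.
# VALID_DEPT_CODES stands in for a business code lookup; a miss on it
# models a Lookup Failure, and a junk amount models a cleansing reject.
VALID_DEPT_CODES = {"D01", "D02", "D03"}  # illustrative assumption

def validate(record: dict) -> list:
    """Return a list of error codes; an empty list means the record is clean."""
    errors = []
    if record.get("dept_code") not in VALID_DEPT_CODES:
        errors.append("ERR_LOOKUP_DEPT")        # lookup failure
    if not str(record.get("sales_amt", "")).replace(".", "", 1).isdigit():
        errors.append("ERR_INVALID_AMOUNT")     # junk value in a critical field
    return errors

def split_load(records):
    """Split a batch into loadable rows and rejects tagged with error codes."""
    loaded, rejects = [], []
    for rec in records:
        errs = validate(rec)
        if errs:
            rejects.append({"record": rec, "error_codes": errs})
        else:
            loaded.append(rec)
    return loaded, rejects
```

    The rejects, with their error codes attached, are what would later be handed to the reprocessing flow.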
  • 6. As depicted in the figure above, data is rejected during the data validation, data cleansing, and data transformation processes. The rejected data is collected in temporary files on the ETL server while the ETL is running. Once the ETL is complete, the rejected data is moved into the Landing Area. The end user and the business analyst are provided interfaces to read the reject data in the landing area. They take this as input, analyze the cause of rejection, and correct the data at the source itself. Once the data is corrected at the source, it is extracted again (depicted by the brown line in the figure). The corrected data is not expected to get rejected again unless the correction provided was insufficient.
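    The post-ETL step of moving reject files into the landing area can be sketched as below. The directory layout and the `.rej` suffix are illustrative assumptions; the framework's actual file conventions are not specified in this document.

```python
import shutil
from pathlib import Path

def move_rejects_to_landing(etl_tmp_dir: str, landing_dir: str) -> list:
    """After the ETL completes, move the temporary reject files collected
    on the ETL server into the landing area for analyst review."""
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    moved = []
    for reject_file in sorted(Path(etl_tmp_dir).glob("*.rej")):  # assumed suffix
        target = landing / reject_file.name
        shutil.move(str(reject_file), str(target))
        moved.append(target.name)
    return moved
```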
  • 7. Some business critical data warehouses have very low tolerance for inaccurate data and need a sophisticated, fast mechanism for handling rejected data in the landing area. Here we use a database to land the data. The database schema is the same as that of the source files/tables, with two additional columns: one to flag whether the record was rejected in ETL, and one to identify the date when the data was sent by the source system. Having a database makes it easy to create applications that access and update the data in the landing area. Please note that adding a database to the landing area adds infrastructure and maintenance costs. It also increases the number of processes in the extraction path, thereby affecting ETL performance.
    1.2. Infrastructure Exception Handling
    Infrastructure related exceptions are caused by issues in network connectivity, database operations, and the operating system. Common infrastructure exceptions are:
  • 8. Database errors such as connection failures, referential integrity constraint violations, primary key constraint violations, incorrect credentials, data type mismatches, and NULLs in NOT NULL fields.
    Network connection failures causing FTP failure.
    Operating system issues on the ETL server, such as a full file system, aborts due to insufficient memory, un-mounted file systems, 100% CPU utilization, and incorrect file/directory permissions.
    The diagram below depicts the exceptions and the process to handle them. These exceptions are generally detected by the ETL scheduler, which checks whether the ETL process returned a non-zero value. If an exception occurs, we make a log entry, send email or alerts to notify users that the ETL process has aborted, and exit to the operating system with a non-zero value. The notification alerts the IS team to take appropriate action so that the ETL process can be restarted once the infrastructure issue is resolved.
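    The detect/log/notify pattern above can be sketched as follows. The `notify` hook and the ETL command are placeholders; the real framework's scheduler and alert channel (mail, pager, etc.) are not detailed in this document.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_scheduler")

def notify(message: str) -> None:
    """Stand-in for the email/alert hook described above."""
    log.error("ALERT: %s", message)

def run_job(command: list) -> int:
    """Run an ETL step; on a non-zero exit, log the abort and alert
    the IS team, then propagate the non-zero code to the caller."""
    result = subprocess.run(command)
    if result.returncode != 0:
        log.error("ETL process aborted with code %d", result.returncode)
        notify("ETL aborted (exit %d); IS team action needed" % result.returncode)
    return result.returncode
```

    A wrapper script would call `sys.exit(run_job([...]))` so that the scheduler itself sees the non-zero value and can restart the job once the infrastructure issue is resolved.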
  • 9. 1.3. Data Correction in DWH
    The data in the DWH could be incorrect or inaccurate for a variety of reasons, mainly:
    1. Incorrect or missed requirements, leading to incorrect ETL.
    2. Incorrect interpretation of requirements, leading to incorrect ETL.
    3. Uncaught coding defects.
    4. Incorrect data from the source.
    Reasons 1, 2, and 3 require us to revisit the ETL code with respect to the incorrect requirements, missed requirements, and uncaught defects. The figure below depicts the process to be followed to correct data already loaded in the DWH.
    Detection – The most important step is detecting the inaccurate or incorrect data in the DWH. Incorrect data is usually detected long after it has been loaded, when an end user identifies it in a report.
    Analysis – Once reported, we analyze the report and its metadata. This requires understanding the report metadata, calculations, and the SQL generated by the report.
  • 10. If there is no issue in the report definition, we analyze the data in the DWH. Once we have pinpointed the table, attributes, and data where the inaccuracy lies, we perform a root cause analysis, checking the data against the requirements, design, and code. The root cause identifies the next course of action:
    Missing requirements – we go to the users and get the complete requirements.
    Misinterpretation of requirements – here too we go to the end user and clarify the misinterpreted requirement.
    Defect in the code – bugs can go undetected during the testing phase; if undetected, a bug can cause inaccuracy in the data.
    Correction Process
    In case of missing requirements:
    1. Get the new requirements from the users.
    2. Document the new requirements.
    3. Design the new ETL.
    4. Code the new ETL.
    5. Test the new ETL.
    6. Take the DWH offline.
    7. Perform the history load for the new requirements. This is possible only when we have added new tables or new attributes to the data model.
    8. Check the report for the new requirements.
    9. If the reports are correct, implement the new ETL into the regular ETL.
    10. Perform the catch-up load for the duration the DWH was offline.
    11. Bring the DWH online.
    In case of misinterpreted requirements or undetected bugs:
    1. Analyze the ETL and identify the changes required.
    2. Update the design.
    3. Correct the code.
    4. Test the code.
    5. Create a patch to correct the historical data (data already in the DWH).
    6. Test the patch.
    7. Take the DWH offline.
    8. Run the patch.
    9. Check the report for the correction.
    10. If the reports are correct, implement the corrected ETL.
    11. Perform the catch-up load for the duration the DWH was offline.
    12. Bring the DWH online.
  • 11. 2. Error Processing – High Level
    The error processing at Target follows the flow described below.
    2.1. Capturing
    Data from the various source systems is dumped into the landing area as-is. All records in the landing area are initially marked as valid during the load. On a given schedule, the records are processed from the landing area to the staging area, and all business validations are executed on them. Once the staging load is finished, all records that were not loaded into the staging area are marked as invalid in the landing area. Information about every rejected record is stored in the error tables along with an error code; a separate reference table describes each error code. Depending on the table(s), there can be multiple business validations per record, so a given source record can end up with multiple entries in the error table(s). Records marked as invalid are processed in every staging load until they are purged or a corrected record is sent from the source.
    2.2. Error threshold
    If the number of rejections reaches a given threshold limit, a mail is sent to the EAM / Business data quality team reporting the abnormal behavior, and the job is aborted.
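    The error-threshold check can be sketched as below. The threshold value and notification hook are illustrative assumptions; the document does not state the actual limit or alerting mechanism.

```python
REJECTION_THRESHOLD = 100  # assumed per-load limit; the real value is configurable

class ThresholdExceeded(Exception):
    """Raised to abort the job when rejections exceed the allowed limit."""

def check_error_threshold(rejected_count: int, notify) -> None:
    """Alert the EAM / data quality team and abort the staging load when
    the number of rejected records reaches the threshold."""
    if rejected_count >= REJECTION_THRESHOLD:
        notify("%d rejections in this load; job aborted" % rejected_count)
        raise ThresholdExceeded(rejected_count)
```

    The sketch only aborts and notifies; per the process above, the job is rerun manually after the team's feedback rather than retried automatically.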
  • 12. Based on the feedback, the jobs are rerun/re-triggered manually.
    2.3. Purging
    Purging deletes older records that are no longer required by a given business process. The purging logic applied to each area is as follows:
    2.3.1. Landing Area
    1. Valid records – valid records that have been loaded into the staging area are retained for the previous 7 days only; the rest are purged.
    2. Invalid records – invalid records that have errored out of the staging area are retained for 30 days; the rest are purged.
    2.3.2. Staging Area
    Truncate and load. An area where we load and make sure the data is good before we make any changes to the warehouse tables.
    2.3.3. EDW
    Depending on business need, data is maintained in the EDW.
    2.3.4. Datamart
    Depending on business need, data is maintained in the datamart.
    2.4. Purge threshold
    During purging, the business can set a threshold limit on the number of records being purged. If the threshold limit is crossed while deleting, the purge jobs are automatically aborted and a mail is sent to the EAM / Business data quality team for confirmation. Once the business confirms, the aborted jobs are triggered manually.
    2.5. Appendix
    2.5.1. About Target
    TBU
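    The landing-area retention rules and the purge threshold can be sketched together as follows. The 7-day and 30-day windows come from the rules above; the row shape and field names are illustrative assumptions.

```python
from datetime import date

VALID_RETENTION_DAYS = 7     # loaded records kept for the previous 7 days
INVALID_RETENTION_DAYS = 30  # errored records kept for 30 days

def purge_landing(rows, today, threshold=None):
    """Return (kept, purged) per the landing-area retention rules.
    If a purge threshold is set and exceeded, abort without deleting,
    mirroring section 2.4's confirmation step."""
    kept, purged = [], []
    for row in rows:
        limit = INVALID_RETENTION_DAYS if row["is_rejected"] else VALID_RETENTION_DAYS
        age = (today - row["source_date"]).days
        (purged if age > limit else kept).append(row)
    if threshold is not None and len(purged) > threshold:
        raise RuntimeError("purge threshold exceeded: %d rows" % len(purged))
    return kept, purged
```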
  • 13. 2.5.2. Reference
    The Exception Handling Overview is an extract from www.dwhinfo.com written by Krishan.Vinayak@target.com.
    2.5.3. Other Contributors
    Krishan.Vinayak – Delivery Manager
    Devanathan.Rajagopalan – Senior Technical Architect
    Asis.Mohanty – BI Manager
    Joseph.Raj – Technical Architect