Pragmatics Driven Issues in Data and Process Integrity in Enterprises


Keynote/Invited Talk at the IFIP TC-11 First Working Conference on
Integrity and Internal Control in Information Systems
Zurich, Switzerland, December 4-5, 1997


  1. Pragmatics Driven Issues in Data and Process Integrity in Enterprises
     Keynote/Invited Talk, IFIP TC-11 First Working Conference on Integrity and Internal Control in Information Systems
     Zurich, Switzerland, December 4-5, 1997
     Amit Sheth, Large Scale Distributed Information System Lab, University of Georgia
  2. Three Real Challenges to Data Integrity
     Three realities of the IS environment:
     • Dirty data
     • Interdependent data
     • Process coordination / workflow management
     But traditional data integrity and database transaction solutions come up short ...
  3. Overview
     [Diagram: three problem areas, each paired with an approach]
     • Poor quality of data: data cleanup/purification
     • Inconsistent related data: correct inconsistencies
     • Process coordination: workflow specifications
     [Remaining diagram labels, layout not recoverable: Achieve Data, Data, Process Integrity, Transaction, Interdependent, Manage, Management, Data Management, Data Integrity]
  4. Dirty Data
     Users cite their biggest data warehouse challenges:
     [Chart labels and values, in transcript order: Managing Data Quality 46% Business Data Modeling 31% End-user Expectations 29% Legacy Data 25% Transformation 22% Business Rule Analysis 17% Management Expectations 16% Database Performance]
     Source: DCI/Meta Group, Inc.
  5. Dirty Data: Stories I have heard/seen
     • 30% fall-outs (“requests for manual assist”) due to mismatches between the address in a customer service request and the loop inventory database at a telco
     • The PUC insisted that a Regional Bell Company do something about reducing the 400 persons employed ($40 million+) to keep data consistent
  6. Dirty Data: Real World Stories
     • Insurance company regional data: 80% of claims had “broken leg” as the diagnosis*
     • With a 4% error rate, a $2 billion company forfeits $80 million in revenue*
     * Emily Kay, “Dirty Data Challenges Warehouses”, DW/Software Magazine, Oct. 97
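The revenue figure is simple proportional arithmetic, under the assumption (not stated on the slide) that a 4% data error rate translates directly into a 4% loss on $2 billion of revenue:

```latex
\[
0.04 \times \$2{,}000{,}000{,}000 = \$80{,}000{,}000
\]
```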
  7. Dirty Data: Data Quality Dimensions
     • invalid or impaired data
     • incomplete or missing data
     • inconsistent data
     How to continue business operations
     • by discounting the effect of poor-quality data
     • without worsening data quality
  8. Dirty Data: Improving Data Quality
     • Rule discovery, audit, scrubbing/cleansing/purifying, defect prevention
     • Commercial offerings give partial solutions to some aspects of identifying data quality problems and some aspects of cleanup (scrubbing)
  9. Dirty Data: NASD Data Quality Toolset
     • Client-access tool: Cognos, SAS, Applix
     • Conversion tool: ETI* Extract
     • Metadata tool: Platinum Tech’s Repository
     • Auditing tool: Prism Solution’s QDB Solutions QDB/Connect
     Problem: No integrated solution!
     From L. Wilson, “NASD: Securing Data Quality”, DW/Software Magazine, Oct. 97
  10. Dirty Data: More on Commercial Solutions
      • Commercial solution providers: Information Builders, Platinum Technologies, SAS Institute, Group 1 Software, Vality Technology, First Logic
      • Hundreds of thousands of dollars: Why?
  11. Dirty Data: Issues reasonably addressed
      • Conceptual framework -- MIT’s work gives a very good start
      • Most existing solutions apply to a single data repository or database -- possible to use remote data access solutions for one database at a time
  12. Dirty Data: Challenges to be addressed
      • Most solutions deal with structured/relational data only -- increasingly data is in different media
      • Most solutions deal with creation of a data warehouse; OK for decision support, but what about operational use?
  13. Dirty Data: Data Quality Challenges
      How to continue business operations
      • by discounting the effect of poor-quality data
      • without worsening data quality
      “A Mediator for Approximate Consistency: Supporting ‘Good Enough’ Materialized Views”, Seligman-Kerschberg
  14. Dirty Data: A Research Project: Q-Data
      [Architecture diagram, top to bottom]
      • GUI: Define Rules & Programs; Invoke Validation & Cleanup Rules; Display Results or Consult
      • Declarative Rules and Procedural Programs (Ref. Integrity, Approx. Match, Consistency) in LDL++ (LDL/Prolog/C++)
      • Database Access Interface to Databases; Legacy System Interface to Legacy Information Systems
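The rule kinds named on this slide (referential integrity, approximate match) can be illustrated with a small Python sketch. This is not Q-Data, which is described as built on LDL++; the function names, the record layout (`id`, `loop_id`), and the 0.8 similarity threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase and drop punctuation so formatting noise does not count as a mismatch."""
    return re.sub(r"[^a-z0-9 ]", "", address.lower()).strip()


def approx_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Approximate-match rule: similarity ratio of the normalized strings meets the threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def missing_references(requests, inventory_ids):
    """Referential-integrity rule: ids of requests whose loop_id is absent from the inventory."""
    return [r["id"] for r in requests if r["loop_id"] not in inventory_ids]


# Hypothetical data: two service requests checked against a loop inventory.
requests = [
    {"id": "R1", "loop_id": "L100"},
    {"id": "R2", "loop_id": "L999"},
]
inventory_ids = {"L100", "L200"}
```

A rule of the first kind would let "12 Main St." in a service request match "12 main street" in an inventory record instead of falling out for manual assistance; the threshold trades false matches against fall-outs and would need tuning on real data.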
  15. Dirty Data: Interested in More Information?
      • Industry/Practice:
        – “Data Quality Maze”, DW/Software Magazine, Oct. 1997
      • MIS: Total Data Quality Research
      • Computer Science Research: Sheth-Wood-Kashyap, Ami Motro, ...
  16. Interdependent Data and Multidatabase Consistency
      Function-oriented application systems are created independently to automate different parts of operation. Hence independently developed databases where:
      • information about a subject is distributed in multiple systems
      • a new application manages existing data independently
  17. Interdependent Data and Multidatabase Consistency
      [Diagram: data shared across three systems]
      • Systems: Order Processing System, Billing System, Planning & Engineering System
      • Data: Customer Data, Inventory Data, Assignment Data, Reference Data
  18. Interdependent Data: War Stories
      • Data analysis: One data element was in 43 separate legacy system files, maintained by 43 separate programs.
      • Telco: Customer information is probably in over 100 information systems. Some information may be overlapping, and in different representational forms.
  19. Interdependent Data: Real Example: Provisioning Residential Line
  20. Interdependent Data
      Lack of understanding and maintenance of data interdependency leads to data inconsistency and requires
      • manual intervention to complete failed operations
      • work-arounds/patches
      • manual reconciliation
      and results in
      • incorrect and wasted operations, poor quality of work
      • difficulty in interoperability, high costs
      • lost business opportunities
  21. Interdependent Data: A Framework for Specifying Interdependent Data
      [Diagram: components of a data dependency descriptor]
      • dependency: structural, control
      • consistency: data state, temporal
      • restoration: coupled/decoupled, vital/non-vital
      Sheth and Rusinkiewicz 1990
  22. Interdependent Data: A Case Study at Bellcore
      [Diagram labels: Planning Apps., Inventory/Reference Data, Planning, Source, Engineering, Design]
      Karabatis and Sheth 92
  23. Interdependent Data: An Example of Interdependent Data
      YEAR (…, demand, …)
      DMD_CAP (…, assigned, …)
      ENTITY_JOB (…, capacity, …)
      • Dependency: join and aggregation/sum over YEAR and ENTITY_JOB
      • Consistency requirement: C1: demand/capacity > 0.9, or C2: (capacity - demand) < 5000
      • Restoration procedure:
        – when C1 then regular_planning_update as non-coupled
        – when C2 then emergency_planning_update as coupled & vital
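The example on this slide can be read as a small rule evaluation: aggregate demand and capacity, test C1 and C2, and pick the matching restoration procedure. The Python sketch below is one possible reading, not the original specification language. The `Restoration` class and the `restorations` function are hypothetical names, both conditions are assumed to be allowed to fire together, and `vital=False` for the non-coupled case is an assumption, since the slide gives no vital flag for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restoration:
    procedure: str   # procedure name taken from the slide
    coupled: bool    # coupled: runs as part of the triggering transaction
    vital: bool      # vital: its failure fails the triggering transaction


def restorations(demand_rows, capacity_rows):
    """Evaluate C1 and C2 over summed demand and capacity; return the procedures to run."""
    demand = sum(demand_rows)        # aggregation over YEAR
    capacity = sum(capacity_rows)    # aggregation over ENTITY_JOB
    actions = []
    # C1: demand/capacity > 0.9 (guarded against zero capacity)
    if capacity > 0 and demand / capacity > 0.9:
        actions.append(Restoration("regular_planning_update", coupled=False, vital=False))
    # C2: (capacity - demand) < 5000
    if capacity - demand < 5000:
        actions.append(Restoration("emergency_planning_update", coupled=True, vital=True))
    return actions
```

With demand 95,000 against capacity 104,000, only C1 holds (ratio about 0.91, headroom 9,000), so only the non-coupled regular update is scheduled; at demand 96,000 against capacity 100,000 both conditions hold.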
  24. Interdependent Data: Types of Dependency Specification
      • Redundant data
        – replicated data, primary-secondary copies
        – vertical/horizontal partitions
      • Semantic integrity constraints
        – value existential constraints
      • Derived data
  25. Interdependent Data: Types of Consistency Requirements
      • Immediate consistency
      • Eventual consistency
      • Lagging consistency
        – Temporal criteria: at or before some time, within an interval, periodically
        – Data state criteria: number of operations or data items changed, value of change, before or after an operation
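One way to picture the temporal and data state criteria is as bounds on how far a dependent copy may lag its source. The Python sketch below is illustrative only: `LaggingPolicy`, its field names, and the two example policies are hypothetical, and treating immediate consistency as "restore after any single pending operation" is an assumption rather than a definition from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LaggingPolicy:
    """Hypothetical bounds on how far a dependent copy may lag its source."""
    max_age_seconds: Optional[float] = None   # temporal criterion: within an interval
    max_pending_ops: Optional[int] = None     # data state criterion: number of operations
    max_value_drift: Optional[float] = None   # data state criterion: value of change

    def must_restore(self, age_seconds: float, pending_ops: int, value_drift: float) -> bool:
        """Return True as soon as any configured bound has been reached."""
        if self.max_age_seconds is not None and age_seconds >= self.max_age_seconds:
            return True
        if self.max_pending_ops is not None and pending_ops >= self.max_pending_ops:
            return True
        if self.max_value_drift is not None and abs(value_drift) >= self.max_value_drift:
            return True
        return False


# Assumed reading of immediate consistency: any pending operation forces restoration.
IMMEDIATE = LaggingPolicy(max_pending_ops=1)
# Lagging consistency: refresh at least hourly, or after 100 source operations.
HOURLY_OR_100_OPS = LaggingPolicy(max_age_seconds=3600, max_pending_ops=100)
```

Leaving every bound unset gives a policy that never forces restoration, which is why a usable policy needs at least one criterion.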
  26. Interdependent Data: Some Relevant Work: Criteria
      • replica control: primary-secondary copies, one-copy serializability
      • epsilon-serializability [Pu & Leff], N-ignorance [Krishnakumar & Bernstein], k-completeness [Sarin et al]
      • eventual and lagging consistency [Sheth et al]
  27. Interdependent Data: Some Relevant Work: Modeling
      • Identity Connections [Wiederhold & Qian]
      • Demarcation Protocol [Barbara and Garcia-Molina]
      • Data Dependency Descriptors [Rusinkiewicz/Sheth/Karabatis]
      • Existence/Value Dependency [Ceri & Widom], Interdependencies (existence, structural, behavioral, value) [Li and McLeod]
      • Computational Invariants, PATH structure [Etzion]
      • ECA Rules [Dayal]
  28. Interdependent Data: Enforcement Strategies
      • Application code
      • Middleware: Transaction Monitors, Replication Server [Notes]
      • Quasi-copies [Barbara et al]
      • Production Rules and Persistent Queues [Ceri and Widom]
      • Extended Distributed Transaction Management
        – Polytransactions [Sheth et al], Quasi-transactions [Arizio et al]
  29. Interdependent Data: Polytransactions
      [Diagram: root transaction (t1) at the IDS spawns related transactions t2a, t2b, and t3, labeled coupled-vital, coupled-non-vital, and non-coupled; each runs at an Interdependent Data Manager on top of a Local DBMS]
      • How are related transactions determined? => S, U, P
      • When is a related transaction created? => C, Policy
      • What does a related transaction do? => A
  30. Interdependent Data: Enforcement Policy
      [Diagram labels: current, consistent, inconsistent]
      • eager restoration
      • partial restoration
      • late restoration or lazy restoration
  31. Workflow: Workflow Management
      • Workflow Management (WFM) is the automated coordination, control, and communication of work, both of people and computers, in the context of organizational processes, through the execution of software in a network of computers whose order of execution is controlled by a computerized representation of the business processes.
  32. Workflow: What is workflow about?
      • Effective coordination, control and communication of work among human participants and system/information resources to orchestrate organizational processes
      • Need to improve human/organization productivity, efficiency, quality of work
      • New paradigm for “Programming in the large”
  33. METEOR Workflow Model (very high level)
      [Diagram labels: start, task, task, task, end, filter task; interface (×3); aux. proc., sys proc., proc.; entity (×3)]
  34. METEOR2 Task Models
      [State diagrams for three task models]
      • Non-Transactional: states Initial, Executing, Done, Failed; transitions start, done, fail
      • Transactional: states Initial, Executing, Aborted, Committed; transitions start, abort, commit
      • Open 2PC transactional: states Initial, Executing, Done, Prepared, Aborted, Committed; transitions start, done, prepared, abort, commit
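Task models like these are naturally expressed as transition tables. The Python sketch below covers only the non-transactional and transactional models, reading each as Initial -start-> Executing followed by one terminal transition; that edge layout is inferred from the state and transition names, and the `Task` class and `TASK_MODELS` table are hypothetical. The open 2PC model is left out because its exact edges are not clear from the labels alone; it could be added as a third table.

```python
# (state, event) -> next state, one table per task model
TASK_MODELS = {
    "non_transactional": {
        ("Initial", "start"): "Executing",
        ("Executing", "done"): "Done",
        ("Executing", "fail"): "Failed",
    },
    "transactional": {
        ("Initial", "start"): "Executing",
        ("Executing", "commit"): "Committed",
        ("Executing", "abort"): "Aborted",
    },
}


class Task:
    """A task instance that only moves along the transitions of its model."""

    def __init__(self, model: str):
        self.transitions = TASK_MODELS[model]
        self.state = "Initial"

    def fire(self, event: str) -> str:
        """Apply an event; reject it if the model has no such transition from the current state."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} is not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state
```

A workflow scheduler built this way can treat every task uniformly (fire an event, observe the new state) while each task type keeps its own notion of success and failure.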
  35. A Complex Real-world Example
      CLINICAL SUBSYSTEM
      • Generates alerts to identify patient’s needs and contraindications to caution providers
      • Reminders to parents
      • Health providers can obtain up-to-date clinical and eligibility information
      • Reports to state
      • Hospitals and clinics update central databases after encounters
      TRACKING SUBSYSTEM
      • Health agencies can use reports generated to track population’s needs
      • SDOH and CHREF maintain databases, support EDI transactions
      • Hospitals and case workers can update patient’s eligibility data
      • State and HMOs can reach out to the population
      • HMOs can keep track of performance
  36. Implementation Testbed
      [Deployment diagram; recoverable labels only]
      • Roles: Admit Clerk, Triage Nurse, Doctor/NP, Maternity Ward, Administrator, Case Worker, etc.
      • Middleware: CORBA (ORBeline), Web Servers, Network File System, Internet
      • Hosts: Iris (Pentium / Windows NT), Om (SunSparc 20 / Solaris), Optimus (SunSparc 2 / Solaris), Ra (SunSparc 20 / Solaris)
      • DBMSs: Illustra DBMS, Oracle7 DBMS
      • Data: MPI, MEI, Immunization Db, Detailed Encounter Db, Insurance Eligibility Db, Db Files, Detailed Encounter Data
      • Sites: CHREF, CHREF/SDOH, Hospital, Clinic, ED, POMS
  37. Workflow: Data Integrity Challenges
      • Workflows express application level integrity needs
        – e.g., customer information available to task 1 should be consistent with the related information available to task 2 even if both execute quite independently
      • In wake of -- inter-workflow requirements
      • Integrity of specification for adaptive workflows
  38. Workflow: Weaknesses of State-of-the-art WFMS
      • Lack of clear theoretical basis
      • Undefined correctness criteria
      • Limited support for:
        – Concurrency Control
        – Interoperability between workflow systems
        – Scalability
        – Availability
        – Recovery (no human assisted recovery)
  39. Workflow: Transactions to the rescue?
      • DB transactions and DP transactions address the correctness, consistency, and recovery issues to different degrees, and have a strong theoretical foundation
      • BUT can they apply to Workflow Management? Applications and environments differ significantly!
  40. Workflow: Transactions in WFMS
      • Task specific:
        – transactional tasks (e.g., database related)
        – distributed transaction processing
      • Domain specific:
        – EDI, HL7
        – business contracts
  41. Workflow: Transactions in WFMS
      • Business-process specific:
        – workflow correctness and reliability from a business process point of view
        – roles, worklists, error handling
      • Infrastructure specific (each with their own notions):
        – CTM, DOM (CORBA), WWW, TP-monitors, Lotus Notes
  42. Workflow: An intuitive argument - why extended transactions don’t apply
      • ATMs were often motivated by a particular domain or a set of applications ... too narrow a scope in many cases
      • Workflow is more horizontal in nature; many ATMs have been vertical in nature (transaction concepts scale relatively well with hierarchical decompositions)
      • Significant human involvement, long running, autonomous systems, ...
  43. Workflow: Characteristics of Large-Scale Real-World Workflow Applications
      • HAD computing environments
      • Multiple communication paradigms
      • Humans, legacy applications, and other non-transactional tasks
      • Organizational requirements (roles, authentication, security, etc.)
      • Heterogeneous multimedia data
      • Dynamic and virtual enterprises
      • Electronic commerce
  44. Workflow: Our view
      • In the context of workflows:
        – basis for modeling transactional tasks … YES
        – basis for modeling a group of tasks as a transaction … MAYBE or YES
        – basis for ensuring reliable communication between workflow components … MAYBE or YES
        – basis for modeling workflows ?? … NO!
      • Transactions -- yes, ATMs -- probably not
  45. Workflow: Our view
      • Notion of transactions in WFMS is more generalized than in TP-systems and DBMSs
      • Workflow systems should provide support for all forms of transactions
      • Strict transactional semantics not practical in workflow systems
      • Role of transactions in workflow systems:
        – for tasks within the workflow process
        – for implementing solutions to support fault-tolerance, concurrency control, correctness, recovery
  46. Conclusions
      • Neither systems environments nor data integrity requirements are as “simplistic”, “clean”, or “well defined” as in research
      • Research has taken a “black and white” approach -- we need to deal with “shades of gray”: how do you deal with the imperfect world?
  47. • We have to address issues that span multiple heterogeneous systems
        – numerous, more challenging, more complex
      • Both data and application/process level issues
  48. For more information: http://lsdis.cs.uga.edu
      For publications: check corresponding areas at