Data-Ed: Unlock Business Value through Data Quality Engineering


Published in: Technology

  1. Unlock Business Value through Data Quality Engineering. Presented by Peter Aiken, Ph.D. 10124 W. Broad Street, Suite C, Glen Allen, Virginia 23060. 804.521.4056
  2. Copyright 2013 by Data Blueprint. Unlock Business Value through Data Quality Engineering. Organizations must realize what it means to utilize data quality management in support of business strategy. This webinar focuses on obtaining business value from data quality initiatives. I will illustrate how organizations with chronic business challenges often can trace the root of the problem to poor data quality. Showing how data quality should be engineered provides a useful framework in which to develop an effective approach. This in turn allows organizations to more quickly identify business problems as well as data problems caused by structural issues versus practice-oriented defects and prevent these from re-occurring. Date: June 11, 2013. Time: 2:00 PM ET/11:00 AM PT. Presenter: Peter Aiken, Ph.D. Time: • timeliness • currency • frequency • time period. Form: • clarity • detail • order • presentation • media. Content: • accuracy • relevance • completeness • conciseness • scope • performance
  3. Get Social With Us! Live Twitter Feed: Join the conversation! Follow us: @datablueprint, @paiken. Ask questions and submit your comments: #dataed. Like Us: questions and comments. Find industry news, insightful content and event updates. Join the Group: Data Management & Business Intelligence. Ask questions, gain insights and collaborate with fellow data management professionals
  4. Meet Your Presenter: Peter Aiken, Ph.D. • 25+ years of experience in data management • Multiple international awards & recognition • Founder, Data Blueprint • Associate Professor of IS, VCU • President, DAMA International • 8 books and dozens of articles • Experienced w/ 500+ data management practices in 20 countries • Multi-year immersions with organizations as diverse as the US DoD, Nokia, Deutsche Bank, Wells Fargo, and the Commonwealth of Virginia
  5. Outline: 1. Data Management Overview 2. DQE Definitions (w/ example) 3. DQE Cycle & Contextual Complications 4. DQ Causes and Dimensions 5. Quality and the Data Life Cycle 6. DQE Tools 7. Takeaways and Q&A
  6. Five Integrated DM Practice Areas [diagram]. Practice areas: Data Program Coordination, Organizational Data Integration, Data Stewardship, Data Development, Data Support Operations. Diagram labels: Organizational Strategies, Goals, Business Data, Business Value, Application Models & Designs, Implementation, Direction, Guidance, Feedback, Standard Data, Data Asset Use, Integrated Models. Captions: Leverage data in organizational activities; Data management processes and infrastructure; Combining multiple assets to produce extra value; Organizational-entity subject area data integration; Provide reliable data access; Achieve sharing of data within a business area
  7. Five Integrated DM Practice Areas • Data Program Coordination: manage data coherently. • Organizational Data Integration: share data across boundaries. • Data Stewardship: assign responsibilities for data. • Data Development: engineer data delivery systems. • Data Support Operations: maintain data availability.
  8. Data Management Practices Hierarchy (after Maslow) • 5 Data Management Practice Areas / Data Management Basics • Are necessary but insufficient prerequisites to organizational data leveraging applications (that is, Self-Actualizing Data or Advanced Data Practices). Basic Data Management Practices: – Data Program Management – Organizational Data Integration – Data Stewardship – Data Development – Data Support Operations. • Cloud • MDM • Mining • Analytics • Warehousing • Big
  9. Data Management Body of Knowledge: Data Management Functions [diagram]
  10. DAMA DM BoK & CDMP • Published by DAMA International – The professional association for Data Managers (40 chapters worldwide) – DMBoK organized around • Primary data management functions focused around data delivery to the organization • Organized around several environmental elements • CDMP – Certified Data Management Professional – DAMA International and ICCP – Membership in a distinct group made up of your fellow professionals – Recognition for your specialized knowledge in a choice of 17 specialty areas – Series of 3 exams – For more information, please visit:
  11. Overview: Data Quality Engineering
  12. Outline (repeated from slide 5)
  13. A Model Specifying Relationships Among Important Terms [Built on definition by Dan Appleton 1983]. Diagram labels: Fact, Meaning, Data, Information, Request, Intelligence, Use. 1. Each FACT combines with one or more MEANINGS. 2. Each specific FACT and MEANING combination is referred to as a DATUM. 3. An INFORMATION is one or more DATA that are returned in response to a specific REQUEST. 4. INFORMATION REUSE is enabled when one FACT is combined with more than one MEANING. 5. INTELLIGENCE is INFORMATION associated with its USES. Wisdom & knowledge are often used synonymously.
  14. Definitions • Quality Data – Fit for use: meets the requirements of its authors, users, and administrators (adapted from Martin Eppler) – Synonymous with information quality, since poor data quality results in inaccurate information and poor business performance • Data Quality Management – Planning, implementation and control activities that apply quality management techniques to measure, assess, improve, and ensure data quality – Entails the "establishment and deployment of roles, responsibilities concerning the acquisition, maintenance, dissemination, and disposition of data" ✓ Critical supporting process from change management ✓ Continuous process for defining acceptable levels of data quality to meet business needs and for ensuring that data quality meets these levels • Data Quality Engineering – Recognition that data quality solutions cannot be managed but must be engineered – Engineering is the application of scientific, economic, social, and practical knowledge in order to design, build, and maintain solutions to data quality challenges – Engineering concepts are generally not known and understood within IT or business! Spinach/Popeye story from
  15. Improving Data Quality during System Migration • Challenge – Millions of NSN/SKUs maintained in a catalog – Key and other data stored in clear text/comment fields – Original suggestion was manual approach to text extraction – Left the data structuring problem unsolved • Solution – Proprietary, improvable text extraction process – Converted non-tabular data into tabular data – Saved a minimum of $5 million – Literally person centuries of work
  16. Determining Diminishing Returns
      Week # | Unmatched Items (% Total) | Ignorable Items (% Total) | Items Matched (% Total)
      1      | 31.47%                    | 1.34%                     | N/A
      2      | 21.22%                    | 6.97%                     | N/A
      3      | 20.66%                    | 7.49%                     | N/A
      4      | 32.48%                    | 11.99%                    | 55.53%
      …      | …                         | …                         | …
      14     | 9.02%                     | 22.62%                    | 68.36%
      15     | 9.06%                     | 22.62%                    | 68.33%
      16     | 9.53%                     | 22.62%                    | 67.85%
      17     | 9.50%                     | 22.62%                    | 67.88%
      18     | 7.46%                     | 22.62%                    | 69.92%
  17. Quantitative Benefits
      Time needed to review all NSNs once over the life of the project:
        NSNs: 2,000,000
        Average time to review & cleanse (in minutes): 5
        Total Time (in minutes): 10,000,000
      Time available per resource over a one year period of time:
        Work weeks in a year: 48
        Work days in a week: 5
        Work hours in a day: 7.5
        Work minutes in a day: 450
        Total Work minutes/year: 108,000
      Person years required to cleanse each NSN once prior to migration:
        Minutes needed: 10,000,000
        Minutes available person/year: 108,000
        Total Person-Years: 92.6
      Resource Cost to cleanse NSNs prior to migration:
        Avg Salary for SME year (not including overhead): $60,000.00
        Projected Years Required to Cleanse/Total DLA Person Year Saved: 93
        Total Cost to Cleanse/Total DLA Savings to Cleanse NSNs: $5.5 million
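The arithmetic on this slide can be reproduced in a few lines. This is only a sketch that restates the slide's own figures; the variable names are made up for illustration, and the cost is computed from the unrounded person-years, which lands near the slide's reported $5.5 million.

```python
# Figures taken from the slide; variable names are illustrative only.
nsn_count = 2_000_000            # NSNs to review
minutes_per_nsn = 5              # average time to review and cleanse one NSN
total_minutes = nsn_count * minutes_per_nsn

work_weeks_per_year = 48
work_days_per_week = 5
work_minutes_per_day = 7.5 * 60  # 7.5 hours = 450 minutes
work_minutes_per_year = work_weeks_per_year * work_days_per_week * work_minutes_per_day

# Person-years needed to touch every NSN once.
person_years = total_minutes / work_minutes_per_year

sme_salary = 60_000              # average SME salary per year, excluding overhead
total_cost = person_years * sme_salary

print(f"Total minutes needed:  {total_minutes:,}")
print(f"Minutes per person-yr: {work_minutes_per_year:,.0f}")
print(f"Person-years required: {person_years:.1f}")
print(f"Cost to cleanse:       ${total_cost:,.0f}")
```

Rounding the effort up to 93 person-years, as the slide does, gives a slightly higher figure; either way the estimate is in the range of five and a half million dollars.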
  18. Six misconceptions about data quality 1. You can fix the data 2. Data quality is an IT problem 3. The problem is in the data sources or data entry 4. The data warehouse will provide a single version of the truth 5. The new system will provide a single version of the truth 6. Standardization will eliminate the problem of different "truths" represented in the reports or analysis
  19. The Blind Men and the Elephant • It was six men of Indostan, / To learning much inclined, / Who went to see the Elephant / (Though all of them were blind), / That each by observation / Might satisfy his mind. • The First approached the Elephant, / And happening to fall / Against his broad and sturdy side, / At once began to bawl: / "God bless me! but the Elephant / Is very like a wall!" • The Second, feeling of the tusk / Cried, "Ho! what have we here, / So very round and smooth and sharp? / To me 'tis mighty clear / This wonder of an Elephant / Is very like a spear!" • The Third approached the animal, / And happening to take / The squirming trunk within his hands, / Thus boldly up he spake: / "I see," quoth he, "the Elephant / Is very like a snake!" • The Fourth reached out an eager hand, / And felt about the knee: / "What most this wondrous beast is like / Is mighty plain," quoth he; / "'Tis clear enough the Elephant / Is very like a tree!" • The Fifth, who chanced to touch the ear, / Said: "E'en the blindest man / Can tell what this resembles most; / Deny the fact who can, / This marvel of an Elephant / Is very like a fan!" • The Sixth no sooner had begun / About the beast to grope, / Than, seizing on the swinging tail / That fell within his scope. / "I see," quoth he, "the Elephant / Is very like a rope!" • And so these men of Indostan / Disputed loud and long, / Each in his own opinion / Exceeding stiff and strong, / Though each was partly in the right, / And all were in the wrong! (Source: John Godfrey Saxe's (1816-1887) version of the famous Indian legend)
  20. No universal conception of data quality exists; instead many differing perspectives compete. • Problem: – Most organizations approach data quality problems in the same way that the blind men approached the elephant: people tend to see only the data that is in front of them – Little cooperation across boundaries, just as the blind men were unable to convey their impressions about the elephant to recognize the entire entity – Leads to confusion, disputes and narrow views • Solution: – Data quality engineering can help achieve a more complete picture and facilitate cross-boundary communications
  21. Structured Data Quality Engineering 1. Allow the form of the problem to guide the form of the solution 2. Provide a means of decomposing the problem 3. Feature a variety of tools simplifying system understanding 4. Offer a set of strategies for evolving a design solution 5. Provide criteria for evaluating the quality of the various solutions 6. Facilitate development of a framework for developing organizational knowledge
  22. Polling Question #1 • Does your organization address or plan to address data/information quality issues? • Responses – A. We did last year (2012) – B. We are this year (2013) – C. We will next year (2014) – D. We hope to next year (2014)
  23. Outline (repeated from slide 5). Tweeting now: #dataed
  24. Infamous Data Quality Example: Mizuho Securities • Wanted to sell 1 share for 600,000 yen • Sold 600,000 shares for 1 yen • $347 million loss • In-house system did not have limit checking • Tokyo stock exchange system did not have limit checking ... • … and doesn't allow order cancellations. Press excerpt: CLUMSY typing cost a Japanese bank at least £128 million and staff their Christmas bonuses yesterday, after a trader mistakenly sold 600,000 more shares than he should have. The trader at Mizuho Securities, who has not been named, fell foul of what is known in financial circles as “fat finger syndrome” where a dealer types incorrect details into his computer. He wanted to sell one share in a new telecoms company called J Com, for 600,000 yen (about £3,000).
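The slide attributes the loss to missing limit checking. The sketch below shows what such a check might look like; it is not the logic of any real trading system. The `SellOrder` type, the `limit_violations` function, the 20% price band, and the 10,000-share cap are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SellOrder:
    shares: int
    price_yen: int


def limit_violations(order, reference_price_yen, max_shares, max_price_deviation=0.2):
    """Return a list of limit-check failures; an empty list means the order passes.

    The thresholds are hypothetical. A real system would derive them from
    market rules and the number of shares actually available to sell.
    """
    problems = []
    if order.shares <= 0 or order.shares > max_shares:
        problems.append("share count outside allowed range")
    lower = reference_price_yen * (1 - max_price_deviation)
    upper = reference_price_yen * (1 + max_price_deviation)
    if not lower <= order.price_yen <= upper:
        problems.append("price outside allowed band around reference price")
    return problems


intended = SellOrder(shares=1, price_yen=600_000)
mistyped = SellOrder(shares=600_000, price_yen=1)

print(limit_violations(intended, reference_price_yen=600_000, max_shares=10_000))
print(limit_violations(mistyped, reference_price_yen=600_000, max_shares=10_000))
```

Under these assumed limits the intended order passes and the transposed one fails both checks, which is the point of validating at entry rather than after execution.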
  25. Four ways to make your data sparkle! 1. Prioritize the task – Cleaning data is costly and time consuming – Identify mission critical/non-mission critical data 2. Involve the data owners – Seek input of business units on what constitutes "dirty" data 3. Keep future data clean – Incorporate processes and technologies that check every zip code and area code 4. Align your staff with business – Align IT staff with business units (Source: CIO, July 1, 2004)
  26. The DQE Cycle • Deming cycle • "Plan-do-study-act" or "plan-do-check-act" 1. Identifying data issues that are critical to the achievement of business objectives 2. Defining business requirements for data quality 3. Identifying key data quality dimensions 4. Defining business rules critical to ensuring high quality data
  27. The DQE Cycle: (1) Plan • Plan for the assessment of the current state and identification of key metrics for measuring quality • The data quality engineering team assesses the scope of known issues – Determining cost and impact – Evaluating alternatives for addressing them
  28. The DQE Cycle: (2) Deploy • Deploy processes for measuring and improving the quality of data: • Data profiling – Institute inspections and monitors to identify data issues when they occur – Fix flawed processes that are the root cause of data errors or correct errors downstream – When it is not possible to correct errors at their source, correct them at their earliest point in the data flow
  29. The DQE Cycle: (3) Monitor • Monitor the quality of data as measured against the defined business rules • If data quality meets defined thresholds for acceptability, the processes are in control and the level of data quality meets the business requirements • If data quality falls below acceptability thresholds, notify data stewards so they can take action during the next stage
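The monitoring step boils down to comparing each measured score with its acceptability threshold. A minimal sketch follows, assuming scores are expressed as fractions between 0 and 1; the measure names, scores, and thresholds are invented for illustration and do not come from the presentation.

```python
def measures_needing_attention(measurements, thresholds):
    """Return the names of measures whose score fell below its threshold.

    Both arguments map a measure name to a fraction between 0 and 1.
    An empty result means every monitored process is in control.
    """
    return sorted(
        name
        for name, score in measurements.items()
        if score < thresholds[name]
    )


# Hypothetical thresholds and one cycle of hypothetical measurements.
thresholds = {"zip_code_validity": 0.98, "customer_name_completeness": 0.95}
measurements = {"zip_code_validity": 0.991, "customer_name_completeness": 0.93}

for name in measures_needing_attention(measurements, thresholds):
    print(f"Notify data stewards: {name} is below its acceptability threshold")
```

Measures that come back from this check are the ones handed to data stewards for the Act stage.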
  30. The DQE Cycle: (4) Act • Act to resolve any identified issues to improve data quality and better meet business expectations • New cycles begin as new data sets come under investigation or as new data quality requirements are identified for existing data sets
  31. DQE Context & Engineering Concepts • Can rules be implemented stating that no data can be corrected unless the source of the error has been discovered and addressed? • All data must be 100% perfect? • Pareto – 80/20 rule – Not all data is of equal importance • Scientific, economic, social, and practical knowledge
  32. Data quality is now acknowledged as a major source of organizational risk by certified risk professionals!
  33. Outline (repeated from slide 5)
  34. Two Distinct Activities Support Quality Data • Data quality best practices depend on both – Practice-oriented activities – Structure-oriented activities. Practice-oriented activities focus on the capture and manipulation of data. Structure-oriented activities focus on the data implementation.
  35. Practice-Oriented Activities • Stem from a failure to apply rigor when capturing/manipulating data, such as: – Edit masking – Range checking of input data – CRC-checking of transmitted data • Affect the Data Value Quality and Data Representation Quality • Examples of improper practice-oriented activities: – Allowing imprecise or incorrect data to be collected when requirements specify otherwise – Presenting data out of sequence • Typically diagnosed in bottom-up manner: find and fix the resulting problem • Addressed by imposing more rigorous data-handling governance
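The three practices named on this slide (edit masking, range checking, CRC checking) can each be sketched in a couple of lines. The function names and the example mask are assumptions made for illustration; `re.fullmatch` and `zlib.crc32` are standard-library calls.

```python
import re
import zlib


def matches_edit_mask(value, mask=r"\d{3}-\d{3}-\d{4}"):
    """Edit masking: accept input only if it fits the expected shape.

    The default mask (a dashed ten-digit phone number) is a hypothetical example.
    """
    return re.fullmatch(mask, value) is not None


def in_range(value, low, high):
    """Range checking: accept a value only if it lies within [low, high]."""
    return low <= value <= high


def crc_ok(payload, expected_crc):
    """CRC checking: recompute the checksum of received bytes and compare it
    with the checksum that the sender computed before transmission."""
    return zlib.crc32(payload) == expected_crc


sent = b"NSN 5305-00-123-4567"
sent_crc = zlib.crc32(sent)

print(matches_edit_mask("804-521-4056"))             # well-formed input
print(matches_edit_mask("8045214056"))               # rejected: dashes missing
print(in_range(7.5, 0, 24))                          # plausible hours in a day
print(crc_ok(sent, sent_crc))                        # payload arrived intact
print(crc_ok(b"NSN 5305-00-123-4568", sent_crc))     # one character corrupted
```

Each check catches a defect at the point of capture or transmission, which is cheaper than finding it downstream.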
  36. Structure-Oriented Activities • Occur because of data and metadata that has been arranged imperfectly. For example: – When the data is in the system but we just can't access it; – When a correct data value is provided as the wrong response to a query; or – When data is not provided because it is unavailable or inaccessible to the customer • Developer focus within system boundaries instead of within organization boundaries • Affect the Data Model Quality and Data Architecture Quality • Examples of improper structure-oriented activities: – Providing a correct response but incomplete data to a query because the user did not comprehend the system data structure – Costly maintenance of inconsistent data used by redundant systems • Typically diagnosed in top-down manner: root cause fixes • Addressed through fundamental data structure governance
  37. Quality Dimensions
  38. A congratulations letter from another bank. Problems • Bank did not know it made an error • Tools alone could not have prevented this error • Lost confidence in the ability of the bank to manage customer funds
  39. 4 Dimensions of Data Quality. An organization’s overall data quality is a function of four distinct components, each with its own attributes: • Data Value (practice-oriented): the quality of data as stored & maintained in the system • Data Representation (practice-oriented): the quality of representation for stored values; perfect data values stored in a system that are inappropriately represented can be harmful • Data Model (structure-oriented): the quality of data logically representing user requirements related to data entities, associated attributes, and their relationships; essential for effective communication among data suppliers and consumers • Data Architecture (structure-oriented): the coordination of data management activities in cross-functional system development and operations
  40. Effective Data Quality Engineering. Diagram, from closer to the user to closer to the architect: Data Representation Quality (as presented to the user); Data Value Quality (as maintained in the system); Data Model Quality (as understood by developers); Data Architecture Quality (as an organizational asset). • Data quality engineering has been focused on operational problem correction – Directing attention to practice-oriented data imperfections • Data quality engineering is more effective when also focused on structure-oriented causes – Ensuring the quality of shared data across system boundaries
  41. Full Set of Data Quality Attributes
  42. Difficult to obtain leverage at the bottom of the falls
  43. Frozen Falls
  44. New York Turns to Big Data to Solve Big Tree Problem • NYC – 2,500,000 trees • 11 months from 2009 to 2010 – 4 people were killed or seriously injured by falling tree limbs in Central Park alone • Belief – Arborists believe that pruning and otherwise maintaining trees can keep them healthier and make them more likely to withstand a storm, decreasing the likelihood of property damage, injuries and deaths • Until recently – No research or data to back it up
  45. NYC's Big Tree Problem • Question – Does pruning trees in one year reduce the number of hazardous tree conditions in the following year? • Lots of data but granularity challenges – Pruning data recorded block by block – Cleanup data recorded at the address level – Trees have no unique identifiers • After downloading, cleaning, merging, analyzing and intensive modeling – Pruning trees for certain types of hazards caused a 22 percent reduction in the number of times the department had to send a crew for emergency cleanups • The best data analysis – Generates further questions • NYC cannot prune each block every year – Building block risk profiles: number of trees, types of trees, whether the block is in a flood zone or storm zone
  46. Outline (repeated from slide 5)
  47. Letter from the Bank: "… so please continue to open your mail from either Chase or Bank One. P.S. Please be on the lookout for any upcoming communications from either Chase or Bank One regarding your Bank One credit card and any other Bank One product you may have." Problems • I initially discarded the letter! • I became upset after reading it • It proclaimed that Chase has data quality challenges
  48. Polling Question #2 • Does your organization utilize a structured or formal approach to information quality? • A. Yes • B. They say they are but they aren't • C. No
  49. Outline (repeated from slide 5)
  50. Traditional Quality Life Cycle [diagram]: Data acquisition activities; Data storage; Data usage activities
  51. Data Life Cycle Model: Products [diagram]. Activities: Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Storage, Data Manipulation, Data Utilization, Data Assessment, Data Refinement. Products flowing between activities: data architecture & models; populated data models and storage locations; data values; value defects; structure defects; architecture refinements; model refinements; restored data
  52. Data Life Cycle Model: Quality Focus [diagram]. Activities: Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Storage, Data Manipulation, Data Utilization, Data Assessment, Data Refinement. Quality labels: architecture quality; data architecture & model quality; model quality; value quality; representation quality. Other labels: populated data models and storage locations; data values; restored data
  53. Extended data life cycle model with metadata sources and uses [diagram]. Entry points: starting point for new system development; starting point for existing systems. Activities: Metadata Creation • Define Data Architecture • Define Data Model Structures. Metadata Structuring • Implement Data Model Views • Populate Data Model Views. Metadata Refinement • Correct Structural Defects • Update Implementation. Data Creation • Create Data • Verify Data Values. Data Manipulation • Manipulate Data • Update Data. Data Utilization • Inspect Data • Present Data. Data Assessment • Assess Data Values • Assess Metadata. Data Refinement • Correct Data Value Defects • Re-store Data Values. Store: Metadata & Data Storage. Flow labels: data performance metadata; data architecture; data architecture and data models; shared data; updated data; corrected data; architecture refinements; facts & meanings
  54. Polling Question #3 • Do you use metadata models, modeling tools, or profiling to support your information quality efforts? • A. Yes • B. No
  55. Outline (repeated from slide 5)
  56. Profile, Analyze and Assess DQ • Data assessment using 2 different approaches: – Bottom-up – Top-down • Bottom-up assessment: – Inspection and evaluation of the data sets – Highlight potential issues based on the results of automated processes • Top-down assessment: – Engage business users to document their business processes and the corresponding critical data dependencies – Understand how their processes consume data and which data elements are critical to the success of the business applications
  57. Define DQ Measures • Measures development occurs as part of the strategy/design/plan step • Process for defining data quality measures: 1. Select one of the identified critical business impacts 2. Evaluate the dependent data elements, create and update processes associated with that business impact 3. List any associated data requirements 4. Specify the associated dimension of data quality and one or more business rules to use to determine conformance of the data to expectations 5. Describe the process for measuring conformance 6. Specify an acceptability threshold
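Steps 4 to 6 above (dimension, business rule, conformance measurement, acceptability threshold) map naturally onto a small data structure. The sketch below is one possible shape, not a prescribed one: the `DataQualityMeasure` class, the five-digit zip code rule, and the 95% threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataQualityMeasure:
    data_element: str                      # which element the measure covers
    dimension: str                         # e.g. accuracy, completeness
    business_rule: Callable[[dict], bool]  # True when a record conforms
    acceptability_threshold: float         # minimum acceptable conformance (0..1)

    def conformance(self, records):
        """Fraction of records that satisfy the business rule."""
        if not records:
            return 1.0  # design choice: an empty data set has nothing to violate
        passed = sum(1 for record in records if self.business_rule(record))
        return passed / len(records)

    def is_acceptable(self, records):
        return self.conformance(records) >= self.acceptability_threshold


# Hypothetical rule: a zip code must be exactly five digits.
zip_measure = DataQualityMeasure(
    data_element="zip",
    dimension="accuracy",
    business_rule=lambda r: len(r.get("zip", "")) == 5 and r["zip"].isdigit(),
    acceptability_threshold=0.95,
)

records = [{"zip": "23060"}, {"zip": "2306"}, {"zip": "10001"}, {"zip": "90210"}]
print(zip_measure.conformance(records))    # 3 of 4 records conform
print(zip_measure.is_acceptable(records))  # below the assumed 95% threshold
```

Keeping the rule, the dimension, and the threshold together in one object makes it straightforward to report each measure against the business impact it was defined for.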
  58. Set and Evaluate DQ Service Levels • Data quality inspection and monitoring are used to measure and monitor compliance with defined data quality rules • Data quality SLAs specify the organization’s expectations for response and remediation • Operational data quality control defined in data quality SLAs includes: – Data elements covered by the agreement – Business impacts associated with data flaws – Data quality dimensions associated with each data element – Quality expectations for each data element of the identified dimensions in each application or system in the value chain – Methods for measuring against those expectations – (…)
  59. Measure, Monitor & Manage DQ • DQM procedures depend on available data quality measuring and monitoring services • 2 contexts for control/measurement of conformance to data quality business rules exist: – In-stream: collect in-stream measurements while creating data – In batch: perform batch activities on collections of data instances assembled in a data set • Apply measurements at 3 levels of granularity: – Data element value – Data instance or record – Data set
  60. Overview: Data Quality Tools. 4 categories of activities: 1) Analysis 2) Cleansing 3) Enhancement 4) Monitoring. Principal tools: 1) Data Profiling 2) Parsing and Standardization 3) Data Transformation 4) Identity Resolution and Matching 5) Enhancement 6) Reporting
  61. DQ Tool #1: Data Profiling • Data profiling is the assessment of value distribution and clustering of values into domains • Need to be able to distinguish between good and bad data before making any improvements • Data profiling is a set of algorithms for 2 purposes: – Statistical analysis and assessment of the data quality values within a data set – Exploring relationships that exist between value collections within and across data sets • At its most advanced, data profiling takes a series of prescribed rules from data quality engines. It then assesses the data, annotates and tracks violations to determine if they comprise new or inferred data quality rules
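The first purpose listed above, statistical assessment of the values in a data set, can be illustrated with a very small column profiler. This is a sketch using only the standard library; the `profile_column` function and the "shape" notation (letters become `A`, digits become `9`) are illustrative conventions, not features of any particular profiling tool.

```python
import re
from collections import Counter


def profile_column(values):
    """Summarize one column: counts, nulls, distinct values, and value shapes."""
    non_null = [v for v in values if v not in (None, "")]

    def shape(value):
        # Replace letters first, then digits, so "2306O" becomes "9999A".
        lettered = re.sub(r"[A-Za-z]", "A", str(value))
        return re.sub(r"\d", "9", lettered)

    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "value_counts": Counter(non_null),
        "pattern_counts": Counter(shape(v) for v in non_null),
    }


# Hypothetical zip code column with one mistyped value and two missing ones.
zip_codes = ["23060", "23060", "2306O", None, ""]
profile = profile_column(zip_codes)

print(profile["null_count"])
print(profile["pattern_counts"].most_common())
```

A rare shape such as `9999A` in a column dominated by `99999` is the kind of signal that lets an analyst separate good data from bad before attempting any correction.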
  62. DQ Tool #1: Data Profiling, cont’d • Data profiling vs. data quality: business context and semantic/logical layers – Data quality is concerned with proscriptive rules – Data profiling looks for patterns when rules are adhered to and when rules are violated; able to provide input into the business context layer • It is incumbent that data profiling services notify all concerned parties of whatever is discovered • Profiling can be used to… – …notify the help desk that valid changes in the data are about to cause an avalanche of “skeptical user” calls – …notify business analysts of precisely where they should be working today in terms of shifts in the data
  63. Courtesy GlobalID.com
  64. DQ Tool #2: Parsing & Standardization • Data parsing tools enable the definition of patterns that feed into a rules engine used to distinguish between valid and invalid data values • Actions are triggered upon matching a specific pattern • When an invalid pattern is recognized, the application may attempt to transform the invalid value into one that meets expectations • Data standardization is the process of conforming to a set of business rules and formats that are set up by data stewards and administrators • Data standardization example: – Bringing all the different formats of “street” into a single format, e.g. “STR”, “ST.”, “STRT”, “STREET”, etc.
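The "street" example can be sketched as a token-level lookup. The variant set comes from the slide; the function name, the choice of `ST` as the target form, and the upper-casing of the whole address are assumptions. Note that this naive version would also rewrite "ST." in a name like "ST. LOUIS", which a real rule set would need to handle.

```python
# Variants listed on the slide, with trailing punctuation removed.
STREET_VARIANTS = {"STR", "ST", "STRT", "STREET"}


def standardize_street(address, target="ST"):
    """Upper-case an address and replace every street variant with one form.

    Assumption: tokens are separated by whitespace, and a trailing period
    or comma on a street token can be dropped.
    """
    standardized = []
    for token in address.upper().split():
        bare = token.rstrip(".,")
        standardized.append(target if bare in STREET_VARIANTS else token)
    return " ".join(standardized)


for raw in ["10124 W. Broad Street", "1 Main Str.", "5 Oak Strt", "7 Elm St."]:
    print(f"{raw!r} -> {standardize_street(raw)!r}")
```

The lookup table plays the role of the business rules that data stewards maintain; adding a new variant is a data change, not a code change.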
  65. DQ Tool #3: Data Transformation • Upon identification of data errors, trigger data rules to transform the flawed data • Perform standardization and guide rule-based transformations by mapping data values in their original formats and patterns into a target representation • Parsed components of a pattern are subjected to rearrangement, corrections, or any changes as directed by the rules in the knowledge base
  66. DQ Tool #4: Identity Resolution & Matching • Data matching enables analysts to identify relationships between records for de-duplication or group-based processing • Matching is central to maintaining data consistency and integrity throughout the enterprise • The matching process should be used in the initial migration of data into a single repository • 2 basic approaches to matching: • Deterministic – Relies on defined patterns/rules for assigning weights and scores to determine similarity – Predictable – Dependent on rule developers' anticipations • Probabilistic – Relies on statistical techniques for assessing the probability that any pair of records represents the same entity – Not reliant on rules – Probabilities can be refined based on experience -> matchers can improve precision as more data is analyzed
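The two approaches can be contrasted in a short sketch. The deterministic scorer adds hand-assigned weights for fields that agree. For the probabilistic side, the slide does not name a technique, so the example swaps in Fellegi-Sunter-style agreement weights as one common choice: each field has a probability `m` of agreeing when two records are the same entity and `u` of agreeing by chance. All weights, probabilities, and records below are hypothetical.

```python
import math


def deterministic_score(a, b, weights):
    """Sum the rule-assigned weights of every field on which two records agree."""
    return sum(w for field, w in weights.items() if a.get(field) == b.get(field))


def probabilistic_weight(a, b, m_u):
    """Fellegi-Sunter-style log-likelihood weight for a record pair.

    m_u maps each field to (m, u): the assumed probability that the field
    agrees for true matches and for non-matches respectively.
    """
    total = 0.0
    for field, (m, u) in m_u.items():
        if a.get(field) == b.get(field):
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


rec_a = {"name": "ACME CORP", "zip": "23060", "phone": "8045214056"}
rec_b = {"name": "ACME CORP", "zip": "23060", "phone": "8045210000"}
rec_c = {"name": "ZENITH LLC", "zip": "10001", "phone": "2125550000"}

weights = {"name": 5, "zip": 2, "phone": 3}
m_u = {"name": (0.9, 0.01), "zip": (0.9, 0.1), "phone": (0.8, 0.001)}

print(deterministic_score(rec_a, rec_b, weights))
print(round(probabilistic_weight(rec_a, rec_b, m_u), 2))
print(round(probabilistic_weight(rec_a, rec_c, m_u), 2))
```

In the probabilistic version the `m` and `u` values can be re-estimated as more labeled pairs accumulate, which is how such a matcher improves with experience, whereas the deterministic weights change only when someone edits the rules.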
  67. DQ Tool #5: Enhancement • Definition: – A method for adding value to information by accumulating additional information about a base set of entities and then merging all the sets of information to provide a focused view. Improves master data. • Benefits: – Enables use of third party data sources – Allows you to take advantage of the information and research carried out by external data vendors to make data more meaningful and useful • Examples of data enhancements: – Time/date stamps – Auditing information – Contextual information – Geographic information – Demographic information – Psychographic information
  68. DQ Tool #6: Reporting • Good reporting supports: – Inspection and monitoring of conformance to data quality expectations – Monitoring performance of data stewards conforming to data quality SLAs – Workflow processing for data quality incidents – Manual oversight of data cleansing and correction • Data quality tools provide dynamic reporting and monitoring capabilities • Enables analysts and data stewards to support and drive the methodology for ongoing DQM and improvement with a single, easy-to-use solution • Associate report results with: – Data quality measurement – Metrics – Activity
  69. Outline (repeated from slide 5)
  70. Overview: DQE Concepts and Activities • Develop and promote data quality awareness • Define data quality requirements • Profile, analyze and assess data quality • Define data quality metrics • Define data quality business rules • Test and validate data quality requirements • Set and evaluate data quality service levels • Measure and monitor data quality • Manage data quality issues • Clean and correct data quality defects • Design and implement operational DQM procedures • Monitor operational DQM procedures and performance
  71. Concepts and Activities • Data quality expectations provide the inputs necessary to define the data quality framework: – Requirements – Inspection policies – Measures, and monitors that reflect changes in data quality and performance • The data quality framework requirements reflect 3 aspects of business data expectations 1. A manner to record the expectation in business rules 2. A way to measure the quality of data within that dimension 3. An acceptability threshold (from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International)
  72. Summary: Data Quality Engineering. 1/26/2010 © Copyright this and previous years by Data Blueprint - all rights reserved!
  73. Questions? It’s your turn! Use the chat feature or Twitter (#dataed) to submit your questions to Peter now.
  74. Upcoming Events • Data Systems Integration & Business Value Pt. 1: Metadata. July 9, 2013 @ 2:00 PM ET/11:00 AM PT • Data Systems Integration & Business Value Pt. 2: Cloud. August 13, 2013 @ 2:00 PM ET/11:00 AM PT • Sign up: www.dataversity.net
  75. References & Recommended Reading • The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International
  76. Data Quality Dimensions
  77. Data Value Quality
  78. Data Representation Quality
  79. Data Model Quality
  80. Data Architecture Quality
  81. 10124 W. Broad Street, Suite C, Glen Allen, Virginia 23060. 804.521.4056