Data quality - Using sas dataflux in the real world - Shane Gibson - OptimalBI

Overview of using DataFlux to resolve a number of data quality problems on a real-world project.

Published in: Technology, Business

  1. 1. Delivering Data Quality in the real world: A case study using SAS DataFlux
  2. 2. What I Will Cover
        1. What is Data Quality?
        2. What is SAS DataFlux?
        3. The approach we took and why
        4. The things we did and how
        5. Monitoring the results
  3. 3. 1. What is Data Quality
        • Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer.
        • Source: Wikipedia, http://en.wikipedia.org/wiki/Data_quality
        • Joseph Moses Juran (December 24, 1904 – February 28, 2008) was a 20th century management consultant, principally remembered as an evangelist for quality and quality management.
  4. 4. 2. What is SAS DataFlux
        DataFlux provides organisations with the ability to plan and complete data integration, data quality and master data management (MDM) projects, all from a single interface.
        It makes it easier to do:
        • Profiling
        • Standardization
        • Matching
        • Augmentation
        • Business Rules Monitoring
        It's delivered as:
        • Standalone Desktop Client
        • Component of SAS Enterprise Data Integration Server
        • Full DataFlux solution
  5. 5. 2. What is SAS DataFlux
  6. 6. 3. The approach we took and why: Information Governance Hierarchy
        • Board
        • Executive Team
        • Data Governance Committee
        • Data Council
        • Business Data Stewards
        • Technical Data Stewards
  7. 7. Data Governance: From theory to practice
        Zeeman van der Merwe, Manager: Information Integrity and Analysis, ACC
        2010 SUNZ Conference, 16 February 2010
  8. 8. Data Quality Maturity Model
        Levels, from highest to lowest (vertical axis: Trust in Information; horizontal axis: Maturity of Data Governance processes):
        • Optimised: The organisation regularly analyses existing data management processes to determine where changes can deliver improved efficiencies and implements them.
        • Effective: Data Quality is automatically monitored and reported. Reliability and predictability of results is monitored via Six Sigma or equivalent measurement methodology.
        • Managed: The use of the data management processes is required and monitored. All projects and initiatives include data management as a core part of their objectives and deliverables.
        • Defined: The organisation uses a set of defined data management processes, which are published for recommended use.
        • Repeatable: Data management expertise exists internally and there is some ability to duplicate good practices. Key data management individuals are assigned to critical projects to reduce risks and improve results.
        • Unaware: Data management is characterised as ad hoc or chaotic. The organisation depends solely on individuals with no awareness of data management practices, resulting in variable results and no repeatability.
  9. 9. Data Cleansing Business Process
        • Initiate
        • Information Governance Group
        • Prioritise Data Quality Issues
        • Profile Issues
        • Manually or Programmatically update data
        • Update Source System
        • Data Quality Issue
        • Monitoring Scorecards
  10. 10. 4. The things we did and how
        We used DataFlux to:
        • Profile the data
        • Profiled:
          • Phone numbers
          • Customer Attributes
          • Gender
          • Date of Birth
          • Missing Values
          • Addresses
          • Suppliers
          • Customers
          • Locations
  11. 11. 4. The things we did and how: Example
  12. 12. 4. The things we did and how: Profile Data

        Pattern          Count   Percentage
        9999999          12760   51%
        9999999999        2979   12%
        99 9999999        2634   10%
        999999999         2210    9%
        999 9999          1453    6%
        99 999 9999        998    4%
        999 999 9999       605    2%
        99*9999999         493    2%
        999 9999999        297    1%
        99999999999        292    1%
        99999999           101    0%
        9999999 9999        84    0%
        999*9999            71    0%
        999 999999          53    0%
        9999999 999         49    0%
        999 999 999         47    0%
        999999              41    0%
        999 9999 999        36    0%

        Category                                    Count   Percentage
        AREA CODE MISSING                           13476   51%
        INVALID MOBILE NUMBER                         158    1%
        INVALID NUMBER                                212    1%
        INVALID LANDLINE NUMBER, TOO FEW DIGITS       723    3%
        INVALID LANDLINE NUMBER, TOO MANY DIGITS      366    1%
        MOBILE NUMBER                                1744    7%
        MOBILE NUMBER OBSOLETE                        942    4%
        NUMBER OK                                    8324   30%
        ZERO                                          511    2%

        Alpha String   Count
        (NIGHTS            1
        -ROOM              1
        -X                 3
        ACT                1
        COURSE             1
        EX                 7
        EXT                4
        FAX                1
        N/A                2
        SCHOOL             1
        WK                 1
        X                 48
        XT                 7
        XTN                3
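The pattern table above can be approximated with a few lines of Python. This is only a sketch of the general idea of pattern profiling, not how DataFlux implements it: the function names `to_pattern` and `profile_patterns` and the sample values are made up for illustration, and the rule of mapping letters to `A` is an assumption (the slide only shows digits mapped to `9`).

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Replace every digit with '9' and every letter with 'A'; keep other characters."""
    out = []
    for ch in value.strip():
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_patterns(values):
    """Return (pattern, count, percentage) rows, most frequent pattern first."""
    counts = Counter(to_pattern(v) for v in values)
    total = sum(counts.values())
    return [(p, n, round(100 * n / total)) for p, n in counts.most_common()]


# Made-up sample values, not data from the project.
sample = ["4721234", "04 4721234", "3859876", "021 555 1234", "4721234 x12"]

for pattern, count, pct in profile_patterns(sample):
    print(f"{pattern:<15} {count:>3} {pct:>3}%")
```

Grouping by pattern rather than by raw value is what makes the profile readable: tens of thousands of distinct phone numbers collapse into a handful of shapes that can each be given a rule.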
  13. 13. 4. The things we did and how
        We used DataFlux to:
        • Standardise Data
        • Use DataFlux Quality Knowledge Base to:
          • Standardise Person Names
            • Robert, Rob, Bob
          • Standardise Location Names
            • Wellington, WLG, Wgtn
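As a rough illustration of what standardisation does, the sketch below maps known variants to one standard form using plain lookup tables. It is not the DataFlux Quality Knowledge Base API; the dictionaries `NAME_VARIANTS` and `LOCATION_VARIANTS` and the function `standardise` are hypothetical, and choosing "Robert" and "Wellington" as the standard forms is an assumption based on the examples on the slide.

```python
# Hypothetical lookup tables; a real knowledge base is far richer than this.
NAME_VARIANTS = {"rob": "Robert", "bob": "Robert", "robert": "Robert"}
LOCATION_VARIANTS = {
    "wlg": "Wellington",
    "wgtn": "Wellington",
    "wellington": "Wellington",
}


def standardise(value: str, variants: dict) -> str:
    """Return the standard form if the value is a known variant, else the trimmed input."""
    cleaned = value.strip()
    return variants.get(cleaned.lower(), cleaned)


print(standardise(" Bob ", NAME_VARIANTS))
print(standardise("WLG", LOCATION_VARIANTS))
```

Unknown values pass through unchanged, so the step is safe to run before matching without losing information.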
  14. 14. 4. The things we did and how
        We used DataFlux to:
        • Consolidate Data
          • Merge multiple people records
          • Multiple matching rules
          • Needed to be reusable
          • Needed to have logic layers
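One way to picture reusable matching rules applied in layers is sketched below. The slides do not describe the actual rules, so everything here is assumed for illustration: the `Person` fields, the two rules, and the choice of the lowest id as the master record. Each rule is a small function that returns a match key, and records that share a key under any layer end up in the same cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    id: int
    name: str
    dob: str    # ISO date string, empty if unknown
    phone: str  # empty if unknown


# Each rule maps a record to a match key; None means the rule cannot be applied.
def rule_name_dob(p):
    return (p.name.lower(), p.dob) if p.name and p.dob else None


def rule_name_phone(p):
    return (p.name.lower(), p.phone) if p.name and p.phone else None


# Hypothetical layers; every layer is applied and any match links two records.
LAYERS = [rule_name_dob, rule_name_phone]


def consolidate(people, layers=LAYERS):
    """Cluster matching records; returns {master_id: [member ids]}."""
    parent = {p.id: p.id for p in people}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rule in layers:
        first_seen = {}
        for p in people:
            key = rule(p)
            if key is None:
                continue
            if key in first_seen:
                a, b = find(first_seen[key]), find(p.id)
                parent[max(a, b)] = min(a, b)  # lowest id becomes the master
            else:
                first_seen[key] = p.id

    clusters = {}
    for p in people:
        clusters.setdefault(find(p.id), []).append(p.id)
    return clusters


people = [
    Person(1, "Robert Smith", "1980-01-01", "4721234"),
    Person(2, "Robert Smith", "1980-01-01", ""),
    Person(3, "Robert Smith", "", "4721234"),
    Person(4, "Jane Doe", "1975-05-05", "3859876"),
]
print(consolidate(people))
```

Keeping each rule as a separate function is what makes the set reusable: a new layer can be added to `LAYERS` without touching the clustering logic.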
  15. 15. 4. The things we did and how: Logic Layers
  16. 16. 4. The things we did and how
        We used DataFlux to:
        • Programmatically Validate and Augment the Data
          • Validate against external datasets
            • NZ Post PAF
            • LINZ Data
          • Potentially
            • Births, Deaths and Marriages data
            • External Customer Lists
          • Can't find valid Phone number dataset
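Validation against an external dataset can be sketched as a lookup after light normalisation. The example below is an assumption-laden illustration only: the reference rows are made up and merely stand in for an external address file, and the helper names `normalise_address` and `validate_addresses` are hypothetical. Real address validation involves parsing and fuzzy matching that this sketch does not attempt.

```python
def normalise_address(address: str) -> str:
    """Upper-case, drop commas and collapse whitespace so trivial differences do not block a match."""
    return " ".join(address.replace(",", " ").upper().split())


def validate_addresses(addresses, reference):
    """Split addresses into (valid, invalid) by exact lookup in a reference set."""
    ref = {normalise_address(a) for a in reference}
    valid, invalid = [], []
    for a in addresses:
        (valid if normalise_address(a) in ref else invalid).append(a)
    return valid, invalid


# Made-up reference rows standing in for an external dataset.
reference = ["1 Example Street, Wellington", "22 Sample Road, Auckland"]
candidates = ["1 example street  Wellington", "9 Nowhere Lane, Wellington"]

valid, invalid = validate_addresses(candidates, reference)
print("valid:", valid)
print("invalid:", invalid)
```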
  17. 17. 5. Monitoring the Results
        • Typical attributes to measure data quality
          • Accuracy: Are targets defined to measure against?
          • Correctness: Requires something to look up
          • Data Age: Data Quality degrades over time, is that acceptable?
          • Completeness: What are the business rules that define what is acceptable?
          • Relevance: Have you documented how it is used?
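Two of these attributes, completeness and data age, are straightforward to turn into numbers. The sketch below assumes records are dictionaries with a `last_updated` date; the field names, the sample rows and the 365-day threshold are all invented for illustration, and what counts as acceptable is a business rule the slides leave open.

```python
from datetime import date


def completeness(records, field):
    """Share of records (0 to 1) with a non-empty value for the field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def stale_share(records, as_of, max_age_days, field="last_updated"):
    """Share of records (0 to 1) last updated more than max_age_days before as_of."""
    if not records:
        return 0.0
    stale = sum(1 for r in records if (as_of - r[field]).days > max_age_days)
    return stale / len(records)


# Made-up records, not data from the project.
records = [
    {"phone": "4721234", "last_updated": date(2009, 12, 1)},
    {"phone": "", "last_updated": date(2007, 6, 15)},
    {"phone": None, "last_updated": date(2009, 1, 10)},
    {"phone": "3859876", "last_updated": date(2005, 3, 3)},
]

print(f"phone completeness: {completeness(records, 'phone'):.0%}")
print(f"older than a year:  {stale_share(records, date(2010, 2, 16), 365):.0%}")
```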
  18. 18. 5. Monitoring the Results
        • Give the business owners feedback that tells them:
          • If their Data Quality is getting better or worse
          • Who is the business owner who can impact the data quality
          • What they need to change
        • Encourage the business owners to improve the quality of the data
          • Ideally programmatically update the data for them
          • Or use centers of excellence to update data (e.g. Call Centers for Phone numbers)
          • Or provide the business a recommended process to update it
        • Make people accountable for bad data quality!
  19. 19. 5. Monitoring the Results: Customer Records

        Record Type   Count       Percentage
        Duplicates    1,037,964   56.85%
        Master          787,673   43.15%
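The percentages in the Customer Records table follow directly from the two counts. A quick check in Python, using only the figures from the slide (the helper name `share` is just for illustration):

```python
duplicates = 1_037_964
masters = 787_673
total = duplicates + masters  # 1,825,637 customer records in all


def share(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100 * part / whole, 2)


print(f"Duplicates: {share(duplicates, total)}%")  # 56.85%
print(f"Master:     {share(masters, total)}%")     # 43.15%
```

In other words, more than half of the customer records were duplicates of one of the 787,673 master records.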
  20. 20. Data Quality is not a project, it is a never-ending process
  21. 21. The shameless plug!
        • www.optimalBI.com: Delivering Actionable Insight
        • www.saasInct.com: PreBuilt SAS Portlets
        • blog.saasInct.com: Ramblings about SAS
