
A Machine Learning-Based Data Quality Analysis Approach


Proposal for Data Quality Audit Solutions

Prepared for IDEA
April 2013

© 2004 by Third Eye Consulting LLC. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of Third Eye Consulting LLC.
Table of Contents

INTRODUCTION
SCOPE
  In Scope
  Out of Scope
ASSUMPTIONS
METHODOLOGY
ARCHITECTURAL OVERVIEW
BENEFITS
APPENDIX A: DQA METHODOLOGY FLOWCHART
APPENDIX B: ARCHITECTURE
INTRODUCTION

Third Eye Consulting LLC (henceforth referred to as "TEC") is pleased to present this initial draft proposal for building a scalable and cost-effective Data Quality Audit solution leveraging state-of-the-art open source Big Data technology.

TEC is a Big Data consulting firm that has successfully applied Big Data technologies to applications previously deployed with traditional licensed tools, helping clients realize high value at optimal cost.

SCOPE

This initial draft proposal rests on a few assumptions drawn from preliminary conversations about IDEA's strategic need for a Data Quality Audit solution (henceforth referred to as "DQA").

In Scope

Per those conversations, IDEA's strategic needs are broadly interpreted as:
• The capability to audit several million product codes, and the associated data elements in the data flow.
• Scorecarding and flagging of poor-quality data in the absence of data governance and business rules defining the semantics of the data.

Out of Scope

Data cleaning and data correction are not part of this document.

ASSUMPTIONS

Standard assumptions made in this initial draft are:
1. The data set is made available on IDEA's servers.
2. The configuration of the servers (sandbox) for implementing the DQA framework and capabilities will conform to TEC's recommendations.
3. The TEC team will have remote access and privileges on the DQA server, per documented requests for such privileges, to install software, execute software processes, etc.
4. Given the strategic needs described above, no other assumptions regarding the data (e.g., its structure) are made in this initial draft, nor are any required.

METHODOLOGY

TEC's expertise and experience lie in implementing cost-effective solutions that deliver scalable, sustainable, high value to its customers. TEC will leverage open source Big Data solutions to implement a state-of-the-art Data Quality Audit framework that applies statistical algorithms to identify data outliers, perform pattern matching, etc., in addition to rudimentary rules such as "missing data" (a sketch of this rule layer follows below).

The TEC DQA methodology will set up a repeatable Agile process that scales not just to data volumes but also to data formats, accommodating dynamically changing business rules and supporting infrastructure, while recognizing the challenges posed by the lack of data governance and by dependence on external data with little insight into it, and progressively keeping costs flat or lower relative to alternatives.
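To make that rudimentary rule layer concrete, here is a minimal sketch in Python, assuming a pandas DataFrame; the column names (product_code, product_name, ship_date), patterns, and year bounds are purely hypothetical placeholders, not part of the proposal:

```python
import pandas as pd

def audit_rudimentary_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per rule; True marks a suspect record."""
    flags = pd.DataFrame(index=df.index)

    # Rule 1: missing data.
    flags["missing_product_code"] = df["product_code"].isna()

    # Rule 2: pattern matching, e.g. special characters in name attributes.
    # Allowed characters are an illustrative assumption.
    flags["bad_product_name"] = ~df["product_name"].fillna("").str.match(
        r"^[A-Za-z0-9 .\-]+$"
    )

    # Rule 3: skewed dates, e.g. the year 1000 in a date field.
    years = pd.to_datetime(df["ship_date"], errors="coerce").dt.year
    flags["suspect_date"] = years.isna() | (years < 1900) | (years > 2100)

    return flags
```

Each rule is deliberately generic, so the same routine can be pointed at additional columns as they are brought into scope.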
The flowchart in Appendix A illustrates the Agile methodology for implementing a repeatable DQA process.

The "Extrapolate DQA Rules" box in the flowchart is the step where the TEC team will attempt to identify occurrences of data that can be inferred as potentially "bad" data, e.g., special characters in product name attributes, missing data, or skewed values in date fields (the year 1000, for example).

After review and customer acceptance, these rules will be plugged into, or designed and coded into, the framework, which will leverage the technical capabilities of Big Data to process large amounts of data.

The rules will be generic and designed to scale across multiple data elements where applicable and possible.

ARCHITECTURAL OVERVIEW

The figure in Appendix B depicts a bird's-eye view of the architecture.

DQ auditing falls under varying degrees of complexity, so the audit process will inherently be progressive, starting with a preliminary assessment of datasets on a case-by-case basis.

1. Simple – candidates include data requiring basic checks, e.g., missing data, that can be implemented with SQL capabilities. Such scenarios are for the most part represented by standard technical rules or, sometimes, business rules such as master data matching rules.

2. Medium – candidates include address quality checks and phone number checks. Most such routines are readily available in licensed tools or through third-party plug-ins (e.g., Melissa Data). However, certain non-standard data elements and scenarios are seldom covered by licensed tools and require innovative implementation techniques to be incorporated into the DQA framework. Examples include applying statistical routines to identify outlier data: standard deviation, mean, frequency, etc. As a very basic example, consider the spreadsheet chart summarized in the table below. Product code "6000" has a frequency of 10 and appears skewed relative to the occurrences of all other product codes. In the absence of any definitive master data reference, this product code will be flagged as potentially bad data (a check along these lines is sketched after the table).

   Product code:  1000'  2000'  3000'  4000'  5000'  6000'
   Frequency:       100    100    120    145    122     10
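The frequency-skew check in the Medium example can be expressed in a few lines. A minimal sketch using the sample counts from the table above; the two-standard-deviation threshold is chosen purely for illustration:

```python
from statistics import mean, pstdev

# Frequencies from the sample chart above.
freq = {"1000": 100, "2000": 100, "3000": 120, "4000": 145, "5000": 122, "6000": 10}

mu = mean(freq.values())      # 99.5
sigma = pstdev(freq.values())  # population standard deviation, ~42.8

# Flag codes deviating more than 2 sigma from the mean (illustrative threshold).
flagged = {code: n for code, n in freq.items() if abs(n - mu) > 2 * sigma}
print(flagged)  # {'6000': 10} -- matches the skew visible in the chart
```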
PS: The rich visualization depicted in the architecture diagram extends to web technologies such as HTML5 and SVG, as well as to spreadsheet applications such as Microsoft Excel.

3. High – candidates include extrapolating rules across multiple datasets, e.g., identifying a "bad" product code by comparing multiple variables, including product code trending, referential associations, and machine learning algorithms (one possibility is sketched below).
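The proposal does not name a specific machine learning algorithm for the High tier. As one representative possibility, a minimal sketch using scikit-learn's IsolationForest over a hypothetical, entirely synthetic feature matrix:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features, one row per product code: monthly frequency trend,
# count of referential matches, price variance. Values are synthetic
# placeholders for illustration only.
X = np.array([
    [100, 12, 0.1],
    [ 98, 11, 0.2],
    [120, 13, 0.1],
    [145, 12, 0.3],
    [ 10,  1, 4.0],   # the skewed code from the earlier example
])

model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(X)  # -1 marks likely outliers
print(labels)                  # e.g. [ 1  1  1  1 -1]
```

The same pattern generalizes: any set of per-code variables can be stacked into the feature matrix, and records scored -1 feed the flagging step of the DQA process.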
BENEFITS

TEC's Agile methodology, coupled with open source Big Data capabilities, presents the following benefits:

1. Cost – Because TEC will use open source technologies, the CAPEX required to launch a robust DQA program is largely reduced.

2. Transparency – While most licensed tools offer out-of-the-box functions for DQA, they often fall short on custom capabilities, or carry high costs and offer little transparency into scalability and implementation. TEC will partner closely with IDEA, bringing clear visibility into each step of the process depicted in the flowchart. Some out-of-the-box features might still need to be procured; for example, flagging an "address" as bad data would potentially require USPS data validation routines.

3. Scalability – The open source Big Data analytics and visualization framework can be scaled across other applications, infrastructure, and DQ capabilities, while maintaining a low total cost of ownership.

4. Speed – The TEC methodology will produce quick wins in much shorter cycles through Agile engagement, as opposed to going through a full program life cycle to derive initial results.

APPENDIX A: DQA METHODOLOGY FLOWCHART

[Flowchart omitted. Recoverable steps and decision points: Receive Data Set → Load into Database → Preliminary Analysis → Extrapolate DQA Rules → Rules Extrapolated? → Create DQA Rules → Publish Rules → Customer Accepted? → Apply Rules to Data Set → Generate DQ Metrics → Flag Bad Data → Load into Publish-Ready DQA Database → Analyze Data / Rules → Customer Validation OK? → Stop, with "NO" branches looping back through customer engagement and Re-Assess Scenario?]
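As an illustration of the "Generate DQ Metrics" step in the flowchart, a minimal scorecard sketch, assuming the boolean rule flags produced by a routine like the earlier audit sketch; the rule names and sample values are hypothetical:

```python
import pandas as pd

def dq_scorecard(flags: pd.DataFrame) -> pd.Series:
    # Percentage of records failing each rule.
    return (flags.mean() * 100).round(2)

flags = pd.DataFrame({
    "missing_product_code": [False, True, False, False],
    "bad_product_name":     [False, False, True, True],
})
print(dq_scorecard(flags))
# missing_product_code    25.0
# bad_product_name        50.0
```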
APPENDIX B: ARCHITECTURE

[Architecture diagram omitted. Recoverable components of the Open Source Big Data DQA Platform:
- Inputs: web data services, structured databases, file-based data.
- Rules Repository and DQA Database.
- Processing: MapReduce capabilities to apply statistical algorithms; basic standard rules applied using a combination of SQL and MapReduce to get the best blend of performance and ease-of-design/build capabilities.
- Rich visualization.
- Capability for multi-format data publishing, including files, databases, JSON documents, XML, etc.]
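To illustrate the MapReduce note in the diagram, a minimal sketch of the map/shuffle/reduce pattern in plain Python; a real deployment would run on Hadoop or a comparable engine, and the record layout here is hypothetical:

```python
from collections import defaultdict
from itertools import chain

# Map phase: emit (product_code, 1) for every record.
def mapper(record: dict):
    yield record["product_code"], 1

# Reduce phase: sum the counts emitted for each product code.
def reducer(code: str, counts: list) -> tuple:
    return code, sum(counts)

records = [{"product_code": c} for c in ["1000", "6000", "1000", "2000"]]

# Shuffle phase: group mapped pairs by key, as a MapReduce framework would.
groups = defaultdict(list)
for code, n in chain.from_iterable(mapper(r) for r in records):
    groups[code].append(n)

frequencies = dict(reducer(c, ns) for c, ns in groups.items())
print(frequencies)  # {'1000': 2, '6000': 1, '2000': 1}
```

The per-code frequencies produced this way are exactly the input the statistical outlier checks described earlier operate on, which is why the diagram pairs MapReduce with SQL for the rule layer.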
