Proposal for Data Quality Audit Solutions
Prepared for IDEA
April 2013
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
© 2004 by Third Eye Consulting LLC

All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of Third Eye Consulting LLC.
   	
  
 
Table of Contents

INTRODUCTION
SCOPE
    In Scope
    Out of Scope
ASSUMPTIONS
METHODOLOGY
ARCHITECTURAL OVERVIEW
BENEFITS
APPENDIX A: DQA METHODOLOGY FLOWCHART
APPENDIX B: ARCHITECTURE
  
	
   	
  
 
INTRODUCTION	
  
	
  
Third Eye Consulting LLC (henceforth referred to as “TEC” in this document) is pleased to present this initial draft proposal for building a scalable and cost-effective Data Quality Audit Solution leveraging state-of-the-art Open Source Big Data technology.

Third Eye Consulting LLC is a Big Data consulting firm that has successfully applied Big Data technologies to various applications that were previously deployed using traditional licensed tools, and has helped deliver high value to clients while realizing optimal cost-benefits.
  
SCOPE	
  
	
  
This initial draft proposal is based on a few assumptions drawn from preliminary conversations around the strategic need for Data Quality Audit (henceforth referred to as “DQA” in this document) solutions for IDEA.
  	
  
	
  
In Scope

Per the conversation, IDEA’s strategic needs are broadly interpreted as:
• Capability to perform audits on several million product codes and associated data elements in the data flow.
• Scorecarding and flagging poor-quality data in the absence of data governance and business rules defining the semantics of the data.
  
Out of Scope

Data Cleaning or Data Correction is not a part of this proposal.
  
ASSUMPTIONS	
  
	
  
Standard assumptions made in this initial draft are:
1. The data set is made available on IDEA’s servers.
2. The configuration of the servers (sandbox) for implementing the DQA framework/capabilities will conform to TEC’s recommendations.
3. The TEC team will have remote access and privileges to the DQA server, as per documented requests for such privileges, to install software, execute software processes, etc.
4. In the context of the strategic needs described under In Scope, no other assumptions regarding the data (e.g. its structure) are made in this initial draft, nor are any required.
  
METHODOLOGY	
  
	
  
TEC’s expertise and experience lie in implementing cost-effective solutions that deliver scalable, sustainable, and high value to its customers. TEC will leverage open source and big data solutions to implement a state-of-the-art Data Quality Audit framework that uses statistical algorithms to identify data outliers, pattern matching, etc., in addition to rudimentary rules like “missing data”.
  
	
  
The TEC DQA methodology will set up a repeatable Agile process that scales to handle not just data volumes but also varying data formats, dynamically changing business rules, and the supporting infrastructure. It recognizes the challenges posed by a lack of data governance, or by dependence on external data and the lack of insight into it, while progressively keeping costs flat or lower relative to other alternatives.
  
	
  
The flowchart in Appendix A illustrates the Agile methodology for implementing a repeatable DQA process.
  
	
  
The box “Extrapolate DQA Rules” in the flowchart is the step where the TEC team will attempt to identify “occurrences” of data that can potentially be inferred as “bad data”, e.g. special characters in product name attributes, missing data, or skewed data in date fields (for example, the year 1000).
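The kinds of occurrences named above (special characters in a product name, missing values, implausible years in a date field) can be sketched as simple predicate functions. This is a minimal illustration in Python, not TEC's implementation; the function names, the record fields `product_name` and `created`, the allowed-character pattern, and the 1900 year floor are all assumptions made for the example.

```python
import re
from datetime import date

# Assumed whitelist: letters, digits, spaces and a few common punctuation marks.
_ALLOWED_NAME = re.compile(r"^[A-Za-z0-9 .,&'()/-]+$")

def is_missing(value):
    """True when a value is absent or blank."""
    return value is None or str(value).strip() == ""

def has_special_characters(name):
    """True when a non-missing name contains characters outside the assumed whitelist."""
    return not is_missing(name) and _ALLOWED_NAME.match(name) is None

def has_implausible_year(d, min_year=1900):
    """True when a date's year falls outside an assumed plausible range."""
    return d is not None and not (min_year <= d.year <= date.today().year + 1)

def candidate_issues(record):
    """Collect human-readable descriptions of candidate bad-data occurrences."""
    issues = []
    if is_missing(record.get("product_name")):
        issues.append("missing product_name")
    elif has_special_characters(record["product_name"]):
        issues.append("special characters in product_name")
    if has_implausible_year(record.get("created")):
        issues.append("implausible year in created")
    return issues
```

Findings like these would only be candidates; per the flowchart, they become rules after review and customer acceptance.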
  
	
  
After review and customer acceptance, these rules will be plugged into, or designed and coded into, the framework, which will leverage the technical capabilities of big data to process large amounts of data.
  
	
  
The rules will be generic and designed to scale across multiple data elements wherever applicable and possible.
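One way to read "generic rules that scale across multiple data elements" is to keep each rule independent of any particular field and bind it to fields through configuration. The sketch below assumes that reading; the rule names, field names, and the shape of the returned metrics are hypothetical.

```python
from collections import Counter

# Generic rules: each takes a single value and returns True when the value fails.
RULES = {
    "missing": lambda v: v is None or str(v).strip() == "",
    "non_printable": lambda v: isinstance(v, str) and not v.isprintable(),
}

# Hypothetical binding of rules to data elements; one rule can cover many fields.
BINDINGS = {
    "product_code": ["missing"],
    "product_name": ["missing", "non_printable"],
    "supplier": ["missing"],
}

def audit(records, bindings=BINDINGS, rules=RULES):
    """Return (flags, metrics): per-record failures and failure rates per field and rule."""
    flags = []
    failures = Counter()
    for index, record in enumerate(records):
        for field, rule_names in bindings.items():
            for rule_name in rule_names:
                if rules[rule_name](record.get(field)):
                    flags.append((index, field, rule_name))
                    failures[(field, rule_name)] += 1
    total = len(records)
    metrics = {key: count / total for key, count in failures.items()} if total else {}
    return flags, metrics
```

Adding a new data element then means adding one entry to the binding table rather than writing a new rule.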
  
ARCHITECTURAL OVERVIEW
  
	
  
The figure in Appendix B depicts a bird’s-eye view of the architecture.

Furthermore, DQ auditing falls under varying degrees of complexity. The audit process will inherently be progressive, starting with a preliminary assessment on a case-by-case basis against datasets.
  
	
  
1. Simple – candidates include data requiring basic checks, e.g. missing data, that can be implemented with SQL capabilities. Such scenarios are, for most purposes, represented by standard technical or sometimes business rules, as in master data matching rules.
  
	
  
2. Medium – candidates can include address quality checks and phone number checks. Most such programs would be readily available in a licensed tool or through third-party plug-ins, e.g. Melissa Data. However, certain non-standard data elements and scenarios are seldom covered by licensed tools and require innovative implementation techniques to be incorporated in the DQA framework. Examples include applying statistical routines (standard deviation, mean, frequency, etc.) to identify outlier data. As a very basic example, a simple spreadsheet graph is presented below. The product code “6000” has a frequency of 10 and appears skewed in relation to the occurrences of all other product codes. In the absence of any definitive master data reference, this product code will be “flagged” as potential bad data.
  	
  
	
  
[Figure: bar chart “Product Code Frequency”. Data labels by product code: 1000: 100, 2000: 100, 3000: 120, 4000: 145, 5000: 122, 6000: 10.]
 
	
  
PS: The rich visualization depicted in the architecture diagram extends to web technologies like HTML5 and SVG, as well as to spreadsheet applications like Microsoft Excel.
  
	
  
3. High – candidates include extrapolating rules across multiple datasets, e.g. identifying a “bad” product code by comparing it against multiple variables, including product code trending, referential associations, machine learning algorithms, etc.
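The first two tiers can be sketched with the frequencies from the product code example above. The sketch uses Python's built-in sqlite3 module as a stand-in for whatever SQL engine the platform provides; the table name `items`, the column name, the three injected missing rows, and the 25%-of-median cutoff are assumptions chosen for illustration, not part of the proposal.

```python
import sqlite3
from statistics import median

# Frequencies taken from the product code example in the text.
EXAMPLE_COUNTS = {"1000": 100, "2000": 100, "3000": 120, "4000": 145, "5000": 122, "6000": 10}

def load_example(conn, missing_rows=3):
    """Create a one-column table holding each code as many times as its frequency."""
    conn.execute("CREATE TABLE items (product_code TEXT)")
    rows = [(code,) for code, n in EXAMPLE_COUNTS.items() for _ in range(n)]
    rows += [(None,)] * missing_rows  # a few rows with no product code (assumed)
    conn.executemany("INSERT INTO items VALUES (?)", rows)

def count_missing(conn):
    """Simple tier: a missing-data check expressed directly in SQL."""
    sql = "SELECT COUNT(*) FROM items WHERE product_code IS NULL OR TRIM(product_code) = ''"
    return conn.execute(sql).fetchone()[0]

def rare_codes(conn, ratio=0.25):
    """Medium tier: flag codes whose frequency is far below the median frequency."""
    sql = ("SELECT product_code, COUNT(*) FROM items "
           "WHERE product_code IS NOT NULL GROUP BY product_code")
    freq = dict(conn.execute(sql).fetchall())
    cutoff = ratio * median(freq.values())
    return sorted(code for code, n in freq.items() if n < cutoff)

conn = sqlite3.connect(":memory:")
load_example(conn)
missing = count_missing(conn)   # rows with no product code
flagged = rare_codes(conn)      # codes that look skewed relative to the rest
```

With the figures above, the median frequency is 110, so the cutoff is 27.5 and only code 6000 is flagged. A median-based cutoff is used here because, with only a handful of codes, a single extreme value inflates the standard deviation; a standard-deviation rule is an equally valid choice on larger samples.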
  
BENEFITS	
  
	
  
TEC’s Agile methodology, coupled with open source big data capabilities, presents the following benefits:
  	
  
	
  
1. Cost – As TEC will use open source technologies, the CAPEX of launching a robust DQA program is largely reduced.

2. While most licensed tools offer out-of-the-box functions for DQA, they often fall short on custom capabilities, or have high costs and offer less transparency into scalability and implementation. TEC will partner closely with IDEA, bringing clear visibility into each step of the process, as depicted in the flowchart. Of course, some out-of-the-box features might still need to be procured. Example: flagging an “address” as bad data would potentially require USPS data validation routines.

3. The open source Big Data analytics and visualization framework can be scaled across other applications, infrastructure, and DQ capabilities while maintaining a low Total Cost of Ownership.

4. The TEC methodology will result in quick wins in much shorter cycles, owing to the Agile engagement, as opposed to going through a full program life cycle to derive the initial results.
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
  
	
   	
  
 
APPENDIX A: DQA METHODOLOGY FLOWCHART

[Figure: DQA methodology flowchart. Step boxes: Receive Data Set; Load into Database; Preliminary Analysis; Extrapolate DQA Rules; Publish Rules; Apply Rule to Data Set; Customer engagement; Rules Extrapolated; Create DQA Rules; Generate DQ Metrics; Flag Bad Data; Load into Publish Ready DQA Database; Load into DQA Publish Ready Rules Database; Analyze Data / Rules; Stop. Decision points, each with YES and NO branches: Customer Accepted?; Re-Assess Scenario?; Customer Validation OK?. The connecting arrows are not recoverable from the extracted text.]
  
 
APPENDIX B: ARCHITECTURE

[Figure: architecture diagram, “Open Source Big Data DQA Platform”. Labels: Web Data Service; Structured Databases; File based data; Rules Repository; DQA Database; Rich Visualization.
• Using capabilities of MapReduce to apply statistical algorithms.
• Apply basic standard rules using a combination of SQL and MapReduce to get the best blend of performance and ease-of-design/build capabilities.
Capability for multi-format data publishing, including files, database, JSON docs, XML, etc.]
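The architecture's note on using MapReduce to apply statistical algorithms relies on statistics that can be computed from partial results and then merged. The sketch below imitates that pattern in plain Python with `map` and `functools.reduce`; it does not use Hadoop or any MapReduce library, and the function names and the partitioning of the data are invented for the example.

```python
from functools import reduce
from math import sqrt

def map_partition(values):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: partial aggregates merge by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_std(partitions):
    """Population mean and standard deviation from merged partial aggregates."""
    n, total, total_sq = reduce(combine, map(map_partition, partitions), (0, 0, 0))
    if n == 0:
        raise ValueError("no data")
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # guard against rounding below zero
    return mean, sqrt(variance)
```

Because the merge is a plain addition, partitions can be processed in any order and on any number of workers. The sum-of-squares formula can lose precision when values are large relative to their spread, so a production version would likely use a more numerically stable merge.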
  

Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Recently uploaded (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

A Machine learning based Data Quality Analysis Approach

Proposal for Data Quality Audit Solutions
Prepared for IDEA
April 2013

© 2004 by Third Eye Consulting LLC
All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of Third Eye Consulting LLC.

Table of Contents

INTRODUCTION
SCOPE
    In Scope
    Out of Scope
ASSUMPTIONS
METHODOLOGY
ARCHITECTURAL OVERVIEW
BENEFITS
APPENDIX A: DQA METHODOLOGY FLOWCHART
APPENDIX B: ARCHITECTURE

INTRODUCTION

Third Eye Consulting LLC (henceforth referred to as "TEC" in this document) is pleased to present this initial draft proposal for building a scalable and cost-effective Data Quality Audit Solution leveraging state-of-the-art Open Source Big Data technology.

Third Eye Consulting LLC is a Big Data consulting firm that has successfully applied Big Data technologies to various applications that were previously deployed using traditional licensed tools, and has helped deliver high value to clients with realization of optimal cost-benefits.

SCOPE

This initial draft proposal is based on a few assumptions drawn from preliminary conversations around the strategic need for Data Quality Audit solutions for IDEA (henceforth referred to as "DQA" in this document).

In Scope

Per the conversation, IDEA's strategic needs are broadly interpreted as:
• Capability to perform an audit on several million product codes and associated data elements in the data flow.
• Scorecarding and flagging poor-quality data in the absence of data governance and business rules defining the semantics of the data.

Out of Scope

Data cleaning or data correction is not a part of this document.

ASSUMPTIONS

Standard assumptions made in this initial draft are:
1. The data set is made available on IDEA's servers.
2. The configuration of the servers (sandbox) for implementing the DQA framework/capabilities will conform to TEC's recommendation.
3. The TEC team will have remote access and privileges to the DQA server, as per documented requests for such privileges, to install software, execute software processes, etc.
4. In the context of the strategic needs described in the preceding section, no other assumptions regarding the data (e.g. structure) are made in this initial draft, nor are any required.

METHODOLOGY

TEC's expertise and experience lie in implementing cost-effective solutions that deliver scalable, sustainable, and high value to its customers. TEC will leverage open source and big data solutions to implement a state-of-the-art Data Quality Audit framework that applies statistical algorithms to identify data outliers, pattern matching, etc., in addition to rudimentary rules like "missing data".

The TEC DQA methodology will set up a repeatable Agile process that scales to handle not just data volumes but also data formats, dynamically changing business rules, and supporting infrastructure. It recognizes the challenges posed by a lack of data governance or a dependence on external data with limited insight into it, while progressively keeping costs flat or lower relative to other alternatives.

The flowchart in Appendix A illustrates the Agile methodology for implementing a repeatable DQA process.

The box "Extrapolate DQA Rules" in the flowchart is the step where the TEC team will attempt to identify "occurrences" of data that can potentially be inferred to be "bad data", e.g. special characters in product name attributes, missing data, or skewed data in date fields (the year 1000, for example).

After review and customer acceptance, these rules will be plugged into, or designed and coded into, the framework, which will leverage the technical capabilities of big data to process large amounts of data.

The rules will be generic and designed to scale across multiple data elements as and where applicable and possible.

ARCHITECTURAL OVERVIEW

The figure in Appendix B depicts a bird's-eye view of the architecture.

Furthermore, DQ auditing falls under varying degrees of complexity. The audit process will inherently be progressive, starting with a preliminary assessment on a case-by-case basis against datasets.

1. Simple – candidates include data requiring basic checks, e.g. missing data, that can be implemented with SQL capabilities. Such scenarios are, for most purposes, represented by standard technical rules or sometimes business rules, as in master data matching rules.

2. Medium – candidates can include address quality checks and phone number checks. Most of these programs would be readily available in a licensed tool or through third-party plug-ins, e.g. Melissa Data.
However, certain non-standard data elements and scenarios are seldom covered by licensed tools and require innovative implementation techniques to be incorporated in the DQA framework. Examples include applying statistical routines to identify outlier data, using standard deviation, mean, frequency, etc. As a very basic example, a simple spreadsheet graph is presented below. The product code "6000" has a frequency of 10 and appears skewed in relation to occurrences of all other product codes. In the absence of any definitive master data reference, this product code will be "flagged" as potential bad data.

[Figure: bar chart "Product Code Frequency", vertical axis 0 to 200]

Product code    Frequency
1000            100
2000            100
3000            120
4000            145
5000            122
6000            10

PS: The rich visualization depicted in the architecture diagram extends to web technologies like HTML5 and SVG, as well as to spreadsheet applications like Microsoft Excel.

3. High – candidates include extrapolating rules across multiple datasets. One such case is identifying a "bad" product code by comparing it against multiple variables, including product code trending, referential associations, machine learning algorithms, etc.

BENEFITS

TEC's Agile methodology coupled with open source big data capabilities presents the following benefits:

1. Cost – As TEC will use open source technologies, the CAPEX of launching a robust DQA program is largely reduced.
2. While most licensed tools offer out-of-the-box functions for DQA, they often fall short on custom capabilities, or have high costs and offer less transparency into scalability and implementation. TEC will closely partner with IDEA, bringing clear visibility into each step of the process as depicted in the flowchart. Of course, some out-of-the-box features might still need to be procured. Example: flagging an "address" as bad data would potentially require USPS data validation routines.
3. The open source Big Data analytics and visualization framework can be scaled across other applications, infrastructure, and DQ capabilities while maintaining a low Total Cost of Ownership.
4. The TEC methodology will result in quick wins in much shorter cycles due to the Agile engagement, as opposed to going through a full program life cycle to derive the initial results.
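
The frequency-based flagging described under the Medium tier of the Architectural Overview can be illustrated with a minimal sketch. It uses the six product code frequencies from the chart and flags any code whose count lies more than a chosen number of standard deviations from the mean. The function name `flag_skewed_codes` and the threshold of 2.0 are assumptions made for illustration; the proposal does not specify a particular statistic or cutoff.

```python
from statistics import mean, pstdev


def flag_skewed_codes(frequencies, z_threshold=2.0):
    """Return product codes whose frequency deviates strongly from the mean.

    `frequencies` maps product code -> occurrence count.
    `z_threshold` is an illustrative cutoff, not a value from the proposal.
    """
    counts = list(frequencies.values())
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation
    if sigma == 0:
        return []  # all codes occur equally often; nothing stands out
    return [
        code
        for code, count in frequencies.items()
        if abs(count - mu) / sigma > z_threshold
    ]


# Frequencies taken from the "Product Code Frequency" chart.
product_code_frequency = {
    "1000": 100,
    "2000": 100,
    "3000": 120,
    "4000": 145,
    "5000": 122,
    "6000": 10,
}

flagged = flag_skewed_codes(product_code_frequency)
print(flagged)
```

With so few data points a z-score cutoff is sensitive to the choice of threshold, so a real audit would likely tune it per data element or use a more robust statistic.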

APPENDIX A: DQA METHODOLOGY FLOWCHART

[Flowchart: node labels listed in reading order]

Receive Data Set; Load into Database; Preliminary Analysis; Extrapolate DQA Rules; Publish Rules; Customer Accepted? (YES/NO); Apply Rule to Data Set; Customer engagement; Rules Extrapolated; Create DQA Rules; Re-Assess Scenario? (YES/NO); Generate DQ Metrics; Flag Bad Data; Load into Publish Ready DQA Database; Stop; Load into DQA Publish Ready Rules Database; Analyze Data / Rules; Customer Validation OK? (YES/NO)
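
The "Apply Rule to Data Set", "Flag Bad Data", and "Generate DQ Metrics" steps could be sketched as follows. This is only an illustration under stated assumptions: the field names (`product_code`, `product_name`, `effective_date`), the rule names, the allowed character set, and the minimum plausible year of 1900 are all hypothetical, chosen to mirror the examples of "bad data" given in the Methodology section (missing data, special characters in product names, and a year such as 1000 in a date field).

```python
import re
from collections import Counter
from datetime import date

# Hypothetical allow-list: letters, digits, spaces, and hyphens.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9 \-]")


def rule_missing_data(record):
    """Flag records where any field is None or blank."""
    return any(v is None or str(v).strip() == "" for v in record.values())


def rule_special_characters(record):
    """Flag product names containing characters outside the allow-list."""
    return bool(SPECIAL_CHARS.search(record.get("product_name") or ""))


def rule_implausible_year(record, min_year=1900):
    """Flag dates earlier than an assumed minimum plausible year."""
    d = record.get("effective_date")
    return d is not None and d.year < min_year


RULES = {
    "missing_data": rule_missing_data,
    "special_characters": rule_special_characters,
    "implausible_year": rule_implausible_year,
}


def audit(records):
    """Apply every rule to every record; return flagged records and metrics."""
    flagged = []
    metrics = Counter()
    for record in records:
        failed = [name for name, rule in RULES.items() if rule(record)]
        if failed:
            flagged.append((record["product_code"], failed))
            metrics.update(failed)
    return flagged, metrics


records = [
    {"product_code": "1000", "product_name": "Widget", "effective_date": date(2012, 5, 1)},
    {"product_code": "2000", "product_name": "Gad#get", "effective_date": date(1000, 1, 1)},
    {"product_code": "3000", "product_name": "", "effective_date": date(2013, 1, 1)},
]

flagged, metrics = audit(records)
print(flagged)
print(dict(metrics))
```

Keeping each rule as a small function in a dictionary means newly accepted rules can be added without changing the audit loop, which fits the stated goal of generic rules that scale across data elements.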

APPENDIX B: ARCHITECTURE

[Architecture diagram: Open Source Big Data DQA Platform]

Components shown: Rules Repository; DQA Database; Rich Visualization; Web Data Service; Structured Databases; File-based data

Processing notes:
• Use the capabilities of MapReduce to apply statistical algorithms.
• Apply basic standard rules using a combination of SQL and MapReduce to get the best blend of performance and ease-of-design/build capabilities.

Capability for multi-format data publishing, including:
- Files
- Database
- JSON docs
- XML
- Etc.
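
To make the MapReduce note above more concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases used to compute product code frequencies. It imitates the pattern in plain Python and does not use Hadoop or any other MapReduce framework API; the function names and the `<missing>` placeholder key are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map phase: emit (product_code, 1) for each record."""
    code = record.get("product_code")
    yield (code if code else "<missing>", 1)


def shuffle(pairs):
    """Shuffle phase: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the counts for one key."""
    return key, sum(values)


def product_code_frequency(records):
    """Run the three phases in sequence and return code -> frequency."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    return dict(reduce_counts(k, vs) for k, vs in shuffle(mapped).items())


records = [
    {"product_code": "1000"},
    {"product_code": "1000"},
    {"product_code": "6000"},
    {"product_code": None},
]

frequencies = product_code_frequency(records)
print(frequencies)
```

In a distributed setting the map and reduce functions would run in parallel across partitions of the data, and the resulting frequency table could then feed statistical checks or SQL-based rules.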