INTRODUCTION
Third Eye Consulting LLC (henceforth referred to as "TEC" in this document) is pleased to present this initial draft proposal for building a scalable and cost-effective Data Quality Audit Solution leveraging state-of-the-art open source Big Data technology. Third Eye Consulting LLC is a Big Data consulting firm that has successfully applied Big Data technologies to various applications that were previously deployed using traditional licensed tools, and has helped deliver high value to clients with the realization of optimal cost-benefits.
SCOPE
This initial draft proposal is based on a few assumptions drawn from preliminary conversations around the strategic need for Data Quality Audit (henceforth referred to as "DQA" in this document) solutions for IDEA.
In Scope
Per the conversation, IDEA's strategic needs are broadly interpreted as:
• Capability to perform audits on several million product codes and the associated data elements in the data flow.
• Scorecarding and flagging of poor-quality data in the absence of data governance and business rules defining the semantics of the data.
Out of Scope
Data cleaning and data correction are not part of this proposal.
ASSUMPTIONS
Standard assumptions made in this initial draft are:
1. The data set is made available on IDEA's servers.
2. The configuration of the servers (sandbox) for implementing the DQA framework/capabilities will conform to TEC's recommendations.
3. The TEC team will have remote access and privileges on the DQA server, per documented requests for such privileges, to install software, execute software processes, etc.
4. In the context of the strategic needs described in the preceding paragraph, no other assumptions regarding the data (e.g., its structure) are made in this initial draft, nor are they required.
METHODOLOGY
TEC's expertise and experience lie in implementing cost-effective solutions that deliver scalable, sustainable, high-value results to its customers. TEC will leverage open source and big data solutions to implement a state-of-the-art Data Quality Audit framework that uses statistical algorithms to identify data outliers, perform pattern matching, etc., in addition to rudimentary rules such as "missing data".
The TEC DQA methodology will set up a repeatable, Agile process that scales not just to growing data volumes, but also to new data formats, dynamically changing business rules, and the supporting infrastructure. It does so while recognizing the challenges posed by the lack of data governance, and by the dependence on external data with limited insight into it, and while progressively keeping costs flat or lower relative to other alternatives.
The flowchart in Appendix A illustrates the Agile methodology for implementing a repeatable DQA process. The box "Extrapolate DQA Rules" in the flowchart is the step where the TEC team will attempt to identify occurrences of data that can potentially be inferred as "bad data", e.g., special characters in product-name attributes, missing data, or skewed values in date fields (for example, the year 1000). Post review and customer acceptance, these rules will be plugged into, or designed and coded into, the framework, which will leverage the technical capabilities of big data to process large amounts of data. The rules will be generic and designed to scale across multiple data elements as and where applicable and possible.
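For illustration only, the sketch below shows the kind of generic, reusable rules the "Extrapolate DQA Rules" step might produce. It is a minimal Python sketch under stated assumptions: the column names (product_name, order_date), the plausible-year range, and the use of pandas are hypothetical; the actual rules will be extrapolated from IDEA's datasets, reviewed with the customer, and implemented on the big data platform (e.g., as SQL or MapReduce jobs).

```python
# Minimal sketch of extrapolated DQA rules. Column names (product_name,
# order_date) and the plausible-year range are hypothetical assumptions;
# pandas is used only to illustrate the logic. In the framework these rules
# would be reviewed with the customer and run on the big data platform.
import re
import pandas as pd

SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9 \-]")  # characters not expected in a product name
MIN_YEAR, MAX_YEAR = 1900, 2100                 # assumed plausible range for date fields


def flag_special_characters(df: pd.DataFrame, column: str) -> pd.Series:
    """Flag rows whose text contains unexpected special characters."""
    return df[column].fillna("").astype(str).str.contains(SPECIAL_CHARS)


def flag_missing(df: pd.DataFrame, column: str) -> pd.Series:
    """Flag rows with missing or empty values (rudimentary 'missing data' rule)."""
    return df[column].isna() | (df[column].astype(str).str.strip() == "")


def flag_skewed_dates(df: pd.DataFrame, column: str) -> pd.Series:
    """Flag rows whose date is unparseable or outside a plausible range (e.g. year 1000)."""
    years = pd.to_datetime(df[column], errors="coerce").dt.year
    return years.isna() | (years < MIN_YEAR) | (years > MAX_YEAR)


def apply_rules(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Apply a generic set of (rule, column) pairs; rules stay reusable across data elements."""
    flags = pd.DataFrame(index=df.index)
    for name, (rule, column) in rules.items():
        flags[name] = rule(df, column)
    flags["potential_bad_data"] = flags.any(axis=1)
    return flags


if __name__ == "__main__":
    sample = pd.DataFrame({
        "product_name": ["Widget A", "G@dget#1", None],
        "order_date": ["2013-05-01", "1000-01-01", "2012-12-31"],
    })
    rules = {
        "special_chars_in_name": (flag_special_characters, "product_name"),
        "missing_name": (flag_missing, "product_name"),
        "skewed_date": (flag_skewed_dates, "order_date"),
    }
    print(apply_rules(sample, rules))
```

Because each rule is a plain (rule, column) pair, the same check can be registered against any data element where it applies, which is how the framework keeps the rules generic.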
ARCHITECTURAL OVERVIEW
The figure in Appendix B depicts a bird's-eye view of the architecture. Furthermore, DQ auditing falls under varying degrees of complexity; the audit process will inherently be progressive, starting with a preliminary assessment performed on a case-by-case basis against the datasets:
1. Simple – candidates include data requiring basic checks (e.g., missing data) that can be implemented with SQL capabilities. Such scenarios, for the most part, are represented by standard technical or, sometimes, business rules, as in master data matching rules.
2. Medium – candidates can include address quality checks and phone number checks. Most of these routines are readily available in licensed tools or through third-party plug-ins, e.g., Melissa Data. However, certain non-standard data elements and scenarios are seldom offered by licensed tools and require innovative implementation techniques to be incorporated into the DQA framework. Examples include applying statistical routines to identify outlier data, using standard deviation, mean, frequency, etc.
As a very basic example, a simple spreadsheet graph is presented below. The product code "6000" has a frequency of 10 and appears skewed relative to the occurrences of all other product codes. In the absence of any definitive master data reference, this product code will be "flagged" as potential bad data (a minimal code sketch of this check follows the list below).
[Chart: Product Code Frequency]
Product Code   Frequency
1000           100
2000           100
3000           120
4000           145
5000           122
6000           10
PS: The rich visualization depicted in the architecture diagram extends to web technologies like HTML5 and SVG, as well as to spreadsheet applications like Microsoft Excel.
3. High – candidates include extrapolating rules across multiple datasets; one such example is identifying a "bad" product code by comparing it against multiple variables, including product code trending, referential associations, machine learning algorithms, etc.
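As referenced above, the following is a minimal sketch of the frequency-based outlier check illustrated by the chart. It is plain Python for illustration only: the frequencies are the sample values from the chart, and the z-score threshold is an assumed value; in the actual DQA framework this statistical routine would run on the big data platform over the full set of product codes.

```python
# Minimal sketch of the frequency-based outlier check illustrated by the chart
# above. The frequencies are the sample values from the chart; the z-score
# threshold of 1.5 is an assumption for illustration only.
import statistics

# Sample data from the example chart (product code -> frequency).
frequencies = {
    "1000": 100,
    "2000": 100,
    "3000": 120,
    "4000": 145,
    "5000": 122,
    "6000": 10,
}


def flag_frequency_outliers(freqs: dict, z_threshold: float = 1.5) -> list:
    """Return product codes whose frequency deviates sharply from the mean."""
    values = list(freqs.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [code for code, count in freqs.items()
            if abs(count - mean) / stdev > z_threshold]


if __name__ == "__main__":
    # With this sample, only product code "6000" (frequency 10) is flagged.
    print(flag_frequency_outliers(frequencies))
```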
BENEFITS
TEC's Agile methodology, coupled with open source big data capabilities, presents the following benefits:
1. Cost – As TEC will use open source technologies, the CAPEX required to launch a robust DQA program is largely reduced.
2. While most licensed tools offer out-of-the-box functions for DQA, they often fall short on custom capabilities, or carry high costs and offer little transparency into scalability and implementation. TEC will closely partner with IDEA, bringing clear visibility into each step of the process, as depicted in the flowchart. Of course, some out-of-the-box features might still need to be procured; for example, flagging an "address" as bad data would potentially require USPS data validation routines.
3. The open source Big Data analytics and visualization framework can be scaled across other applications, infrastructure, and DQ capabilities, while maintaining a low Total Cost of Ownership.
4. The TEC methodology will deliver quick wins in much shorter cycles, owing to the Agile engagement, as opposed to going through a full program life cycle to derive the initial results.
APPENDIX A: DQA METHODOLOGY FLOWCHART
[Flowchart: the repeatable DQA process. Nodes and decision points include: Receive Data Set; Load into Database; Preliminary Analysis; Extrapolate DQA Rules; Create DQA Rules; Publish Rules; Customer Accepted? (YES/NO, with a customer engagement and Re-Assess Scenario? loop); Apply Rules to Data Set; Analyze Data / Rules; Generate DQ Metrics; Flag Bad Data; Load into Publish-Ready DQA Database; Load into Publish-Ready DQA Rules Database; Customer Validation OK? (YES/NO); Stop.]
APPENDIX B: ARCHITECTURE
[Architecture diagram: Open Source Big Data DQA Platform. Components include:
• Structured databases and file-based data as inputs.
• Rules Repository.
• DQA platform processing, using the capabilities of MapReduce to apply statistical algorithms, and applying basic standard rules using a combination of SQL and MapReduce to get the best blend of performance and ease-of-design/build capabilities.
• DQA Database.
• Rich Visualization and Web Data Service.
• Capability for multi-format data publishing, including files, databases, JSON docs, XML, etc.]
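For illustration of the SQL-plus-MapReduce blend noted in the platform bullets above, the sketch below pairs a basic standard rule expressed in SQL (missing product names) with a MapReduce-style map/reduce pair that computes the product code frequencies fed into the statistical routines. SQLite and plain Python functions stand in for the platform's actual SQL and MapReduce engines, and the table and column names are hypothetical.

```python
# Minimal sketch of the SQL + MapReduce blend noted in the platform bullets.
# SQLite and plain Python functions stand in for the platform's SQL and
# MapReduce engines; the table and column names are hypothetical.
import sqlite3
from itertools import groupby
from operator import itemgetter

rows = [
    ("1000", "Widget A"),
    ("2000", None),        # missing product name
    ("6000", "G@dget#1"),
]

# --- Basic standard rule expressed in SQL (missing data check) ---------------
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_code TEXT, product_name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
missing = conn.execute(
    "SELECT product_code FROM products "
    "WHERE product_name IS NULL OR TRIM(product_name) = ''"
).fetchall()
print("Missing product names:", [code for (code,) in missing])

# --- MapReduce-style frequency count (input to the statistical routines) -----
def mapper(record):
    product_code, _name = record
    yield product_code, 1                 # emit (key, 1) for each occurrence

def reducer(key, values):
    return key, sum(values)               # sum the counts per product code

mapped = sorted(kv for record in rows for kv in mapper(record))
frequencies = dict(
    reducer(key, [count for _, count in group])
    for key, group in groupby(mapped, key=itemgetter(0))
)
print("Product code frequencies:", frequencies)
```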