INTRODUCTION
Third Eye Consulting LLC (henceforth referred to as "TEC" in this document) is pleased to present this initial draft proposal for building a scalable and cost-effective Data Quality Audit Solution leveraging state-of-the-art open source Big Data technology. Third Eye Consulting LLC is a Big Data consulting firm that has successfully applied Big Data technologies to various applications that were previously deployed using traditional licensed tools, and has helped deliver high value to clients with the realization of optimal cost-benefits.
SCOPE
This initial draft proposal is based on a few assumptions drawn from preliminary conversations around the strategic need for Data Quality Audit (henceforth referred to as "DQA" in this document) solutions for IDEA.
In Scope
Per the conversation, IDEA's strategic needs are broadly interpreted as:
• Capability to perform audits on several million product codes and the associated data elements in the data flow.
• Scorecarding and flagging of poor-quality data in the absence of data governance and business rules defining the semantics of the data.
Out of Scope
Data cleaning and data correction are not part of this proposal.
ASSUMPTIONS
Standard assumptions made in this initial draft are:
1. The data set is made available on IDEA's servers.
2. The configuration of the servers (sandbox) for implementing the DQA framework/capabilities will conform to TEC's recommendations.
3. The TEC team will have remote access and privileges on the DQA server, per documented requests for such privileges, to install software, execute software processes, etc.
4. In the context of the strategic needs described in the preceding paragraph, no other assumptions regarding the data (e.g., its structure) are made in this initial draft, nor are they required.
METHODOLOGY
TEC's expertise and experience lie in implementing cost-effective solutions that deliver scalable, sustainable, high-value results to its customers. TEC will leverage open source and big data solutions to implement a state-of-the-art Data Quality Audit framework that uses statistical algorithms to identify data outliers, perform pattern matching, etc., in addition to rudimentary rules such as "missing data".
The TEC DQA methodology will set up a repeatable, Agile process that scales not just to growing data volumes, but also to new data formats, dynamically changing business rules, and the supporting infrastructure. It does so while recognizing the challenges posed by the lack of data governance, and by the dependence on external data with limited insight into it, and while progressively keeping costs flat or lower relative to other alternatives.
The flowchart in Appendix A illustrates the Agile methodology for implementing a repeatable DQA process. The box "Extrapolate DQA Rules" in the flowchart is the step where the TEC team will attempt to identify occurrences of data that can potentially be inferred as "bad data", e.g., special characters in product-name attributes, missing data, or skewed values in date fields (for example, the year 1000). Post review and customer acceptance, these rules will be plugged into, or designed and coded into, the framework, which will leverage the technical capabilities of big data to process large amounts of data. The rules will be generic and designed to scale across multiple data elements as and where applicable and possible.
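For illustration only, the sketch below shows the kind of generic, reusable rules the "Extrapolate DQA Rules" step might produce. It is a minimal Python sketch under stated assumptions: the column names (product_name, order_date), the plausible-year range, and the use of pandas are hypothetical; the actual rules will be extrapolated from IDEA's datasets, reviewed with the customer, and implemented on the big data platform (e.g., as SQL or MapReduce jobs).

```python
# Minimal sketch of extrapolated DQA rules. Column names (product_name,
# order_date) and the plausible-year range are hypothetical assumptions;
# pandas is used only to illustrate the logic. In the framework these rules
# would be reviewed with the customer and run on the big data platform.
import re
import pandas as pd

SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9 \-]")  # characters not expected in a product name
MIN_YEAR, MAX_YEAR = 1900, 2100                 # assumed plausible range for date fields


def flag_special_characters(df: pd.DataFrame, column: str) -> pd.Series:
    """Flag rows whose text contains unexpected special characters."""
    return df[column].fillna("").astype(str).str.contains(SPECIAL_CHARS)


def flag_missing(df: pd.DataFrame, column: str) -> pd.Series:
    """Flag rows with missing or empty values (rudimentary 'missing data' rule)."""
    return df[column].isna() | (df[column].astype(str).str.strip() == "")


def flag_skewed_dates(df: pd.DataFrame, column: str) -> pd.Series:
    """Flag rows whose date is unparseable or outside a plausible range (e.g. year 1000)."""
    years = pd.to_datetime(df[column], errors="coerce").dt.year
    return years.isna() | (years < MIN_YEAR) | (years > MAX_YEAR)


def apply_rules(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Apply a generic set of (rule, column) pairs; rules stay reusable across data elements."""
    flags = pd.DataFrame(index=df.index)
    for name, (rule, column) in rules.items():
        flags[name] = rule(df, column)
    flags["potential_bad_data"] = flags.any(axis=1)
    return flags


if __name__ == "__main__":
    sample = pd.DataFrame({
        "product_name": ["Widget A", "G@dget#1", None],
        "order_date": ["2013-05-01", "1000-01-01", "2012-12-31"],
    })
    rules = {
        "special_chars_in_name": (flag_special_characters, "product_name"),
        "missing_name": (flag_missing, "product_name"),
        "skewed_date": (flag_skewed_dates, "order_date"),
    }
    print(apply_rules(sample, rules))
```

Because each rule is a plain (rule, column) pair, the same check can be registered against any data element where it applies, which is how the framework keeps the rules generic.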
ARCHITECTURAL OVERVIEW
The figure in Appendix B depicts a bird's-eye view of the architecture. Furthermore, DQ auditing falls under varying degrees of complexity; the audit process will inherently be progressive, starting with a preliminary assessment performed on a case-by-case basis against the datasets:
1. Simple – candidates include data requiring basic checks (e.g., missing data) that can be implemented with SQL capabilities. Such scenarios, for the most part, are represented by standard technical or, sometimes, business rules, as in master data matching rules.
2. Medium – candidates can include address quality checks and phone number checks. Most of these routines are readily available in licensed tools or through third-party plug-ins, e.g., Melissa Data. However, certain non-standard data elements and scenarios are seldom offered by licensed tools and require innovative implementation techniques to be incorporated into the DQA framework. Examples include applying statistical routines to identify outlier data, using standard deviation, mean, frequency, etc.
As a very basic example, a simple spreadsheet graph is presented below. The product code "6000" has a frequency of 10 and appears skewed relative to the occurrences of all other product codes. In the absence of any definitive master data reference, this product code will be "flagged" as potential bad data (a minimal code sketch of this check follows the list below).
[Chart: Product Code Frequency]
Product Code   Frequency
1000           100
2000           100
3000           120
4000           145
5000           122
6000           10
PS: The rich visualization depicted in the architecture diagram extends to web technologies like HTML5 and SVG, as well as to spreadsheet applications like Microsoft Excel.
3. High – candidates include extrapolating rules across multiple datasets; one such example is identifying a "bad" product code by comparing it against multiple variables, including product code trending, referential associations, machine learning algorithms, etc.
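As referenced above, the following is a minimal sketch of the frequency-based outlier check illustrated by the chart. It is plain Python for illustration only: the frequencies are the sample values from the chart, and the z-score threshold is an assumed value; in the actual DQA framework this statistical routine would run on the big data platform over the full set of product codes.

```python
# Minimal sketch of the frequency-based outlier check illustrated by the chart
# above. The frequencies are the sample values from the chart; the z-score
# threshold of 1.5 is an assumption for illustration only.
import statistics

# Sample data from the example chart (product code -> frequency).
frequencies = {
    "1000": 100,
    "2000": 100,
    "3000": 120,
    "4000": 145,
    "5000": 122,
    "6000": 10,
}


def flag_frequency_outliers(freqs: dict, z_threshold: float = 1.5) -> list:
    """Return product codes whose frequency deviates sharply from the mean."""
    values = list(freqs.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [code for code, count in freqs.items()
            if abs(count - mean) / stdev > z_threshold]


if __name__ == "__main__":
    # With this sample, only product code "6000" (frequency 10) is flagged.
    print(flag_frequency_outliers(frequencies))
```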
BENEFITS
TEC's Agile methodology, coupled with open source big data capabilities, presents the following benefits:
1. Cost – As TEC will use open source technologies, the CAPEX required to launch a robust DQA program is largely reduced.
2. While most licensed tools offer out-of-the-box functions for DQA, they often fall short on custom capabilities, or carry high costs and offer little transparency into scalability and implementation. TEC will closely partner with IDEA, bringing clear visibility into each step of the process, as depicted in the flowchart. Of course, some out-of-the-box features might still need to be procured; for example, flagging an "address" as bad data would potentially require USPS data validation routines.
3. The open source Big Data analytics and visualization framework can be scaled across other applications, infrastructure, and DQ capabilities, while maintaining a low Total Cost of Ownership.
4. The TEC methodology will deliver quick wins in much shorter cycles, owing to the Agile engagement, as opposed to going through a full program life cycle to derive the initial results.
APPENDIX A: DQA METHODOLOGY FLOWCHART
[Flowchart: the repeatable DQA process. Nodes and decision points include: Receive Data Set; Load into Database; Preliminary Analysis; Extrapolate DQA Rules; Create DQA Rules; Publish Rules; Customer Accepted? (YES/NO, with a customer engagement and Re-Assess Scenario? loop); Apply Rules to Data Set; Analyze Data / Rules; Generate DQ Metrics; Flag Bad Data; Load into Publish-Ready DQA Database; Load into Publish-Ready DQA Rules Database; Customer Validation OK? (YES/NO); Stop.]
APPENDIX B: ARCHITECTURE
[Architecture diagram: Open Source Big Data DQA Platform. Components include:
• Structured databases and file-based data as inputs.
• Rules Repository.
• DQA platform processing, using the capabilities of MapReduce to apply statistical algorithms, and applying basic standard rules using a combination of SQL and MapReduce to get the best blend of performance and ease-of-design/build capabilities.
• DQA Database.
• Rich Visualization and Web Data Service.
• Capability for multi-format data publishing, including files, databases, JSON docs, XML, etc.]
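For illustration of the SQL-plus-MapReduce blend noted in the platform bullets above, the sketch below pairs a basic standard rule expressed in SQL (missing product names) with a MapReduce-style map/reduce pair that computes the product code frequencies fed into the statistical routines. SQLite and plain Python functions stand in for the platform's actual SQL and MapReduce engines, and the table and column names are hypothetical.

```python
# Minimal sketch of the SQL + MapReduce blend noted in the platform bullets.
# SQLite and plain Python functions stand in for the platform's SQL and
# MapReduce engines; the table and column names are hypothetical.
import sqlite3
from itertools import groupby
from operator import itemgetter

rows = [
    ("1000", "Widget A"),
    ("2000", None),        # missing product name
    ("6000", "G@dget#1"),
]

# --- Basic standard rule expressed in SQL (missing data check) ---------------
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_code TEXT, product_name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
missing = conn.execute(
    "SELECT product_code FROM products "
    "WHERE product_name IS NULL OR TRIM(product_name) = ''"
).fetchall()
print("Missing product names:", [code for (code,) in missing])

# --- MapReduce-style frequency count (input to the statistical routines) -----
def mapper(record):
    product_code, _name = record
    yield product_code, 1                 # emit (key, 1) for each occurrence

def reducer(key, values):
    return key, sum(values)               # sum the counts per product code

mapped = sorted(kv for record in rows for kv in mapper(record))
frequencies = dict(
    reducer(key, [count for _, count in group])
    for key, group in groupby(mapped, key=itemgetter(0))
)
print("Product code frequencies:", frequencies)
```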