The document discusses a webcast on data normalization hosted by the Tech Lab on September 24, 2014. It provides information on what the Tech Lab is, defines data normalization, and explains why normalization is necessary when dealing with disparate data sources. Examples of normalizing Federal Procurement data and using Hadoop to enable complex queries on normalized data are also presented.
1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
2. Episode 2: Back to Normal
Tech Lab Webcast | September 24, 2014
Sponsored by [sponsor logos not preserved]
3. What Is the Tech Lab?
• Real-world proving ground for enterprise software
• Designed to showcase the process of creating solutions
• Completely independent of sponsor influence
• Run by Master Scientist, Dr. Geoffrey Malafsky
• Projects span 3-6 months
4. What Is Data Normalization?
• Data Normalization is a process by which disparate data sets, terms, models and ontologies can be reconciled for the purpose of providing certifiably accurate enterprise data.
5. Why Is Normalization Necessary?
• Disparate Data Systems
• Disparate File Structures
• Disparate Data Models
• Variable Business Logic
• Conflicting Data Values
• Serious Semantic Issues
6. How Hadoop Can Help
• Robust platform for data persistence
• Relatively easy to connect to enterprise apps
• Enables 'future-proofing' by avoiding lock-in
• Growing array of parallel processing functions
• New standard for data management
• No need to delete data, enabling roll-back
10.
• Normalizing data is more sophisticated than what is commonly done in integration
• It combines subject matter knowledge, governance, business rules, and raw data.
• Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.”
11. The State of Corporate Data
• multiple instances of source data
• multiple definitions for reporting
• multiple copies of data
• variable structures
• hidden conflicts in data definitions
• different data values
• which source to use
• different model types & standards
• more storage, esp. when multiplied by environments
• more data flows to develop and maintain
• more than 100 DW or data marts downstream
• different methods for ETL
• complex dependencies, difficult for impact assessment
• conflicting business logic & views
• global analyses & aggregations restricted by inconsistencies
Copyright PSIKORS Institute 2013
13. Data Normalization Showcase
• FPDS is an open source of Federal Procurement data that has poor quality and consistency.
  – Approx. 10M+ records, each with 306 columns = 25 GB raw text
  – Structured data except for some free text fields
• We are normalizing it for analysis of IT expenditures for a real client
• Queries are used by analysts supported by Hadoop environment via Data Normalization platform
14. Normalization Begins with Understanding Data
• Databases are supposed to have official information on formal acquisition of IT assets.
  – Contracts DB not aligned with Procurement DB
    • Example, FA330012Dxxx in one but not other
• Differing data sets and values
  – FA330012F0005: Same in both
  – FA330012P0020: Contracts DB: 10 items; FPDS: 1 item; Same description, same total dollars
  – HQ042312*: Contracts 6 = $278.4K, FPDS 1 = $48K
    • $48K is one of 6 records in Contracts
Copyright PSIKORS Institute 2014
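The comparison described on this slide can be sketched in a few lines. This is a minimal illustration, not the Tech Lab's actual tooling: it assumes each source has already been summarized per contract number as (record count, total dollars), the function name `compare_sources` is hypothetical, and every value marked "placeholder" is invented for the example. Only the HQ042312* figures (6 records totalling $278.4K versus 1 record of $48K) and the item counts for FA330012P0020 come from the slide.

```python
from typing import Dict, List, Tuple

Summary = Tuple[int, float]  # (number of records, total dollars)


def compare_sources(contracts: Dict[str, Summary],
                    fpds: Dict[str, Summary]) -> List[Tuple[str, str]]:
    """Return (key, finding) pairs describing where the two sources disagree."""
    findings = []
    for key in sorted(set(contracts) | set(fpds)):
        if key not in fpds:
            findings.append((key, "only in Contracts DB"))
        elif key not in contracts:
            findings.append((key, "only in FPDS"))
        else:
            c_count, c_total = contracts[key]
            f_count, f_total = fpds[key]
            if abs(c_total - f_total) > 0.005:
                findings.append((key, "totals differ"))
            elif c_count != f_count:
                findings.append((key, "same total, different record counts"))
    return findings


contracts_db = {
    "FA330012Dxxx": (1, 1000.0),     # placeholder; assumed to be the source that has it
    "FA330012F0005": (1, 500.0),     # placeholder values, same in both sources
    "FA330012P0020": (10, 2000.0),   # 10 items (slide); total is a placeholder
    "HQ042312*": (6, 278_400.0),     # from the slide
}
fpds_db = {
    "FA330012F0005": (1, 500.0),     # placeholder values, same in both sources
    "FA330012P0020": (1, 2000.0),    # 1 item (slide); total is a placeholder
    "HQ042312*": (1, 48_000.0),      # from the slide
}

findings = compare_sources(contracts_db, fpds_db)
for key, finding in findings:
    print(key, "->", finding)
```

Keys that agree in both sources produce no finding, so the output lists only the cases an analyst would need to investigate.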
16. Converting supposedly same primary keys into normalized values that can be compared: contract number
Key business logic as buried in a database stored procedure (condensed):
• If (DELIVERY_ORDER = NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1 = '0') v_modification_number = '0' else v_modification_number = x2
  – where x1: if (ACO_MOD = NULL) x1 = x3 else x1 = ACO_MOD
  – where x3: if (PCO_MOD = NULL) x3 = '0' else x3 = PCO_MOD
  – where x2: if (x4 = NULL) x2 = '0' else x2 = x4
  – where x4: x4 = LTRIM(x5)
  – where x5: x5 = x1
  – essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD and sets = '0' if these are NULL
• If (DELIVERY_ORDER = NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
  – where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
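The condensed stored-procedure logic above can be restated as a small function, which makes the fallback order easier to follow. This is a sketch under stated assumptions rather than the original procedure: the function name `normalize_keys` is hypothetical, SQL NULL is modelled as Python `None`, and a value that is empty after left-trimming is treated like NULL (an assumption about how the database's LTRIM result was tested).

```python
from typing import Dict, Optional


def normalize_keys(contract: Optional[str],
                   delivery_order: Optional[str],
                   aco_mod: Optional[str],
                   pco_mod: Optional[str],
                   ref_proc_instrument: Optional[str]) -> Dict[str, Optional[str]]:
    """Derive comparable key values from the raw contract fields."""
    # v_piid: use the delivery order when there is one, else the contract number.
    v_piid = contract if delivery_order is None else delivery_order

    # x1: prefer ACO_MOD, fall back to PCO_MOD, and finally to '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod

    # x5 = x1; x4 = LTRIM(x5); x2 = x4, or '0' when x4 is NULL.
    # Assumption: an empty string after trimming counts as NULL.
    x4 = x1.lstrip(" ")
    x2 = "0" if x4 == "" else x4
    v_modification_number = "0" if x1 == "0" else x2

    # v_idv_piid: with no delivery order, use REF_PROC_INSTRUMENT without '-';
    # otherwise the contract number is the referenced IDV.
    if delivery_order is None:
        v_idv_piid = (None if ref_proc_instrument is None
                      else ref_proc_instrument.replace("-", ""))
    else:
        v_idv_piid = contract

    return {
        "v_piid": v_piid,
        "v_modification_number": v_modification_number,
        "v_idv_piid": v_idv_piid,
    }


# Illustrative inputs only; these identifiers are made up for the example.
print(normalize_keys("FA330012D0001", None, None, None, "FA3300-12-D-0001"))
print(normalize_keys("FA330012D0001", "0005", None, "  P00002", None))
```

Writing the rule as one function with named intermediate values keeps the x1 to x5 chain from the slide visible, so it can be checked line by line against the stored procedure.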
21. Storing Term Rules in Master Codes
Note wildcard character (*) in middle as well as front and back
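One way to evaluate term rules that carry a wildcard at the front, back, or middle is to translate each rule into a regular expression. The sketch below is only an illustration of that idea: the helper names `term_rule_to_regex` and `matches` and the sample rules are hypothetical, and the slide does not say how the Data Normalization platform actually stores or evaluates its rules.

```python
import re


def term_rule_to_regex(rule: str) -> re.Pattern:
    """Translate a term rule in which '*' matches any run of characters."""
    # Escape everything except '*', so other punctuation is matched literally.
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE | re.DOTALL)


def matches(rule: str, text: str) -> bool:
    """True when the whole text satisfies the rule."""
    return term_rule_to_regex(rule).fullmatch(text) is not None


# Hypothetical rules showing a wildcard at the front, back, and middle.
print(matches("*server*", "blade server rack"))      # wildcard front and back
print(matches("soft*license", "software license"))   # wildcard in the middle
print(matches("soft*license", "software licenses"))  # trailing 's' is not covered
```

Using `fullmatch` means a rule without a trailing `*` must match through the end of the text, which is what makes a wildcard at the back meaningful.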
22. Complicated Queries are Often Needed
Looking for a combination of keywords with wildcards along with structured values
SELECT
  recordid, contractingagencyid, contractingagencyname, orgcode, orgid, modificationnumber, piid, piidagencyid,
  solicitationid, effectivedate, fiscalyear, fundingagencyid, fundingagencyname, typeofcontract,
  consolidatedcontractdesc, descofreq, naicscode, naicsdesc, productorservicecode, productorservicedesc,
  globaldunsnumber, dunsnumber, globalvendorname, vendorname, datesigned, referencedidvpiid,
  referencedidvagencyid, referencedidvmodnumber, contractingdepartmentid, contractingdepartmentname,
  contractingofficeid, contractingofficename, contractingofficeregion, funcdimenddate, funcdimstartdate,
  function1, function1value, function2, function2value, function3, function3value, majorcommandcode,
  majorcommandid, majorcommandname, parentmacomcode, primarydimensionid, primarydimensionvalueid,
  secondarydimensionid, secondarydimensionvalueid, subcommand1code, subcommand1id, subcommand1name,
  subcommand2code, subcommand2id, subcommand2name, subcommand3code, subcommand3id, subcommand3name,
  subcommand4code, subcommand4id, subcommand4name, tertiarydimensionid, tertiarydimensionvalueid,
  transactionnumber, lastdatetoorder, completiondate, estultimatecompletiondate, signeddate, fundingofficeid,
  fundingofficename, isfundedforeignentitycode, isfundedforeignentitydesc, reasoninteragencycontracting,
  feeforuseofservice, fixed, lowervalue, maximumorderlimit, orderingprocedure, uppervalue, websiteurl,
  whocanuse, feepaidforuseofidv, programacronym, typeofidc, a76actioncode, a76actiondesc,
  contingencyhumanitarianpeaceop, contractfinancing, costacctstdclausecode, costacctstdclausedesc,
  costorpricingdata, emailaddress, gfegfpcode, gfegfpdesc, inherentlygovernmentaldesc,
  inherentlygovernmentalfunction, lettercontractundefactioncode, lettercontractundefactiondesc, majorprogram,
  multipleorsingleawardidv, multiyearcontractcode, multiyearcontractdesc, nationalinterestaction,
  nationalinterestdesc, numberofactions, performancebasedserviceacqcode, performancebasedserviceacqdesc,
  purchasecardpaymethodcode, purchasecardpaymethoddesc, seatransportation, subcontractplan,
  treasuryacctsymbolagencyid, treasuryacctsymbolinitiative, treasuryacctsymbolmaincode,
  treasuryacctsymbolsubcode, clingercohenactcode, clingercohenactdesc, davisbaconactcode, davisbaconactdesc,
  economyact, interagencycontractingauthcode, interagencycontractingauthdesc, otherstatutoryauthdesc,
  servicecontractactdesc, servicecontractactcode, walshhealeyactcode, walshhealeyactdesc, bundledreqs,
  claimantprogramcode, consolidatedcontractcode, domesticorforeignentitycode, domesticorforeignentitydesc,
  infotechcommercialitemcategory, recoveredmaterialssustain, recoveredmaterialssustaindesc,
  systemequipmentcode, useofepadesignatedproducts, congrdistrictplaceofperf, placeofperfzipcode,
  princplaceofperfcityname, princplaceofperfcountrycode, princplaceofperfcountryname,
  princplaceofperfcountycode, princplaceofperfcountyname, princplaceofperflocationcode,
  princplaceofperfstatecode, countryprodserviceorigincode, placeofmanufacture, placeofmanufacturedesc,
  alternativeadvertising, commercialitemacqperoccode, commercialitemacqperocdesc, commercialitemtestprogram,
  commercialitemtestprogramdesc, evaluatedpreference, extentcompeted, fairopportunitylimitedsources,
  fedbizoppscode, fedbizoppsdesc, localareasetasidecode, localareasetasidedesc, numberofoffersreceived,
  otherthanfullopencompetition, preawardyosynopsis, priceevaluationpercentdiff, sbaorofppsynopsiswaiverpilot,
  sbirsttr, smallbuscompdemoprog, solicitationperoc, typeofsetaside, awardoridvtype, createdvia,
  lastmodifiedby, lastmodifieddate, part8orpart13, preparedby, prepareddate, reasonformodificationcode,
  reasonformodificationdesc, congrdistrictcontractor, contractorname, doingbusasname, samexception, street,
  street2, vendorcity, vendorcountry, vendorphonenumber, vendorstate, zip, is1862landgrantcollege,
  is1890landgrantcollege, is1994landgrantcollege, isairportauth, isalaskannativecorpownedfirm,
  isalaskannativeservicinginst, isamericanindianowned, isasianpacificamericanowned, isblackamericanowned,
  isbothcontractsandgrants, iscity, iscommdevelopedcorpownedfirm, iscommdevelopmentcorp, iscontracts,
  iscorporateentitynottaxexempt, iscorporateentitytaxexempt, iscouncilofgovernments,
  iscountryofincorporation, iscounty, isdomesticshelter, isdotcertdisbusent, iseducationalinst,
  isemergingsmallbus, isfederalagency, isfedfundedresanddevcorp, isforprofitorg, isforeigngovernment,
  isforeignownedandlocated, isfoundation, isgrants, ishispanicamericanowned, ishispanicservicinginst,
  isvendorhbcu, ishospital, ishousingauthpublictribal, isindiantribe, isintermunicipal, isinternationalorg,
  isinterstateentity, islaborsurplusareafirm, islimitedliabilitycorp, islocalgovernmentowned,
  ismanufacturerofgoods, isminorityinsts, isminorityownedbus, ismunicipality, isnativeamericanowned,
  isnativehawaiianorgownedfirm, isnativehawaiianservicinginst, isnonprofitorg, isotherminorityowned,
  isothernotforprofitorg, ispartnershipllp, isplanningcommission, isportauth, isprivateuniversityorcollege,
  issbacert8ajointventure, issbacert8aprogparticipant, issbacerthubzonefirm, issbacertsmalldisbus,
  isschooldistrict, isschoolofforestry, isselfcertifedsmalldisbus, isservicedisabledvetownedbus,
  issmallagriculturalcooperative, issoleproprietorship, isstatecontrinsthigherlearn, isstateofincorporation,
  issubchapterscorp, issubcontasianindianamerowned, istheabilityoneprog, istownship, istransitauth,
  istribalcollege, istriballyowned, isusfederalgovernment, isusgovernmententity, isuslocalgovernment,
  isusstategovernment, isveteranownedbus, isveterinarycollege, isveterinaryhospital, iswomanownedbus,
  istypeecondiswosb, istypejventecondiswosb, istypejventwosb, istypewosb, contractingo{ussizeselection,
  reasonnotawardedtosmallbus, reasonnotawardedtosmalldisbus, idvbundledreqs, idvcontractingagencyid,
  idvcontractingagencyname, idvcontractingo{ussizesel, idvdepartmentid, idvdepartmentname, idvmajorprogcode,
  idvmultipleorsingleawardidv, idvnaicscode, idvnaicsdesc, idvpart8orpart13, idvprogacronym,
  idvreferencedidvagencycode, idvreferencedidvpiid, idvsubcontractplan, idvsubcontractplandesc,
  idvtypeofcontractpricing, idvtypeofcontractpricingdesc, idvtypeofidc, idvtypeofidcdesc, idvwhocanuse,
  idvwhocanusedesc, missing301, currentcontractvalue, actionobligation, ultimatecontractvalue
FROM fpdsrawrecords.records
WHERE LOWER(fundingagencyid) = '97as'
  AND LOWER(fiscalyear) = '2013'
  AND (LOWER(productorservicecode) LIKE '70%' OR LOWER(productorservicecode) LIKE 'd3%')
LIMIT 1000
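To see what the filter in this query selects, the same WHERE clause can be run against a few toy rows. The sketch below uses Python's built-in `sqlite3` as a stand-in engine, which is an assumption made purely for illustration (the webcast ran the query on Impala and Hive). The table is reduced to four columns, is named `records` without the `fpdsrawrecords` database prefix, and all row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "recordid INTEGER, fundingagencyid TEXT, fiscalyear TEXT, productorservicecode TEXT)"
)

# Invented sample rows: only rows 1 and 2 satisfy all three conditions.
sample_rows = [
    (1, "97AS", "2013", "7030"),   # agency, year, and '70%' code all match
    (2, "97AS", "2013", "D302"),   # agency, year, and 'd3%' code all match
    (3, "97AS", "2012", "7030"),   # wrong fiscal year
    (4, "1700", "2013", "D302"),   # wrong funding agency
    (5, "97AS", "2013", "R425"),   # code matches neither prefix
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", sample_rows)

query = """
SELECT recordid
FROM records
WHERE LOWER(fundingagencyid) = '97as'
  AND LOWER(fiscalyear) = '2013'
  AND (LOWER(productorservicecode) LIKE '70%'
       OR LOWER(productorservicecode) LIKE 'd3%')
ORDER BY recordid
LIMIT 1000
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)  # [1, 2]
```

The `ORDER BY` is added here only so the toy result is deterministic; the original query does not order its output.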
23. Query Timing
• Looking for combinations of text tokens (with wildcards) to known field values
• Queries are done both in Data Normalization platform and by command line interface on Hadoop server for Impala and Hive. Time differences are negligible but all times reported here are by CLI
  – Tables made for: text, Parquet, Parquet partitioned by 'fiscalyear' (6 values) and 'fundingagencyid' (approx. 25 values)
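Partitioning by 'fiscalyear' and 'fundingagencyid' gives on the order of 6 × 25 = 150 partitions if every combination occurs, and a query that fixes both values only has to read one of them. The toy model below illustrates that idea in plain Python. It is a conceptual sketch with invented rows and hypothetical function names, not a description of how Impala or Hive implement partition pruning.

```python
from collections import defaultdict


def build_partitions(rows):
    """Group rows by (fiscalyear, fundingagencyid), like a two-level partitioned table."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["fiscalyear"], row["fundingagencyid"])].append(row)
    return partitions


def query(partitions, fiscalyear, fundingagencyid, code_prefixes):
    """Read only the matching partition, then filter on the product/service code prefix.

    Returns the matching rows and the number of rows that had to be read.
    """
    candidate_rows = partitions.get((fiscalyear, fundingagencyid), [])
    matches = [
        row for row in candidate_rows
        if row["productorservicecode"].lower().startswith(code_prefixes)
    ]
    return matches, len(candidate_rows)


# Invented sample rows.
rows = [
    {"recordid": 1, "fiscalyear": "2013", "fundingagencyid": "97as", "productorservicecode": "7030"},
    {"recordid": 2, "fiscalyear": "2013", "fundingagencyid": "97as", "productorservicecode": "D302"},
    {"recordid": 3, "fiscalyear": "2012", "fundingagencyid": "97as", "productorservicecode": "7030"},
    {"recordid": 4, "fiscalyear": "2013", "fundingagencyid": "1700", "productorservicecode": "D302"},
    {"recordid": 5, "fiscalyear": "2013", "fundingagencyid": "97as", "productorservicecode": "R425"},
]

partitions = build_partitions(rows)
matches, rows_read = query(partitions, "2013", "97as", ("70", "d3"))
print([row["recordid"] for row in matches], "rows read:", rows_read, "of", len(rows))
```

In this toy data the query reads 3 of the 5 rows; on a real table the saving depends on how evenly records are spread across the partitions.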
24. [Chart: FPDS Hadoop Query Times, Text Field (secs). Vertical axis: 0 to 400 seconds. Labels: Hive, Impala, SQLServer; Text, Parquet, Parquet Partitioned.]
Evaluating query performance in Hadoop relative to format and comparing to RDBMS
25. [Chart: FPDS Text Queries per Limit (secs). Vertical axis: 0 to 250 seconds. Labels: Hive Text, Impala Text, Hive Parquet, Impala Parquet, Hive Parquet Part, Impala Parquet Part; 100 LIMIT, 1000 LIMIT, NO LIMIT.]
38. Special Programming for Hadoop
• Which Hadoop libraries? Intertwined so reference all.
• Otherwise: not much
  – HDFS filesystem
  – YARN containers
41. Parallel Jobs
• Three ways to run parallel jobs
  – Launch multiple Java sessions from command line
    • Same as in Windows, Linux
  – Use Cloudera Hue Job Designer
    • Easy and has management web pages
  – Data Normalization system
    • Coordinates governance, architecture, data models, codes, business rules
    • Define, submit YARN containers specifying Java jar, dictionaries, source files
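The first option, launching several sessions from the command line, can be scripted so the jobs start together and their exit codes are collected. The sketch below is a minimal illustration with a hypothetical helper `run_in_parallel`. So that it runs anywhere, each job is a trivial Python command standing in for a Java session; in the setting described on the slide each command would presumably look like `["java", "-jar", <job jar>, <source file>]`, with jar and file names that are not given in the source.

```python
import subprocess
import sys


def run_in_parallel(commands):
    """Start every command at once, then wait for all of them and return exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


# Stand-in jobs: three short Python processes instead of Java sessions.
commands = [
    [sys.executable, "-c", f"print('job {i} done')"]
    for i in range(3)
]

exit_codes = run_in_parallel(commands)
print(exit_codes)
```

Because all processes are started before any is waited on, total run time is close to that of the slowest job rather than the sum of all jobs.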
42. Key Code Analysis
  – Invoice data sets extracted with correlation
    • CAGE: 984274, DUNS: 973437
  – FPDS DUNS and Names extracted & correlated
    • 158181 unique DUNS codes
  – Will be included in normalized composite IT Asset records
  – Composite records for lookup added to Hadoop
    • By DUNS or Global DUNS: get all related DUNS, CAGE, names
    • By CAGE: get all related DUNS, names
    • By name: get all related DUNS, CAGE, names
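The three lookups listed above can be sketched as an in-memory index over composite records. This is an illustrative data structure only: the class name `CompositeLookup`, the record field names, and all sample codes are hypothetical, and "related" is interpreted here as "appearing in a record that shares the looked-up value", which is an assumption about what the composite records contain.

```python
from collections import defaultdict


class CompositeLookup:
    """Index composite vendor records by DUNS, Global DUNS, CAGE, and name."""

    FIELDS = ("duns", "global_duns", "cage", "name")

    def __init__(self, records):
        self.records = list(records)
        self.index = defaultdict(set)  # (field, value) -> positions of records
        for position, record in enumerate(self.records):
            for field in self.FIELDS:
                value = record.get(field)
                if value:
                    self.index[(field, value)].add(position)

    def _collect(self, positions, fields):
        """Gather the distinct values of the requested fields from the hit records."""
        result = {field: set() for field in fields}
        for position in positions:
            for field in fields:
                value = self.records[position].get(field)
                if value:
                    result[field].add(value)
        return result

    def by_duns(self, code):
        """By DUNS or Global DUNS: all related DUNS, CAGE, names."""
        hits = (self.index.get(("duns", code), set())
                | self.index.get(("global_duns", code), set()))
        return self._collect(hits, ("duns", "cage", "name"))

    def by_cage(self, code):
        """By CAGE: all related DUNS, names."""
        return self._collect(self.index.get(("cage", code), set()), ("duns", "name"))

    def by_name(self, name):
        """By name: all related DUNS, CAGE, names."""
        return self._collect(self.index.get(("name", name), set()), ("duns", "cage", "name"))


# Invented sample records.
lookup = CompositeLookup([
    {"duns": "000000001", "global_duns": "000000009", "cage": "0AAA1", "name": "example corp"},
    {"duns": "000000002", "global_duns": "000000009", "cage": "0AAA2", "name": "example corp west"},
    {"duns": "000000001", "global_duns": "000000009", "cage": "0AAA3", "name": "example corporation"},
])
print(lookup.by_duns("000000001"))
print(lookup.by_cage("0AAA2"))
```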
43. [Chart: Number CAGE Per DUNS Code. Vertical axis (log scale, 0.1 to 1,000,000): Number DUNS Codes With X CAGE Codes. Horizontal axis: CAGE codes per DUNS, from 1 to 119. Annotation: One DUNS code has 119 CAGE.]
44. [Chart: CAGE Codes from LookUp File, in millions (vertical axis 0 to 1.4). Labels: ToWAWF; Found, NotFound.]
45. [Chart: FPDS Number DUNS with N Global DUNS. Vertical axis: log scale, 0.1 to 1,000,000. Horizontal axis: N from 0 to 5.]
[Chart: FPDS: Number DUNS with N Names. Vertical axis: log scale, 0.1 to 100,000. Horizontal axis: N from 1 to 112. Annotation: 6849 instances for code = 123456787.]
46. [Chart: FPDS: Number Global DUNS with N DUNS. Vertical axis: Number Global DUNS, log scale, 0.1 to 10,000. Horizontal axis: Number DUNS, 0 to 250.]
[Chart: FPDS: Global DUNS with Multiple Names. Vertical axis: Number Global DUNS, log scale, 0.1 to 1,000. Horizontal axis: Number Names, 0 to 1400.]
48. FPDS DUNS With Most Names
DUNS       NGlobalDUNS  Nnames
123456787  0            6849
136666505  0            112
790238851  0            96
103933453  1            35
103385519  1            33
005149120  1            27
067641597  1            25
005103494  0            24
332619535  0            24
020751082  1            22
054781240  1            22
621599893  1            21
790238638  0            21
834476079  1            21

DUNS       Name
123456787  miscellaneous foreign contractors
123456787  etisalat c/o us consulate general dubai
123456787  boswedden house
123456787  turner engine controls b. v.
123456787  swissport hellas cargo s a
123456787  orbit couriers sa
123456787  goldair aviation handling s.a.
123456787  federal egov iae initiative generic duns
123456787  federal egov iae initiative - generic duns
123456787  miscellaneous foreign contractorsan
123456787  prc-desoto
123456787  inversiones sochagota e.u.
123456787  comcel
123456787  transporte y servicio lucio
123456787  jesse james members only maxi taxi svc
123456787  club naval de oficiales
123456787  inchcape shipping services
123456787  dr. thalia abatzi
123456787  central asia development group
123456787  bennett-fouch and associates
123456787  noor al-sabah company
123456787  ait/arc infrasture solutions
123456787  not available
123456787  77 construction company
136666505  adese genc petrol
136666505  amy lily chung
136666505  anderson erin ruth
136666505  andrew william knef
136666505  anduaga-arias laura
136666505  angelica m. de la cruz
136666505  anthony o'brien, 330531-5100194
136666505  batac belle
136666505  bottesini beth ms.
136666505  bouck shannon
136666505  bunn amy b.
136666505  carlene clark
136666505  cho, boong haeng
136666505  choe, sun young
136666505  christina michajlyszyn
136666505  christopher cannon
136666505  christopher l. booth
136666505  chun, kil mo
136666505  conflict + transition consultancies
136666505  cozzone elaine
136666505  deborah p. carney
136666505  denihan patricia joann
136666505  dong sook mcgeorge, 690525-2716816
136666505  dorene d.lukewalton,pharm d.
136666505  dr. terry a. klein
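Counts like the ones in this table can be derived by collecting the distinct names seen for each DUNS and then flagging codes with unusually many. The sketch below shows one way to do that. The function names and the threshold are hypothetical, the first DUNS and its three names are taken from the listing above, and the second DUNS and its name are invented for contrast.

```python
from collections import Counter, defaultdict


def names_per_duns(pairs):
    """Count distinct vendor names per DUNS, ignoring case and surrounding spaces."""
    names = defaultdict(set)
    for duns, name in pairs:
        names[duns].add(name.strip().lower())
    return {duns: len(distinct) for duns, distinct in names.items()}


def histogram(counts):
    """Map N to the number of DUNS codes that have exactly N names."""
    return Counter(counts.values())


def flag_suspect(counts, threshold):
    """Return DUNS codes whose name count is at or above the threshold."""
    return sorted(duns for duns, n in counts.items() if n >= threshold)


pairs = [
    ("123456787", "miscellaneous foreign contractors"),  # from the listing above
    ("123456787", "boswedden house"),                    # from the listing above
    ("123456787", "comcel"),                             # from the listing above
    ("123456787", "Comcel "),                            # same name, different form
    ("000000001", "example vendor"),                     # invented for contrast
]

counts = names_per_duns(pairs)
print(counts)
print(histogram(counts))
print(flag_suspect(counts, threshold=3))
```

A DUNS that accumulates thousands of unrelated names, as 123456787 does in the table, is a strong hint that it is a generic placeholder code rather than a single vendor.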
50. Prompted Collaboration and New Business Information
• Showing these results prompted discussions leading to:
  – There are generic DUNS heavily used but these are being removed from use via policy changes
  – System validation rules are not current with all policy
  – Additional “rules” of how to track, audit, align, merge spread by email
    • All put back into Data Normalization system and then into modified Java
• New results available over all data sets < 1 day