Grab some coffee and enjoy the pre-show banter before the top of the hour!
Episode 2: Back to Normal
Tech Lab Webcast | September 24, 2014
Sponsored by
What Is the Tech Lab?
• Real-world proving ground for enterprise software
• Designed to showcase the process of creating solutions
• Completely independent of sponsor influence
• Run by Master Scientist, Dr. Geoffrey Malafsky
• Projects span 3-6 months
What Is Data Normalization?
• Data Normalization is a process by which disparate data sets, terms, models and ontologies can be reconciled for the purpose of providing certifiably accurate enterprise data.
Why Is Normalization Necessary?
• Disparate Data Systems
• Disparate File Structures
• Disparate Data Models
• Variable Business Logic
• Conflicting Data Values
• Serious Semantic Issues
How Hadoop Can Help
• Robust platform for data persistence
• Relatively easy to connect to enterprise apps
• Enables 'future-proofing' by avoiding lock-in
• Growing array of parallel processing functions
• New standard for data management
• No need to delete data, enabling roll-back
Questions?
Thank you!
FIND THE ARCHIVE AT InsideAnalysis.com
DATA SCIENCE AND HADOOP TO NORMALIZE CORPORATE DATA
u Normalizing 
data 
is 
more 
sophisEcated 
than 
what 
is 
commonly 
done 
in 
integraEon 
u It 
combines 
subject 
maaer 
knowledge, 
governance, 
business 
rules, 
and 
raw 
data. 
u Small 
Data 
is 
“corporate 
structured 
data 
that 
is 
the 
fuel 
of 
its 
main 
ac2vi2es, 
and 
whose 
problems 
with 
accuracy 
and 
trustworthiness 
are 
past 
the 
stage 
of 
being 
alleged. 
This 
includes 
financial, 
customer, 
company, 
inventory, 
medical, 
risk, 
supply 
chain, 
and 
other 
primary 
data 
used 
for 
decision 
making, 
applica2ons, 
reports, 
and 
Business 
Intelligence.”
The State of Corporate Data
[Diagram labels]
• multiple instances of source data
• multiple definitions for reporting
• multiple copies of data
• variable structures
• hidden conflicts in data definitions
• different data values: which source to use
• different model types & standards
• more storage, esp. when multiplied by environments
• more data flows to develop and maintain
• more than 100 DW or data marts downstream
• different methods for ETL
• complex dependencies, difficult for impact assessment
• conflicting business logic & views
• global analyses & aggregations restricted by inconsistencies
Copyright PSIKORS Institute 2013
Copyright PSIKORS Institute 2014
Data Normalization Showcase
• FPDS is an open source of Federal Procurement data that has poor quality and consistency.
  – Approx. 10M+ records, each with 306 columns = 25 GB raw text
  – Structured data except for some free text fields
• We are normalizing it for analysis of IT expenditures for a real client
• Queries are used by analysts supported by Hadoop environment via Data Normalization platform
Normalization Begins with Understanding Data
• Databases are supposed to have official information on formal acquisition of IT assets.
  – Contracts DB not aligned with Procurement DB
    • Example: FA330012Dxxx in one but not the other
• Differing data sets and values
  – FA330012F0005: Same in both
  – FA330012P0020: Contracts DB: 10 items; FPDS: 1 item; same description, same total dollars
  – HQ042312*: Contracts DB: 6 records = $278.4K; FPDS: 1 record = $48K
    • $48K is one of the 6 records in Contracts DB
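The three kinds of mismatch on this slide (key present in only one source, same total split into a different number of items, and differing totals) can be detected with a simple per-key comparison. The following is a minimal sketch, not the Tech Lab's code: the function names, the `(contract_number, dollars)` record shape, and the sample keys `K1`..`K5` are all hypothetical.

```python
from collections import defaultdict


def summarize(records):
    """Group (contract_number, dollars) rows into {key: (item_count, total)}."""
    summary = defaultdict(lambda: [0, 0.0])
    for contract_number, dollars in records:
        entry = summary[contract_number]
        entry[0] += 1
        entry[1] += dollars
    return {key: (count, total) for key, (count, total) in summary.items()}


def compare_sources(contracts_db, fpds, tolerance=0.01):
    """Classify every contract number seen in either source."""
    a, b = summarize(contracts_db), summarize(fpds)
    report = {}
    for key in sorted(set(a) | set(b)):
        if key not in a:
            report[key] = "missing in Contracts DB"
        elif key not in b:
            report[key] = "missing in FPDS"
        else:
            (count_a, total_a), (count_b, total_b) = a[key], b[key]
            if abs(total_a - total_b) > tolerance:
                report[key] = "totals differ"
            elif count_a != count_b:
                report[key] = "same total, different item count"
            else:
                report[key] = "match"
    return report


# Hypothetical sample rows, not taken from FPDS or any contracts database.
contracts_db = [("K1", 50.0), ("K2", 10.0), ("K2", 20.0),
                ("K3", 40.0), ("K3", 60.0), ("K4", 5.0)]
fpds = [("K1", 50.0), ("K2", 30.0), ("K3", 40.0), ("K5", 7.0)]
report = compare_sources(contracts_db, fpds)
```

A report like this only flags where the sources disagree; deciding which source is authoritative still needs the subject matter knowledge the slides describe.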
Converting supposedly same primary keys into normalized values that can be compared: contract number
• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0' else v_modification_number = x2
  – where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
  – where x3: if (PCO_MOD=NULL) x3='0' else x3=PCO_MOD
  – where x2: if (x4=NULL) x2='0' else x2=x4
  – where x4: x4 = LTRIM(x5)
  – where x5: x5=x1
  – essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD, and sets = '0' if these are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
  – where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
Key business logic as buried in a database stored procedure (condensed)
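For readers who find the nested "where" clauses hard to follow, the same rules can be written as one small function. This is a sketch under stated assumptions rather than the stored procedure itself: the function name and the dict-based row are hypothetical, `None` stands in for SQL NULL, LTRIM is taken to strip leading spaces only, and a NULL REF_PROC_INSTRUMENT is assumed to stay NULL.

```python
def normalize_contract_keys(row):
    """Derive v_piid, v_modification_number and v_idv_piid from one source row.

    `row` is a dict keyed by the column names used on the slide.
    """
    contract = row.get("CONTRACT")
    delivery_order = row.get("DELIVERY_ORDER")

    # v_piid: prefer the delivery order, fall back to the contract number.
    v_piid = contract if delivery_order is None else delivery_order

    # x3 / x1: use ACO_MOD, else PCO_MOD, else '0'.
    pco_mod = row.get("PCO_MOD")
    x3 = "0" if pco_mod is None else pco_mod
    aco_mod = row.get("ACO_MOD")
    x1 = x3 if aco_mod is None else aco_mod

    # x5 -> x4 -> x2: left-trim the chosen modification value.
    if x1 == "0":
        v_modification_number = "0"
    else:
        x4 = x1.lstrip(" ")  # assumption: LTRIM removes leading spaces only
        v_modification_number = "0" if x4 is None else x4

    # v_idv_piid: the referenced instrument without dashes, or the contract.
    if delivery_order is None:
        ref = row.get("REF_PROC_INSTRUMENT")
        v_idv_piid = None if ref is None else ref.replace("-", "")
    else:
        v_idv_piid = contract

    return {
        "v_piid": v_piid,
        "v_modification_number": v_modification_number,
        "v_idv_piid": v_idv_piid,
    }


# Hypothetical rows for illustration only.
with_order = normalize_contract_keys({
    "CONTRACT": "C1", "DELIVERY_ORDER": "D7",
    "ACO_MOD": None, "PCO_MOD": "  02", "REF_PROC_INSTRUMENT": "R-1-2",
})
without_order = normalize_contract_keys({
    "CONTRACT": "C1", "DELIVERY_ORDER": None,
    "ACO_MOD": None, "PCO_MOD": None, "REF_PROC_INSTRUMENT": "R-1-2",
})
```

Because `x1` can never be NULL (it defaults to '0'), the `x4=NULL` branch of the original logic is unreachable here; it is kept only to mirror the slide.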
SQL Queries via Hue: Impala
SQL Queries via Hue: Hive
Querying Impala From Data Normalization System
Simplifying Queries and Tying to Authoritative Management
Storing Term Rules in Master Codes
Note wildcard character (*) in middle as well as front and back
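A term rule that allows `*` at the front, back, or middle has to be translated before it can be used in a SQL `LIKE` predicate or tested locally. The sketch below shows one way to do that; the function names and sample terms are hypothetical, and the backslash escape for literal `%` and `_` is an assumption that should be checked against the target engine.

```python
import re


def wildcard_to_like(term, escape="\\"):
    """Translate a '*' wildcard term into a SQL LIKE pattern."""
    out = []
    for ch in term:
        if ch == "*":
            out.append("%")
        elif ch in ("%", "_", escape):
            out.append(escape + ch)  # keep LIKE metacharacters literal
        else:
            out.append(ch)
    return "".join(out)


def wildcard_matches(term, text):
    """Test a '*' wildcard term against text locally, ignoring case."""
    pattern = ".*".join(re.escape(part) for part in term.split("*"))
    return re.fullmatch(pattern, text, flags=re.IGNORECASE | re.DOTALL) is not None


like_pattern = wildcard_to_like("*soft*ware*")
hit = wildcard_matches("*soft*lic*", "Microsoft Office License")
miss = wildcard_matches("soft*", "Microsoft")
```

Keeping the rule in its `*` form and translating at query time means the same stored term can drive both the SQL engines and any local checks.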
Complicated Queries are Often Needed
Looking for a combination of keywords with wildcards along with structured values
SELECT
  recordid,contractingagencyid,contractingagencyname,orgcode,orgid,modificationnumber,piid,piidagencyid,
  solicitationid,effectivedate,fiscalyear,fundingagencyid,fundingagencyname,typeofcontract,
  consolidatedcontractdesc,descofreq,naicscode,naicsdesc,productorservicecode,productorservicedesc,
  globaldunsnumber,dunsnumber,globalvendorname,vendorname,datesigned,referencedidvpiid,
  referencedidvagencyid,referencedidvmodnumber,contractingdepartmentid,contractingdepartmentname,
  contractingofficeid,contractingofficename,contractingofficeregion,funcdimenddate,funcdimstartdate,
  function1,function1value,function2,function2value,function3,function3value,
  majorcommandcode,majorcommandid,majorcommandname,parentmacomcode,primarydimensionid,
  primarydimensionvalueid,secondarydimensionid,secondarydimensionvalueid,
  subcommand1code,subcommand1id,subcommand1name,subcommand2code,subcommand2id,subcommand2name,
  subcommand3code,subcommand3id,subcommand3name,subcommand4code,subcommand4id,subcommand4name,
  tertiarydimensionid,tertiarydimensionvalueid,transactionnumber,lastdatetoorder,completiondate,
  estultimatecompletiondate,signeddate,fundingofficeid,fundingofficename,isfundedforeignentitycode,
  isfundedforeignentitydesc,reasoninteragencycontracting,feeforuseofservice,fixed,lowervalue,
  maximumorderlimit,orderingprocedure,uppervalue,websiteurl,whocanuse,feepaidforuseofidv,
  programacronym,typeofidc,a76actioncode,a76actiondesc,contingencyhumanitarianpeaceop,
  contractfinancing,costacctstdclausecode,costacctstdclausedesc,costorpricingdata,emailaddress,
  gfegfpcode,gfegfpdesc,inherentlygovernmentaldesc,inherentlygovernmentalfunction,
  lettercontractundefactioncode,lettercontractundefactiondesc,majorprogram,multipleorsingleawardidv,
  multiyearcontractcode,multiyearcontractdesc,nationalinterestaction,nationalinterestdesc,
  numberofactions,performancebasedserviceacqcode,performancebasedserviceacqdesc,
  purchasecardpaymethodcode,purchasecardpaymethoddesc,seatransportation,subcontractplan,
  treasuryacctsymbolagencyid,treasuryacctsymbolinitiative,treasuryacctsymbolmaincode,
  treasuryacctsymbolsubcode,clingercohenactcode,clingercohenactdesc,davisbaconactcode,
  davisbaconactdesc,economyact,interagencycontractingauthcode,interagencycontractingauthdesc,
  otherstatutoryauthdesc,servicecontractactdesc,servicecontractactcode,walshhealeyactcode,
  walshhealeyactdesc,bundledreqs,claimantprogramcode,consolidatedcontractcode,
  domesticorforeignentitycode,domesticorforeignentitydesc,infotechcommercialitemcategory,
  recoveredmaterialssustain,recoveredmaterialssustaindesc,systemequipmentcode,
  useofepadesignatedproducts,congrdistrictplaceofperf,placeofperfzipcode,princplaceofperfcityname,
  princplaceofperfcountrycode,princplaceofperfcountryname,princplaceofperfcountycode,
  princplaceofperfcountyname,princplaceofperflocationcode,princplaceofperfstatecode,
  countryprodserviceorigincode,placeofmanufacture,placeofmanufacturedesc,alternativeadvertising,
  commercialitemacqperoccode,commercialitemacqperocdesc,commercialitemtestprogram,
  commercialitemtestprogramdesc,evaluatedpreference,extentcompeted,fairopportunitylimitedsources,
  fedbizoppscode,fedbizoppsdesc,localareasetasidecode,localareasetasidedesc,numberofoffersreceived,
  otherthanfullopencompetition,preawardyosynopsis,priceevaluationpercentdiff,
  sbaorofppsynopsiswaiverpilot,sbirsttr,smallbuscompdemoprog,solicitationperoc,typeofsetaside,
  awardoridvtype,createdvia,lastmodifiedby,lastmodifieddate,part8orpart13,preparedby,prepareddate,
  reasonformodificationcode,reasonformodificationdesc,congrdistrictcontractor,contractorname,
  doingbusasname,samexception,street,street2,vendorcity,vendorcountry,vendorphonenumber,vendorstate,
  zip,is1862landgrantcollege,is1890landgrantcollege,is1994landgrantcollege,isairportauth,
  isalaskannativecorpownedfirm,isalaskannativeservicinginst,isamericanindianowned,
  isasianpacificamericanowned,isblackamericanowned,isbothcontractsandgrants,iscity,
  iscommdevelopedcorpownedfirm,iscommdevelopmentcorp,iscontracts,iscorporateentitynottaxexempt,
  iscorporateentitytaxexempt,iscouncilofgovernments,iscountryofincorporation,iscounty,
  isdomesticshelter,isdotcertdisbusent,iseducationalinst,isemergingsmallbus,isfederalagency,
  isfedfundedresanddevcorp,isforprofitorg,isforeigngovernment,isforeignownedandlocated,isfoundation,
  isgrants,ishispanicamericanowned,ishispanicservicinginst,isvendorhbcu,ishospital,
  ishousingauthpublictribal,isindiantribe,isintermunicipal,isinternationalorg,isinterstateentity,
  islaborsurplusareafirm,islimitedliabilitycorp,islocalgovernmentowned,ismanufacturerofgoods,
  isminorityinsts,isminorityownedbus,ismunicipality,isnativeamericanowned,
  isnativehawaiianorgownedfirm,isnativehawaiianservicinginst,isnonprofitorg,isotherminorityowned,
  isothernotforprofitorg,ispartnershipllp,isplanningcommission,isportauth,
  isprivateuniversityorcollege,issbacert8ajointventure,issbacert8aprogparticipant,
  issbacerthubzonefirm,issbacertsmalldisbus,isschooldistrict,isschoolofforestry,
  isselfcertifedsmalldisbus,isservicedisabledvetownedbus,issmallagriculturalcooperative,
  issoleproprietorship,isstatecontrinsthigherlearn,isstateofincorporation,issubchapterscorp,
  issubcontasianindianamerowned,istheabilityoneprog,istownship,istransitauth,istribalcollege,
  istriballyowned,isusfederalgovernment,isusgovernmententity,isuslocalgovernment,
  isusstategovernment,isveteranownedbus,isveterinarycollege,isveterinaryhospital,iswomanownedbus,
  istypeecondiswosb,istypejventecondiswosb,istypejventwosb,istypewosb,contractingo{ussizeselection,
  reasonnotawardedtosmallbus,reasonnotawardedtosmalldisbus,idvbundledreqs,idvcontractingagencyid,
  idvcontractingagencyname,idvcontractingo{ussizesel,idvdepartmentid,idvdepartmentname,
  idvmajorprogcode,idvmultipleorsingleawardidv,idvnaicscode,idvnaicsdesc,idvpart8orpart13,
  idvprogacronym,idvreferencedidvagencycode,idvreferencedidvpiid,idvsubcontractplan,
  idvsubcontractplandesc,idvtypeofcontractpricing,idvtypeofcontractpricingdesc,idvtypeofidc,
  idvtypeofidcdesc,idvwhocanuse,idvwhocanusedesc,missing301,currentcontractvalue,actionobligation,
  ultimatecontractvalue
FROM fpdsrawrecords.records
WHERE (LOWER(fundingagencyid) = '97as')
  AND (LOWER(fiscalyear) = '2013')
  AND (LOWER(productorservicecode) LIKE '70%' OR LOWER(productorservicecode) LIKE 'd3%')
LIMIT 1000
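The WHERE clause in this query has a regular shape: equality tests on structured fields ANDed with groups of ORed `LIKE` patterns. A sketch of generating that shape from rule data follows. The function name, the argument layout, and the `%s` placeholder style are assumptions for illustration; the column names and values are the ones shown in the query.

```python
def build_where(equals, like_any):
    """Build a parameterized WHERE clause.

    equals:   {column: value}       -> (LOWER(column) = ?)
    like_any: {column: [patterns]}  -> (LOWER(column) LIKE ? OR ...)
    Returns (sql_fragment, params) with '%s' placeholders (assumed style).
    """
    clauses, params = [], []
    for column, value in equals.items():
        _check_identifier(column)
        clauses.append(f"(LOWER({column}) = %s)")
        params.append(value.lower())
    for column, patterns in like_any.items():
        _check_identifier(column)
        ors = " OR ".join(f"LOWER({column}) LIKE %s" for _ in patterns)
        clauses.append(f"({ors})")
        params.extend(p.lower() for p in patterns)
    return " AND ".join(clauses), params


def _check_identifier(column):
    # Column names are interpolated into SQL, so reject anything unusual.
    if not column.isidentifier():
        raise ValueError(f"unsafe column name: {column!r}")


where_sql, where_params = build_where(
    {"fundingagencyid": "97AS", "fiscalyear": "2013"},
    {"productorservicecode": ["70%", "D3%"]},
)
```

Values travel as parameters rather than being pasted into the SQL text, so only column names need validating.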
Query Timing
• Looking for combinations of text tokens (with wildcards) to known field values
• Queries are done both in Data Normalization platform and by command line interface on Hadoop server for Impala and Hive. Time differences are negligible but all times reported here are by CLI
  – Tables made for: text, Parquet, Parquet partitioned by 'fiscalyear' (6 values) and 'fundingagencyid' (approx. 25 values)
[Chart: FPDS Hadoop Query Times, Text Field (secs). Vertical axis: 0 to 400. Labels: Hive, Impala, SQLServer; Text, Parquet, Parquet Partitioned.]
Evaluating query performance in Hadoop relative to format and comparing to RDBMS
[Chart: FPDS Text Queries per LIMIT (secs). Vertical axis: 0 to 250. Labels: Hive Text, Impala Text, Hive Parquet, Impala Parquet, Hive Parquet Part, Impala Parquet Part; 100 LIMIT, 1000 LIMIT, NO LIMIT.]
Justin Erickson | Director, Product Management, Cloudera
QUERY PERFORMANCE IMPROVEMENT WITH IMPALA
Impala's Benefits
• Unlocks BI/analytics on Hadoop
  – Interactive SQL in seconds
  – Highly concurrent to handle 100s of users
• Native Hadoop flexibility
  – No data migration, conversion, or duplication required
  – Query existing Hadoop data
  – Run multiple frameworks on the same data at the same time
  – Supports Parquet for best-of-breed columnar performance
• Native MPP query engine designed into Hadoop:
  – Unified Hadoop storage
  – Unified Hadoop metadata (uses Hive and HCatalog)
  – Unified Hadoop security
  – Fine-grained role-based access controls with Sentry
• Apache-licensed open source
• Deployed across customers today
©2014 Cloudera, Inc. All Rights Reserved.
Impala Architecture
• MPP query engine built natively into Hadoop
[Diagram labels: SQL App, ODBC, SQL request; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
Impala's Multi-User over 9.5x Faster
Multi-user hardware utilization
Performance Takeaways
• Impala's advantage expands with just 10 users to >9.5x nearest competitor
  – Predominantly attributable to CPU efficiency
• Does not particularly matter which DAG is run for Hive
  – Shark (with Spark) and Tez produce very similar results
  – Both incrementally faster batch processing but not comparable to MPP databases
  – Difference is Spark is already proven with broad community and vendor adoption
• Mid-term trends will further favor Impala's design approach
  – More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  – CPU efficiency will increase in importance
  – Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
  – The Intel joint roadmap helps support these opportunities
• Upcoming benchmarks on the latest releases demonstrate this gap widening
NORMALIZING THE DATA
Capture Business Rules and Make Visible, Changeable, and Useful
Custom Multi-Use Normalization Methods Ready for Hadoop Parallel Execution
Data Normalization Library Enables Rapid Build, Deploy, Change Cycles
Special Programming for Hadoop
• Which Hadoop libraries? Intertwined so reference all.
• Otherwise: not much
  – HDFS filesystem
  – YARN containers
Parallel Jobs
• Three ways to run parallel jobs
  – Launch multiple Java sessions from command line
    • Same as in Windows, Linux
  – Use Cloudera Hue Job Designer
    • Easy and has management web pages
  – Data Normalization system
    • Coordinates governance, architecture, data models, codes, business rules
    • Define, submit YARN containers specifying Java jar, dictionaries, source files
Key Code Analysis
– Invoice data sets extracted with correlation
  • CAGE: 984274, DUNS: 973437
– FPDS DUNS and Names extracted & correlated
  • 158181 unique DUNS codes
– Will be included in normalized composite IT Asset records
– Composite records for lookup added to Hadoop
  • By DUNS or Global DUNS: get all related DUNS, CAGE, names
  • By CAGE: get all related DUNS, names
  • By name: get all related DUNS, CAGE, names
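The lookup behaviour described for the composite records (start from any one code or name and get every related code and name) can be illustrated with a small in-memory index. This is a sketch only: the record layout, function names, and sample values are hypothetical, and it follows direct co-occurrence in a record rather than chains across records.

```python
from collections import defaultdict

FIELDS = ("duns", "global_duns", "cage", "name")


def build_index(records):
    """Map (field, normalized value) to the composite records containing it."""
    index = defaultdict(list)
    for rec in records:
        for field in FIELDS:
            value = rec.get(field)
            if value:
                index[(field, value.strip().lower())].append(rec)
    return index


def lookup(index, field, value):
    """Return every DUNS, Global DUNS, CAGE and name seen alongside `value`."""
    related = {f: set() for f in FIELDS}
    for rec in index.get((field, value.strip().lower()), []):
        for f in FIELDS:
            if rec.get(f):
                related[f].add(rec[f])
    return related


# Hypothetical composite records, not real vendor codes.
records = [
    {"duns": "D1", "global_duns": "G1", "cage": "CA", "name": "Acme Corp"},
    {"duns": "D1", "global_duns": "G1", "cage": "CB", "name": "ACME CORPORATION"},
    {"duns": "D2", "global_duns": "G1", "cage": None, "name": "Acme Corp"},
]
index = build_index(records)
by_duns = lookup(index, "duns", "D1")
by_name = lookup(index, "name", "acme corp")
```

Keys are lower-cased and trimmed so that a lookup by name tolerates case differences, while the returned values keep the spelling found in the data.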
[Chart: Number CAGE Per DUNS Code. Vertical axis (log scale, 0.1 to 1,000,000): Number DUNS Codes With X CAGE Codes. Horizontal axis: CAGE codes per DUNS code, 1 to 119. Annotation: One DUNS code has 119 CAGE.]
[Chart: ToWAWF, CAGE Codes from LookUp File. Vertical axis: Millions, 0 to 1.4. Categories: Found, NotFound.]
[Chart: FPDS Number DUNS with N Global DUNS. Vertical axis: log scale, 0.1 to 1,000,000. Horizontal axis: 0 to 5.]
[Chart: FPDS: Number DUNS with N Names. Vertical axis: log scale, 0.1 to 100,000. Horizontal axis: 1 to 112. Annotation: 6849 instances for code = 123456787.]
[Chart: FPDS: Number Global DUNS with N DUNS. Vertical axis (log scale, 0.1 to 10,000): Number Global DUNS. Horizontal axis (0 to 250): Number DUNS.]
[Chart: FPDS: Global DUNS with Multiple Names. Vertical axis (log scale, 0.1 to 1,000): Number Global DUNS. Horizontal axis (0 to 1400): Number Names.]
[Chart: FPDS DUNS Code Matches to WAWF Codes. Vertical axis: 0 to 180,000. Categories: DUNS, GlobalDUNS. Series: Found, NotFound. Data labels: 140827, 13302, 17363, 942.]
FPDS DUNS With Most Names

DUNS        NGlobalDUNS  Nnames
123456787   0            6849
136666505   0            112
790238851   0            96
103933453   1            35
103385519   1            33
005149120   1            27
067641597   1            25
005103494   0            24
332619535   0            24
020751082   1            22
054781240   1            22
621599893   1            21
790238638   0            21
834476079   1            21

DUNS        Name
123456787   miscellaneous foreign contractors
123456787   etisalat c/o us consulate general dubai
123456787   boswedden house
123456787   turner engine controls b. v.
123456787   swissport hellas cargo s a
123456787   orbit couriers sa
123456787   goldair aviation handling s.a.
123456787   federal egov iae initiative generic duns
123456787   federal egov iae initiative - generic duns
123456787   miscellaneous foreign contractorsan
123456787   prc-desoto
123456787   inversiones sochagota e.u.
123456787   comcel
123456787   transporte y servicio lucio
123456787   jesse james members only maxi taxi svc
123456787   club naval de oficiales
123456787   inchcape shipping services
123456787   dr. thalia abatzi
123456787   central asia development group
123456787   bennett-fouch and associates
123456787   noor al-sabah company
123456787   ait/arc infrasture solutions
123456787   not available
123456787   77 construction company
136666505   adese genc petrol
136666505   amy lily chung
136666505   anderson erin ruth
136666505   andrew william knef
136666505   anduaga-arias laura
136666505   angelica m. de la cruz
136666505   anthony o'brien, 330531-5100194
136666505   batac belle
136666505   bottesini beth ms.
136666505   bouck shannon
136666505   bunn amy b.
136666505   carlene clark
136666505   cho, boong haeng
136666505   choe, sun young
136666505   christina michajlyszyn
136666505   christopher cannon
136666505   christopher l. booth
136666505   chun, kil mo
136666505   conflict + transition consultancies
136666505   cozzone elaine
136666505   deborah p. carney
136666505   denihan patricia joann
136666505   dong sook mcgeorge, 690525-2716816
136666505   dorene d.lukewalton,pharm d.
136666505   dr. terry a. klein
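Counts like "Nnames" and the "number of DUNS with N names" distributions can be produced with a group-and-count pass over (DUNS, name) pairs. The following is a minimal single-machine sketch with hypothetical function names and sample rows; it assumes names are compared after lower-casing and collapsing whitespace, which may differ from the rules actually used in the Tech Lab.

```python
from collections import Counter, defaultdict


def names_per_duns(rows):
    """Count distinct normalized names for each DUNS code."""
    names = defaultdict(set)
    for duns, name in rows:
        names[duns].add(" ".join(name.lower().split()))
    return {duns: len(found) for duns, found in names.items()}


def histogram(counts):
    """Map N -> number of DUNS codes that have exactly N names."""
    return dict(sorted(Counter(counts.values()).items()))


def top(counts, n):
    """DUNS codes with the most names, ties broken by code."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical rows, not taken from FPDS.
rows = [
    ("111", "Acme Corp"), ("111", "ACME  CORP"), ("111", "Acme Corporation"),
    ("222", "Beta LLC"),
    ("333", "Gamma"), ("333", "Gamma Inc"),
]
counts = names_per_duns(rows)
dist = histogram(counts)
leaders = top(counts, 2)
```

On the full data set the same grouping would more naturally run as a SQL `GROUP BY` in Hive or Impala; the sketch only shows the shape of the calculation.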
FPDS Global DUNS with Most Names & DUNS

Ordered by number of names:
GlobalDUNS  NDUNS  Nnames
877936518   12     27299
624770475   212    21866
148095086   80     21754
027079776   2      17128
103933453   86     17075
026157235   4      15694
963737366   106    15200
134303192   19     14481
067641597   108    13998
064680213   102    13809
077652761   93     12914
002204600   15     12570
039860122   44     12382
805258373   130    11995

Ordered by number of DUNS:
GlobalDUNS  NDUNS  Nnames
624770475   212    21866
805258373   130    11995
012003349   128    9748
877987347   127    8253
057272486   124    6935
007250079   123    9076
071767334   123    9474
158140041   117    6671
019710586   116    8163
091441089   116    7813
616924770   116    7217
067641597   108    13998
Prompted Collaboration and New Business Information
• Showing these results prompted discussions leading to:
  – There are generic DUNS heavily used but these are being removed from use via policy changes
  – System validation rules are not current with all policy
  – Additional "rules" of how to track, audit, align, merge spread by email
    • All put back into Data Normalization system and then into modified Java
• New results available over all data sets in <1 day
ADDITIONAL INFORMATION
Impala
Justin Erickson | Director, Product Management
September 2014
Impala Architecture: Query Execution
• Request arrives via ODBC/JDBC/Hue GUI/Shell
[Diagram labels: SQL App, ODBC, SQL request; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
Impala Architecture: Query Execution
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on impalads local to the data
[Diagram labels: SQL App, ODBC; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
Impala Architecture: Query Execution
• Intermediate results are streamed between impalads
• Query results are streamed back to client
[Diagram labels: SQL App, ODBC, query results; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
Try It Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
  – Live online
  – VM
  – Installation
• Questions/comments?
  – Community: http://impala.io/community
  – Email: impala-user@cloudera.org
©2014 
Cloudera, 
Inc. 
All 
Rights 
Reserved. 
57

More Related Content

Similar to Tech Lab Series - Episode II - Back to Normal

Phasic Systems - Dr. Geoffrey Malafsky
Phasic Systems - Dr. Geoffrey MalafskyPhasic Systems - Dr. Geoffrey Malafsky
Phasic Systems - Dr. Geoffrey MalafskyInside Analysis
 
The Dirty Work -- Why Data Must Be Reconciled
The Dirty Work -- Why Data Must Be ReconciledThe Dirty Work -- Why Data Must Be Reconciled
The Dirty Work -- Why Data Must Be ReconciledInside Analysis
 
Automatic Data Reconciliation, Data Quality, and Data Observability.pdf
Automatic Data Reconciliation, Data Quality, and Data Observability.pdfAutomatic Data Reconciliation, Data Quality, and Data Observability.pdf
Automatic Data Reconciliation, Data Quality, and Data Observability.pdf4dalert
 
Data Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AIData Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AIDenodo
 
OLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure SynapseOLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure SynapseAtScale
 
Henninger_MakingReferenceDataMoreMeaningful-Final
Henninger_MakingReferenceDataMoreMeaningful-FinalHenninger_MakingReferenceDataMoreMeaningful-Final
Henninger_MakingReferenceDataMoreMeaningful-FinalScott Henninger
 
Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...
Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...
Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...Denodo
 
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...Denodo
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data AnalyticsDatameer
 
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...Denodo
 
Why Big Data is Really about Small Data
Why Big Data is Really about Small DataWhy Big Data is Really about Small Data
Why Big Data is Really about Small DataHurwitz & Associates
 
Data Intelligence Overview
Data Intelligence OverviewData Intelligence Overview
Data Intelligence OverviewGDPR SMEs
 
Data summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data opsData summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data opsRyan Gross
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Denodo
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseCaserta
 
Open Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise ITOpen Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise ITandreas kuncoro
 

Similar to Tech Lab Series - Episode II - Back to Normal (20)

Phasic Systems - Dr. Geoffrey Malafsky
Phasic Systems - Dr. Geoffrey MalafskyPhasic Systems - Dr. Geoffrey Malafsky
Phasic Systems - Dr. Geoffrey Malafsky
 
The Dirty Work -- Why Data Must Be Reconciled
The Dirty Work -- Why Data Must Be ReconciledThe Dirty Work -- Why Data Must Be Reconciled
The Dirty Work -- Why Data Must Be Reconciled
 
Systems analysis and design (abe)
Systems analysis and design (abe)Systems analysis and design (abe)
Systems analysis and design (abe)
 
Automatic Data Reconciliation, Data Quality, and Data Observability.pdf
Automatic Data Reconciliation, Data Quality, and Data Observability.pdfAutomatic Data Reconciliation, Data Quality, and Data Observability.pdf
Automatic Data Reconciliation, Data Quality, and Data Observability.pdf
 
IT Ready - DW: 1st Day
IT Ready - DW: 1st Day IT Ready - DW: 1st Day
IT Ready - DW: 1st Day
 
DATA WAREHOUSE.pptx
DATA WAREHOUSE.pptxDATA WAREHOUSE.pptx
DATA WAREHOUSE.pptx
 
Data Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AIData Science Operationalization: The Journey of Enterprise AI
Data Science Operationalization: The Journey of Enterprise AI
 
OLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure SynapseOLAP on the Cloud with Azure Databricks and Azure Synapse
OLAP on the Cloud with Azure Databricks and Azure Synapse
 
Msbi by quontra us
Msbi by quontra usMsbi by quontra us
Msbi by quontra us
 
Henninger_MakingReferenceDataMoreMeaningful-Final
Henninger_MakingReferenceDataMoreMeaningful-FinalHenninger_MakingReferenceDataMoreMeaningful-Final
Henninger_MakingReferenceDataMoreMeaningful-Final
 
Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...
Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...
Product Keynote: Denodo 8.0 - A Logical Data Fabric for the Intelligent Enter...
 
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat...
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data Analytics
 
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
 
Why Big Data is Really about Small Data
Why Big Data is Really about Small DataWhy Big Data is Really about Small Data
Why Big Data is Really about Small Data
 
Data Intelligence Overview
Data Intelligence OverviewData Intelligence Overview
Data Intelligence Overview
 
Data summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data opsData summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data ops
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Open Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise ITOpen Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise IT
 

Tech Lab Series - Episode II - Back to Normal

  • 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  • 2. Episode 2: Back to Normal Tech Lab Webcast | September 24, 2014 Sponsored by
  • 3. What Is the Tech Lab?
      - Real-world proving ground for enterprise software
      - Designed to showcase the process of creating solutions
      - Completely independent of sponsor influence
      - Run by Master Scientist, Dr. Geoffrey Malafsky
      - Projects span 3-6 months
  • 4. What Is Data Normalization?
      - Data Normalization is a process by which disparate data sets, terms, models and ontologies can be reconciled for the purpose of providing certifiably accurate enterprise data.
  • 5. Why Is Normalization Necessary?
      - Disparate Data Systems
      - Disparate File Structures
      - Disparate Data Models
      - Variable Business Logic
      - Conflicting Data Values
      - Serious Semantic Issues
  • 6. How Hadoop Can Help
      - Robust platform for data persistence
      - Relatively easy to connect to enterprise apps
      - Enables ‘future-proofing’ by avoiding lock-in
      - Growing array of parallel processing functions
      - New standard for data management
      - No need to delete data, enabling roll-back
  • 8. Thank you! FIND THE ARCHIVE AT InsideAnalysis.com
  • 9. DATA SCIENCE AND HADOOP TO NORMALIZE CORPORATE DATA
  • 10. u Normalizing data is more sophisEcated than what is commonly done in integraEon u It combines subject maaer knowledge, governance, business rules, and raw data. u Small Data is “corporate structured data that is the fuel of its main ac2vi2es, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applica2ons, reports, and Business Intelligence.”
  • 11. The State of Corporate Data multiple instances of source data multiple definitions for reporting multiple copies of data variable structures hidden conflicts in data definiEons different data values which source to use different model types & standards more storage , esp. when mulEplied by envinroments more data flows to develop and maintain more than 100 DW or data marts downstream different methods for ETL complex dependencies, difficult for impact assessment conflicEng business logic & views global analyses & aggregaEons restricted by inconsistencies Copyright PSIKORS InsEtute 2013 11
  • 13. Data Normalization Showcase
      - FPDS is an open source of Federal Procurement data that has poor quality and consistency.
          - Approx. 10M+ records, each with 306 columns = 25 GB raw text
          - Structured data except for some free text fields
      - We are normalizing it for analysis of IT expenditures for a real client
      - Queries are used by analysts supported by Hadoop environment via Data Normalization platform
  • 14. Normalization Begins with Understanding Data
      - Databases are supposed to have official information on formal acquisition of IT assets.
          - Contracts DB not aligned with Procurement DB
              - Example: FA330012Dxxx in one but not other
      - Differing data sets and values
          - FA330012F0005: Same in both
          - FA330012P0020: Contracts DB: 10 items; FPDS: 1 item; same description, same total dollars
          - HQ042312*: Contracts 6 = $278.4K, FPDS 1 = $48K
              - $48K is one of 6 records in Contracts
      Copyright PSIKORS Institute 2014
  • 15.
  • 16. Converting supposedly same primary keys into normalized values that can be compared: contract number
      Key business logic as buried in a database stored procedure (condensed):
      - If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
      - If (x1='0') v_modification_number = '0' else v_modification_number = x2
          - where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
          - where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
          - where x2: if (x4=NULL) x2 = '0' else x2 = x4
          - where x4: x4 = LTRIM(x5)
          - where x5: x5 = x1
          - essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD and sets = '0' if these are NULL
      - If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
          - where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
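The condensed stored-procedure logic on slide 16 can be sketched in Python as follows. This is only an illustration of the rules as listed, not the Tech Lab's actual code: the function and type names are hypothetical, the sample contract numbers in the usage note are made up, and treating an all-blank modification value as missing is an assumption (the slide only says `x2 = '0'` when `x4` is NULL).

```python
from typing import NamedTuple, Optional


class ContractKey(NamedTuple):
    """Normalized key fields named after the v_* variables on the slide."""
    piid: Optional[str]
    modification_number: str
    idv_piid: Optional[str]


def normalize_contract_key(
    contract: Optional[str],
    delivery_order: Optional[str],
    aco_mod: Optional[str],
    pco_mod: Optional[str],
    ref_proc_instrument: Optional[str],
) -> ContractKey:
    # v_piid: prefer the delivery order, fall back to the contract number.
    piid = contract if delivery_order is None else delivery_order

    # x3 / x1: try ACO_MOD first, then PCO_MOD, then '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod

    # x5 = x1; x4 = LTRIM(x5); x2 = '0' when x4 is missing.
    # Assumption: an empty string after trimming counts as missing.
    x4 = x1.lstrip(" ")
    x2 = x4 if x4 else "0"
    modification_number = "0" if x1 == "0" else x2

    # y1: REF_PROC_INSTRUMENT with all '-' characters removed.
    y1 = None if ref_proc_instrument is None else ref_proc_instrument.replace("-", "")
    idv_piid = y1 if delivery_order is None else contract

    return ContractKey(piid, modification_number, idv_piid)
```

For example, `normalize_contract_key("FA330012D0001", None, None, None, "GS-35F-0001")` yields a key whose `piid` is the contract number, whose modification number is `'0'`, and whose `idv_piid` is the referenced instrument with the hyphens stripped.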
  • 17. SQL Queries via Hue: Impala
  • 18. SQL Queries via Hue: Hive
  • 19. Querying Impala From Data Normalization System
  • 20. Simplifying Queries and Tying to Authoritative Management
  • 21. Storing Term Rules in Master Codes. Note wildcard character (*) in middle as well as front and back
  • 22. Complicated Queries are Often Needed
      Looking for a combination of keywords with wildcards along with structured values
      SELECT recordid,contractingagencyid,contractingagencyname,orgcode,orgid,modificationnumber,piid,piidagencyid,solicitationid,effectivedate,fiscalyear,fundingagencyid,fundingagencyname,typeofcontract,consolidatedcontractdesc,descofreq,naicscode,naicsdesc,productorservicecode,productorservicedesc,globaldunsnumber,dunsnumber,globalvendorname,vendorname,datesigned,referencedidvpiid,referencedidvagencyid,referencedidvmodnumber,contractingdepartmentid,contractingdepartmentname,contractingofficeid,contractingofficename,contractingofficeregion,funcdimenddate,funcdimstartdate,function1,function1value,function2,function2value,function3,function3value,majorcommandcode,majorcommandid,majorcommandname,parentmacomcode,primarydimensionid,primarydimensionvalueid,secondarydimensionid,secondarydimensionvalueid,subcommand1code,subcommand1id,subcommand1name,subcommand2code,subcommand2id,subcommand2name,subcommand3code,subcommand3id,subcommand3name,subcommand4code,subcommand4id,subcommand4name,tertiarydimensionid,tertiarydimensionvalueid,transactionnumber,lastdatetoorder,completiondate,estultimatecompletiondate,signeddate,fundingofficeid,fundingofficename,isfundedforeignentitycode,isfundedforeignentitydesc,reasoninteragencycontracting,feeforuseofservice,fixed,lowervalue,maximumorderlimit,orderingprocedure,uppervalue,websiteurl,whocanuse,feepaidforuseofidv,programacronym,typeofidc,a76actioncode,a76actiondesc,contingencyhumanitarianpeaceop,contractfinancing,costacctstdclausecode,costacctstdclausedesc,costorpricingdata,emailaddress,gfegfpcode,gfegfpdesc,inherentlygovernmentaldesc,inherentlygovernmentalfunction,lettercontractundefactioncode,lettercontractundefactiondesc,majorprogram,multipleorsingleawardidv,multiyearcontractcode,multiyearcontractdesc,nationalinterestaction,
        nationalinterestdesc,numberofactions,performancebasedserviceacqcode,performancebasedserviceacqdesc,purchasecardpaymethodcode,purchasecardpaymethoddesc,seatransportation,subcontractplan,treasuryacctsymbolagencyid,treasuryacctsymbolinitiative,treasuryacctsymbolmaincode,treasuryacctsymbolsubcode,clingercohenactcode,clingercohenactdesc,davisbaconactcode,davisbaconactdesc,economyact,interagencycontractingauthcode,interagencycontractingauthdesc,otherstatutoryauthdesc,servicecontractactdesc,servicecontractactcode,walshhealeyactcode,walshhealeyactdesc,bundledreqs,claimantprogramcode,consolidatedcontractcode,domesticorforeignentitycode,domesticorforeignentitydesc,infotechcommercialitemcategory,recoveredmaterialssustain,recoveredmaterialssustaindesc,systemequipmentcode,useofepadesignatedproducts,congrdistrictplaceofperf,placeofperfzipcode,princplaceofperfcityname,princplaceofperfcountrycode,princplaceofperfcountryname,princplaceofperfcountycode,princplaceofperfcountyname,princplaceofperflocationcode,princplaceofperfstatecode,countryprodserviceorigincode,placeofmanufacture,placeofmanufacturedesc,alternativeadvertising,commercialitemacqperoccode,commercialitemacqperocdesc,commercialitemtestprogram,commercialitemtestprogramdesc,evaluatedpreference,extentcompeted,fairopportunitylimitedsources,fedbizoppscode,fedbizoppsdesc,localareasetasidecode,localareasetasidedesc,numberofoffersreceived,otherthanfullopencompetition,preawardyosynopsis,priceevaluationpercentdiff,sbaorofppsynopsiswaiverpilot,sbirsar,smallbuscompdemoprog,solicitationperoc,typeofsetaside,awardoridvtype,createdvia,lastmodifiedby,lastmodifieddate,part8orpart13,preparedby,prepareddate,reasonformodificationcode,reasonformodificationdesc,congrdistrictcontractor,contractorname,doingbusasname,samexception,street,street2,vendorcity,vendorcountry,vendorphonenumber,vendorstate,zip,is1862landgrantcollege,is1890landgrantcollege,is1994landgrantcollege,isairportauth,isalaskannativecorpownedfirm,isalaskannativeservicinginst,
        isamericanindianowned,isasianpacificamericanowned,isblackamericanowned,isbothcontractsandgrants,iscity,iscommdevelopedcorpownedfirm,iscommdevelopmentcorp,iscontracts,iscorporateentitynottaxexempt,iscorporateentitytaxexempt,iscouncilofgovernments,iscountryofincorporation,iscounty,isdomesticshelter,isdotcertdisbusent,iseducationalinst,isemergingsmallbus,isfederalagency,isfedfundedresanddevcorp,isforprofitorg,isforeigngovernment,isforeignownedandlocated,isfoundation,isgrants,ishispanicamericanowned,ishispanicservicinginst,isvendorhbcu,ishospital,ishousingauthpublictribal,isindiantribe,isintermunicipal,isinternationalorg,isinterstateentity,islaborsurplusareafirm,islimitedliabilitycorp,islocalgovernmentowned,ismanufacturerofgoods,isminorityinsts,isminorityownedbus,ismunicipality,isnativeamericanowned,isnativehawaiianorgownedfirm,isnativehawaiianservicinginst,isnonprofitorg,isotherminorityowned,isothernotforprofitorg,ispartnershipllp,isplanningcommission,isportauth,isprivateuniversityorcollege,issbacert8ajointventure,issbacert8aprogparticipant,issbacerthubzonefirm,issbacertsmalldisbus,isschooldistrict,isschoolofforestry,isselfcertifedsmalldisbus,isservicedisabledvetownedbus,issmallagriculturalcooperative,issoleproprietorship,isstatecontrinsthigherlearn,isstateofincorporation,issubchapterscorp,issubcontasianindianamerowned,istheabilityoneprog,istownship,istransitauth,istribalcollege,istriballyowned,isusfederalgovernment,isusgovernmententity,isuslocalgovernment,isusstategovernment,isveteranownedbus,isveterinarycollege,isveterinaryhospital,iswomanownedbus,istypeecondiswosb,istypejventecondiswosb,istypejventwosb,istypewosb,contractingo{ussizeselection,reasonnotawardedtosmallbus,reasonnotawardedtosmalldisbus,idvbundledreqs,idvcontractingagencyid,idvcontractingagencyname,idvcontractingo{ussizesel,idvdepartmentid,idvdepartmentname,
        idvmajorprogcode,idvmultipleorsingleawardidv,idvnaicscode,idvnaicsdesc,idvpart8orpart13,idvprogacronym,idvreferencedidvagencycode,idvreferencedidvpiid,idvsubcontractplan,idvsubcontractplandesc,idvtypeofcontractpricing,idvtypeofcontractpricingdesc,idvtypeofidc,idvtypeofidcdesc,idvwhocanuse,idvwhocanusedesc,missing301,currentcontractvalue,actionobligation,ultimatecontractvalue
      FROM fpdsrawrecords.records
      WHERE ( ( ( LOWER(fundingagencyid) = '97as' ) ) AND ( ( LOWER(fiscalyear) = '2013' ) ) AND ( ( LOWER(productorservicecode) LIKE '70%' OR LOWER(productorservicecode) LIKE 'd3%' ) ) )
      LIMIT 1000
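Slides 21 and 22 describe term rules stored with a `*` wildcard and a query that matches them with `LIKE`. A minimal sketch of how such terms might be turned into the kind of predicate shown in the `WHERE` clause follows. The function names are hypothetical and the Data Normalization platform's real translation is not shown in the slides. The column name is assumed to come from a trusted list and is not validated here.

```python
from typing import Iterable


def term_to_like_pattern(term: str) -> str:
    """Translate a '*' wildcard term into a lowercase SQL LIKE pattern."""
    # Assumption: stored terms never contain the LIKE metacharacters
    # themselves, so no escaping scheme is needed.
    if "%" in term or "_" in term:
        raise ValueError(f"term contains a LIKE metacharacter: {term!r}")
    return term.lower().replace("*", "%")


def sql_string_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def like_predicate(column: str, terms: Iterable[str]) -> str:
    """Build an OR-ed, case-insensitive LIKE predicate for one column."""
    parts = [
        f"LOWER({column}) LIKE {sql_string_literal(term_to_like_pattern(t))}"
        for t in terms
    ]
    return "( " + " OR ".join(parts) + " )"
```

Calling `like_predicate("productorservicecode", ["70*", "D3*"])` reproduces the product/service code condition from the query on slide 22. A term such as `*data*center*` becomes `%data%center%`, which covers the wildcard-in-the-middle case noted on slide 21.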
  • 23. Query Timing
      - Looking for combinations of text tokens (with wildcards) to known field values
      - Queries are done both in Data Normalization platform and by command line interface on Hadoop server for Impala and Hive. Time differences are negligible but all times reported here are by CLI
          - Tables made for: text, Parquet, Parquet partitioned by ‘fiscalyear’ (6 values) and ‘fundingagencyid’ (approx. 25 values)
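One way to collect CLI timings like those described on slide 23 is a small wall-clock harness around the `impala-shell` and `hive` clients. This is a sketch under assumptions, not the procedure actually used in the Tech Lab: the host name is a placeholder, and the measured time includes client startup (noticeable for the Hive CLI), so it is only a rough comparison.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def impala_command(query: str, host: str = "localhost") -> List[str]:
    # -i selects the impalad to connect to, -B prints plain delimited
    # output, -q runs a single query and exits.
    return ["impala-shell", "-i", host, "-B", "-q", query]


def hive_command(query: str) -> List[str]:
    # -e runs a single query string and exits.
    return ["hive", "-e", query]


def time_command(argv: Sequence[str], runner: Callable = subprocess.run) -> float:
    """Return wall-clock seconds for one CLI invocation.

    The runner is injectable so the harness can be exercised without a
    cluster; by default it shells out with subprocess.run.
    """
    start = time.perf_counter()
    runner(list(argv), check=True, capture_output=True)
    return time.perf_counter() - start
```

Running the same query text against the text, Parquet, and partitioned Parquet tables, several times each, would give numbers comparable in spirit to the charts on the following slides.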
  • 24. [Chart] FPDS Hadoop Query Times, Text Field (secs). Series: Hive, Impala, SQLServer. Formats: Text, Parquet, Parquet Partitioned. Y-axis: 0 to 400. Evaluating query performance in Hadoop relative to format and comparing to RDBMS
  • 25. [Chart] FPDS TEXT QUERIES PER LIMIT (SECS). Series: Hive Text, Impala Text, Hive Parquet, Impala Parquet, Hive Parquet Part, Impala Parquet Part. Groups: 100 LIMIT, 1000 LIMIT, NO LIMIT. Y-axis: 0 to 250
  • 26. Justin Erickson | Director, Product Management, Cloudera
      QUERY PERFORMANCE IMPROVEMENT WITH IMPALA
  • 27. Impala’s Benefits
      - Unlocks BI/analytics on Hadoop
          - Interactive SQL in seconds
          - Highly concurrent to handle 100s of users
      - Native Hadoop flexibility
          - No data migration, conversion, or duplication required
          - Query existing Hadoop data
          - Run multiple frameworks on the same data at the same time
          - Supports Parquet for best-of-breed columnar performance
      - Native MPP query engine designed into Hadoop:
          - Unified Hadoop storage
          - Unified Hadoop metadata (uses Hive and HCatalog)
          - Unified Hadoop security
          - Fine-grained role-based access controls with Sentry
      - Apache-licensed open source
      - Deployed across customers today
      ©2014 Cloudera, Inc. All Rights Reserved.
  • 28. Impala Architecture
      - MPP query engine built natively into Hadoop
      [Diagram labels: SQL App, ODBC, SQL request; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
  • 29. Impala’s Multi-User over 9.5x Faster
  • 30. Multi-user hardware utilization
  • 31. Performance Takeaways
      - Impala’s advantage expands with just 10 users to >9.5x nearest competitor
          - Predominantly attributable to CPU efficiency
      - Does not particularly matter which DAG is run for Hive
          - Shark (with Spark) and Tez produce very similar results
          - Both incrementally faster batch processing but not comparable to MPP databases
          - Difference is Spark is already proven with broad community and vendor adoption
      - Mid-term trends will further favor Impala’s design approach
          - More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
          - CPU efficiency will increase in importance
          - Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
          - The Intel joint roadmap helps support these opportunities
      - Upcoming benchmarks on the latest releases show this gap widening
  • 33.
  • 34. Capture Business Rules and Make Visible, Changeable, and Useful
  • 35.
  • 36. Custom Multi-Use Normalization Methods Ready for Hadoop Parallel Execution
  • 37. Data Normalization Library Enables Rapid Build, Deploy, Change Cycles
  • 38. Special Programming for Hadoop
      - Which Hadoop libraries? Intertwined so reference all.
      - Otherwise: not much
          - HDFS filesystem
          - YARN containers
  • 39.
  • 40.
  • 41. Parallel Jobs
      - Three ways to run parallel jobs
          - Launch multiple Java sessions from command line
              - Same as in Windows, Linux
          - Use Cloudera Hue Job Designer
              - Easy and has management web pages
          - Data Normalization system
              - Coordinates governance, architecture, data models, codes, business rules
              - Define, submit YARN containers specifying Java jar, dictionaries, source files
  • 42. Key Code Analysis
      - Invoice data sets extracted with correlation
          - CAGE: 984274, DUNS: 973437
      - FPDS DUNS and Names extracted & correlated
          - 158181 unique DUNS codes
      - Will be included in normalized composite IT Asset records
      - Composite records for lookup added to Hadoop
          - By DUNS or Global DUNS: get all related DUNS, CAGE, names
          - By CAGE: get all related DUNS, names
          - By name: get all related DUNS, CAGE, names
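The first option on slide 41, launching several Java sessions from the command line, can be scripted. The sketch below assumes a hypothetical jar that takes a dictionary file and a source file as positional arguments. The slides do not give the real command line, so the jar name and argument order are placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def build_java_command(jar: str, dictionary: str, source_file: str) -> List[str]:
    # Hypothetical argument layout: jar, then dictionary, then one source file.
    return ["java", "-jar", jar, dictionary, source_file]


def run_in_parallel(
    commands: Sequence[Sequence[str]],
    max_workers: int = 4,
    runner: Callable = subprocess.run,
) -> List[int]:
    """Run each command in its own process and return the exit codes.

    Threads are enough here because each worker just waits on a child
    process. Results come back in the same order as the commands.
    """
    def run_one(argv: Sequence[str]) -> int:
        return runner(list(argv)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

A caller would build one command per source file and pass the list to `run_in_parallel`; any non-zero exit code in the result marks a session that needs attention. Submitting YARN containers, the third option on the slide, would replace the local `java` launch with a cluster submission and is not shown here.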
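The composite lookup records on slide 42 amount to an index that can be entered by DUNS, Global DUNS, CAGE, or name. A minimal in-memory sketch follows. The record layout and function names are assumptions for illustration, and the slides do not describe how the real records are stored in Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Set, Tuple


class VendorRecord(NamedTuple):
    duns: str
    global_duns: Optional[str]
    cage: Optional[str]
    name: str


Index = Dict[Tuple[str, str], List[VendorRecord]]


def build_lookup(records: Iterable[VendorRecord]) -> Index:
    """Index every record under its DUNS, Global DUNS, CAGE, and name."""
    index: Index = defaultdict(list)
    for rec in records:
        keys = (
            ("duns", rec.duns),
            ("duns", rec.global_duns),  # Global DUNS shares the DUNS entry point
            ("cage", rec.cage),
            ("name", rec.name.lower()),
        )
        for kind, key in keys:
            if key:
                index[(kind, key)].append(rec)
    return index


def related(index: Index, kind: str, key: str) -> Dict[str, Set[str]]:
    """Return all DUNS, CAGE codes, and names related to one key."""
    lookup_key = key.lower() if kind == "name" else key
    recs = index.get((kind, lookup_key), [])
    return {
        "duns": {r.duns for r in recs} | {r.global_duns for r in recs if r.global_duns},
        "cage": {r.cage for r in recs if r.cage},
        "names": {r.name for r in recs},
    }
```

Looking up a Global DUNS with `related(index, "duns", ...)` returns every child DUNS, CAGE code, and name recorded under it, which matches the "By DUNS or Global DUNS" entry on the slide.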
  • 43. [Chart] Number CAGE Per DUNS Code. Y-axis: Number DUNS Codes With X CAGE Codes (log scale, 0.1 to 1,000,000). X-axis: CAGE codes per DUNS, 1 to 119. Annotation: One DUNS code has 119 CAGE
  • 44. [Chart] CAGE Codes from LookUp File, ToWAWF (Millions). Series: Found, NotFound. Y-axis: 0 to 1.4
  • 45. [Charts] FPDS Number DUNS with N Global DUNS (log scale; N from 0 to 5). FPDS: Number DUNS with N Names (log scale; N from 1 to 112). Annotation: 6849 instances for code = 123456787
  • 46. [Charts] FPDS: Number Global DUNS with N DUNS (axes: Number Global DUNS, log scale; Number DUNS, 0 to 250). FPDS: Global DUNS with Multiple Names (axes: Number Global DUNS, log scale; Number Names, 0 to 1400)
  • 47. [Chart] FPDS DUNS Code Matches to WAWF Codes. Categories: DUNS, GlobalDUNS. Series: Found, NotFound. Data labels: 140827, 13302, 17363, 942. Y-axis: 0 to 180000
  • 48. FPDS DUNS With Most Names

      DUNS       NGlobalDUNS  Nnames
      123456787  0            6849
      136666505  0            112
      790238851  0            96
      103933453  1            35
      103385519  1            33
      005149120  1            27
      067641597  1            25
      005103494  0            24
      332619535  0            24
      020751082  1            22
      054781240  1            22
      621599893  1            21
      790238638  0            21
      834476079  1            21

      DUNS       Name
      123456787  miscellaneous foreign contractors
      123456787  etisalat c/o us consulate general dubai
      123456787  boswedden house
      123456787  turner engine controls b. v.
      123456787  swissport hellas cargo s a
      123456787  orbit couriers sa
      123456787  goldair aviation handling s.a.
      123456787  federal egov iae initiative generic duns
      123456787  federal egov iae initiative - generic duns
      123456787  miscellaneous foreign contractorsan
      123456787  prc-desoto
      123456787  inversiones sochagota e.u.
      123456787  comcel
      123456787  transporte y servicio lucio
      123456787  jesse james members only maxi taxi svc
      123456787  club naval de oficiales
      123456787  inchcape shipping services
      123456787  dr. thalia abatzi
      123456787  central asia development group
      123456787  bennett-fouch and associates
      123456787  noor al-sabah company
      123456787  ait/arc infrasture solutions
      123456787  not available
      123456787  77 construction company
      136666505  adese genc petrol
      136666505  amy lily chung
      136666505  anderson erin ruth
      136666505  andrew william knef
      136666505  anduaga-arias laura
      136666505  angelica m. de la cruz
      136666505  anthony o'brien, 330531-5100194
      136666505  batac belle
      136666505  bottesini beth ms.
      136666505  bouck shannon
      136666505  bunn amy b.
      136666505  carlene clark
      136666505  cho, boong haeng
      136666505  choe, sun young
      136666505  christina michajlyszyn
      136666505  christopher cannon
      136666505  christopher l. booth
      136666505  chun, kil mo
      136666505  conflict + transition consultancies
      136666505  cozzone elaine
      136666505  deborah p. carney
      136666505  denihan patricia joann
      136666505  dong sook mcgeorge, 690525-2716816
      136666505  dorene d.lukewalton,pharm d.
      136666505  dr. terry a. klein
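The "Nnames" column on slide 48 is a count of distinct vendor names per DUNS. A small sketch of that count and of the "number of DUNS with N names" distribution charted on slide 45 is given below. The function names are hypothetical, and folding case and surrounding whitespace before counting is an assumption, since the slides do not say how names were compared.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def names_per_duns(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct names per DUNS from (duns, name) pairs."""
    names = defaultdict(set)
    for duns, name in pairs:
        # Assumption: names differing only in case or padding are the same.
        names[duns].add(name.strip().lower())
    return {duns: len(found) for duns, found in names.items()}


def top_duns(counts: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Return the n DUNS with the most names, ties broken by DUNS code."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def duns_with_n_names(counts: Dict[str, int]) -> Counter:
    """Histogram: how many DUNS codes have exactly N distinct names."""
    return Counter(counts.values())
```

With a handful of the pairs from the slide, for instance `("123456787", "boswedden house")`, `("123456787", "comcel")`, and `("123456787", "orbit couriers sa")`, the generic code 123456787 rises to the top of `top_duns`. On the full data set, a code with thousands of names is the kind of outlier that prompted the policy discussion summarized on slide 50.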
  • 49. FPDS Global DUNS with Most Names & DUNS

      Ranked by number of names:
      GlobalDUNS  NDUNS  Nnames
      877936518   12     27299
      624770475   212    21866
      148095086   80     21754
      027079776   2      17128
      103933453   86     17075
      026157235   4      15694
      963737366   106    15200
      134303192   19     14481
      067641597   108    13998
      064680213   102    13809
      077652761   93     12914
      002204600   15     12570
      039860122   44     12382
      805258373   130    11995

      Ranked by number of DUNS:
      GlobalDUNS  NDUNS  Nnames
      624770475   212    21866
      805258373   130    11995
      012003349   128    9748
      877987347   127    8253
      057272486   124    6935
      007250079   123    9076
      071767334   123    9474
      158140041   117    6671
      019710586   116    8163
      091441089   116    7813
      616924770   116    7217
      067641597   108    13998
  • 50. Prompted Collaboration and New Business Information
      - Showing these results prompted discussions leading to:
          - There are generic DUNS heavily used but these are being removed from use via policy changes
          - System validation rules are not current with all policy
          - Additional “rules” of how to track, audit, align, merge spread by email
              - All put back into Data Normalization system and then into modified Java
      - New results available over all data sets in < 1 day
  • 52. Impala
      Justin Erickson | Director, Product Management
      September 2014
  • 53. Impala Architecture: Query Execution
      - Request arrives via ODBC/JDBC/Hue GUI/Shell
      [Diagram labels: SQL App, ODBC, SQL request; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
  • 54. Impala Architecture: Query Execution
      - Planner turns request into collections of plan fragments
      - Coordinator initiates execution on impalads local to data
      [Diagram labels: SQL App, ODBC; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
  • 55. Impala Architecture: Query Execution
      - Intermediate results are streamed between impalads
      - Query results are streamed back to client
      [Diagram labels: SQL App, ODBC, query results; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
  • 56. Try It Out!
      - 100% Apache-licensed open source
      - Downloads on http://impala.io/:
          - Live online
          - VM
          - Installation
      - Questions/comments?
          - Community: http://impala.io/community
          - Email: impala-user@cloudera.org
  • 57.