Hadoop in Validated Environment
Data Governance Initiative
Martin Ryzl
Director, Analytics Platform
Ivo Lasek
Architect, Analytics Platform
Research, Manufacturing, Marketing
Search, Data Integration, Data Analytics, Open Data
90 Days
Laboratory Information Management
SAP
Enterprise Asset Management
Manufacturing Execution Systems
Data Analytics
Data Integration
Who is the dataset owner?
How can I get access?
What does the data mean?
How can I reproduce the results?
Where is the data I need?
6 Months
Where Is the Data I Need?
Data Lake
Merge
Clean
Security and Data Governance
Data Catalog
Source: http://www.data.gov/
Who Is the Dataset Owner?
Entitlements
Dataset Owner
Dataset User
How Can I Get Access?
Entitlements
Dataset Owner
Entitlement Steward
Dataset User
What Does the Data Mean?
Semantic meaning – Metastore
id, name, ssn, birth_no, phone
id, personal_number, employee_number, first_name, division
Metastore
Dataset Owner
Data Steward
Entitlement Steward
Dataset User
How Can I Reproduce the Data?
Reproducibility
Raw Data (delta1, delta2, delta3) -> Aggregate v1.0 -> Reporting Data (delta1..delta3)
Raw Data (delta1, delta2, delta3, delta4, delta5, delta6, delta7) -> Aggregate v1.0 -> Reporting Data (delta1..delta7)
Traceability
Raw Data (delta1, delta2, delta3) -> Clean v1.0 -> Cleaned Data (delta1..delta3) -> Aggregate v1.0 -> Reporting Data (delta1..delta3)
Raw Data (delta1, delta2, delta3, delta4, delta5, delta6, delta7) -> Clean v1.1 -> Cleaned Data (delta1..delta7) -> Aggregate v1.0 -> Reporting Data (delta1..delta7)
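One way to read the Traceability slide is that every derived dataset is identified by the raw delta range it was built from plus the exact version of each processing step. The following is a minimal sketch of that idea only; the `Step` class, the `lineage_record` function, and the hashed `id` field are hypothetical names chosen for illustration and are not part of Falcon, the Hive Metastore, or the presenters' platform.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """A processing step and its version (hypothetical helper type)."""
    name: str      # e.g. "Clean" or "Aggregate"
    version: str   # e.g. "v1.0"


def lineage_record(first_delta, last_delta, steps):
    """Describe a derived dataset by its raw input range and step versions."""
    record = {
        "input": f"delta{first_delta}..delta{last_delta}",
        "steps": [f"{s.name} {s.version}" for s in steps],
    }
    # Hash the description so identical inputs and step versions get the same id.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["id"] = hashlib.sha256(payload).hexdigest()[:12]
    return record


# The two reporting datasets from the slide, plus a rerun of the first one.
first_report = lineage_record(1, 3, [Step("Clean", "v1.0"), Step("Aggregate", "v1.0")])
second_report = lineage_record(1, 7, [Step("Clean", "v1.1"), Step("Aggregate", "v1.0")])
rerun = lineage_record(1, 3, [Step("Clean", "v1.0"), Step("Aggregate", "v1.0")])

print(first_report["id"] == rerun["id"])          # same inputs and versions
print(first_report["id"] == second_report["id"])  # different range and Clean version
```

Under this assumption, a report can be reproduced only if both the delta range and the step versions are retained, which is why the record keeps them together.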
Data Lineage
Access Logs
Process Logs
2015-06-04 12:53:31,601 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1702)) - Get metadata for subqueries
17865102 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for destination tables
2015-06-04 12:53:31,601 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1726)) - Get metadata for destination tables
17865345 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed getting MetaData in Semantic Analysis
2015-06-04 12:53:31,844 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(10004)) - Completed getting MetaData in Semantic Analysis
17865347 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Not invoking CBO because the statement has too few joins
2015-06-04 12:53:31,846 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:canHandleAstForCbo(10258)) - Not invoking CBO because the statement has too few joins
Heart beat
17866695 [main] ERROR org.apache.hadoop.hive.ql.Driver - FAILED: SemanticException [Error 10044]: Line 2:18 Cannot insert into target table because column number/types are different ''2015-06-04-07-50'': Table insclause-0 has 165 columns, but query has 166 columns.
org.apache.hadoop.hive.ql.parse.SemanticException: Line 2:18 Cannot insert into target table because column number/types are different ''2015-06-04-07-50'': Table insclause-0 has 165 columns, but query has 166 columns.
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genConversionSelectOperator(SemanticAnalyzer.java:6535)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:6336)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:8977)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8868)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9713)
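Process logs like the excerpt above are only useful for lineage if the relevant events can be pulled out of them. As a rough sketch, the column-count mismatch reported in the log could be extracted with a regular expression; the function name `find_column_mismatches` and the output shape are assumptions made for illustration, and the pattern covers only the message wording seen in this excerpt.

```python
import re

# Matches the tail of the Hive "column number/types are different" message
# exactly as worded in the log excerpt; other Hive versions may word it differently.
MISMATCH = re.compile(
    r"Table (?P<table>\S+) has (?P<expected>\d+) columns, "
    r"but query has (?P<actual>\d+) columns"
)


def find_column_mismatches(log_text):
    """Return one dict per distinct column-count mismatch found in log_text."""
    seen = set()
    results = []
    for match in MISMATCH.finditer(log_text):
        key = (match["table"], int(match["expected"]), int(match["actual"]))
        if key not in seen:  # the excerpt logs the same failure more than once
            seen.add(key)
            results.append({"table": key[0], "expected": key[1], "actual": key[2]})
    return results


# Sample line copied from the log excerpt.
sample = (
    "17866695 [main] ERROR org.apache.hadoop.hive.ql.Driver - FAILED: "
    "SemanticException [Error 10044]: Line 2:18 Cannot insert into target table "
    "because column number/types are different ''2015-06-04-07-50'': "
    "Table insclause-0 has 165 columns, but query has 166 columns."
)
mismatches = find_column_mismatches(sample)
print(mismatches)
```

Deduplicating on table and counts keeps a single record per failure even though the error appears in both the driver line and the exception line.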
Who is the dataset owner?
How can I get access?
What does the data mean?
How can I reproduce the results?
Where is the data I need?
HDFS/Hive Metastore Ranger
Metastore Falcon
Contacts
• Martin Ryzl (martin.ryzl@merck.com)
• Ivo Lasek (ivo.lasek@merck.com)
• http://www.merck.com/
• http://www.msdit.cz/
