Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HDP: Data Governance
We Do Hadoop
Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Enterprise Data Governance Goals
GOAL: Provide a common approach to
data governance across all systems
and data within the organization
• Transparent
Governance standards & protocols must be
clearly defined and available to all
• Reproducible
Recreate the relevant data landscape at a
point in time
• Auditable
All relevant events and assets but be
traceable with appropriate historical lineage
• Consistent
Compliance practices must be consistent
ETL/DQ
BPM
Business
Analytics
Visualization
& Dashboards
ERP
CRM
SCM
MDM
ARCHIVE
Governance
Framework
Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Challenges WITHIN Hadoop
• No comprehensive governance within
the Hadoop stack
• Mostly disjoint as each project defines its own
future and there is no common framework
• Disparate tools, such as HCatalog, Ranger and
Falcon provide pieces of the overall solution
• No integration with external governance
frameworks
• Difficult to get right because each
project is autonomous and you need
insight into traditional IT
ApachePig
ApacheHive
ApacheHBase
ApacheAccumulo
ApacheSolr
ApacheSpark
ApacheStorm
Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Initiative for Hadoop
ETL/DQ
BPM
Business
Analytics
Visualization
& Dashboards
ERP
CRM
SCM
MDM
ARCHIVE
Data Governance Initiative
Common
Governance
Framework
1 ° ° ° ° ° ° °
° ° ° ° ° ° ° °
° °
° °
°
°
ApachePig
ApacheHive
ApacheHBase
ApacheAccumulo
ApacheSolr
ApacheSpark
ApacheStorm
TWO Requirements
1. Hadoop must snap in to
the existing frameworks
and be a good citizen
2. Hadoop must also provide
governance within its own
stack of technologies
A group of companies dedicated to meeting
these requirements in the open
Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview
We Do Hadoop
Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
New Project Proposal: Apache Atlas
Apache Atlas
Proposed open source project
aimed at solving the Hadoop
data governance challenge in
the open.
Key Capabilities
• Data Classification
• Metadata Exchange
• Centralized Auditing
• Search & Lineage (Browse)
• Security & Policy Engine
Apache Atlas
Knowledge Store
Audit Store
ModelsType-System
Policy RulesTaxonomies
Tag Based
Policies
Data Lifecycle
Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA
HL7
Financial
SOX
Dodd-Frank
Energy
PPDM
Retail
PCI
PII
Other
CWM
Essential Timeline
Phase-3
• Collaboration Features
• Self Service
• Steward Delegation
• Profiling & Pattern Analysis
• Visualization
Phase-2
• Advance audit reporting
• Advanced Policy Engine
• Row / Column Masking
• 3rd party Metadata exchange
1H 2015 GA
• Rest API
• Centralized Taxonomy
• Import / export metadata
• Basic Policy Rules Engine
• Real-time access control
• Column Level Tagging
Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview
Data Classification
• Import or define taxonomy business-oriented annotations for data
• Define, annotate, and automate capture of relationships between data sets and underlying
elements including source, target, and derivation processes
• Export metadata to third-party systems
Centralized Auditing
• Capture security access information for every application, process, and interaction with data
• Capture the operational information for execution, steps, and activities
Search & Lineage (Browse)
• Pre-defined navigation paths to explore the data classification and audit information
• Text-based search features locates relevant data and audit event across Data Lake quickly
and accurately
• Browse visualization of data set lineage allowing users to drill-down into operational, security,
and provenance related information
Security & Policy Engine
• Rationalize compliance policy at runtime based on data classification schemes
• Advanced definition of policies for preventing data derivation based on classification (i.e. re-
identification)
Apache Atlas
Knowledge Store
Audit Store
ModelsType-System
Policy RulesTaxonomies
Tag Based
Policies
Data Lifecycle
Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA
HL7
Financial
SOX
Dodd-Frank
Energy
PPDM
Retail
PCI
PII
Other
CWM
Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Governance Ready Certification Program
Curated group of vendor partners to provide
rich & complete features
Customers choose features that they want to
deploy – a la carte.
Low switching costs !
HDP at core to provide stability and
interoperability
Discovery
Tagging
Prep /
Cleanse
ETL
Governance
BPM
Self
Service
Visual-
ization
Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Architecture
Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
High Level Architecture
Type System
Repository
Search DSL
Bridge
Hive Storm OthersSqoop
REST API
Titan / HBase
Solr/Elastic
Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Type System – Overview of Types
• Class
• Struct
• Trait
• Primitives
• Collections
• Map
• Array
• Instances (Entity)
• Referenceable
Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Type System – Data Types
Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Type System - Examples
HierarchicalTypeDefinition<ClassType> tblClsDef =
TypesUtil.createClassTypeDef(TABLE_TYPE, null,
attrDef("name", DataTypes.STRING_TYPE),
attrDef("description", DataTypes.STRING_TYPE),
new AttributeDefinition("db", DATABASE_TYPE,
Multiplicity.REQUIRED, false, null),
new AttributeDefinition("sd", STORAGE_DESC_TYPE,
Multiplicity.REQUIRED, true, null),
attrDef("owner", DataTypes.STRING_TYPE),
attrDef("createTime", DataTypes.INT_TYPE),
attrDef("lastAccessTime", DataTypes.INT_TYPE),
attrDef("retention", DataTypes.INT_TYPE),
attrDef("viewOriginalText", DataTypes.STRING_TYPE),
attrDef("viewExpandedText", DataTypes.STRING_TYPE),
attrDef("tableType", DataTypes.STRING_TYPE),
attrDef("temporary", DataTypes.BOOLEAN_TYPE),
new AttributeDefinition("columns",
DataTypes.arrayTypeName(COLUMN_TYPE),
Multiplicity.COLLECTION, true, null)
);
Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Repository
• Graph Database
• Titan with storage backed by HBase
• Types and instances are mapped to the Graph DB
• Classes, Structs and Traits map to a vertex
• Relationships are mapped as edges
• Search - plugin enabled
• Indexing based on type annotations
• Solr
• Elastic search
Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Search
• DSL with SQL Like Syntax
• from $type is $trait where $clause select|has $attributes loop $loopExpression withPath, repeat
• Examples
• from DB
• DB where name="Reporting" select name, owner
• DB has name
• DB is JdbcAccess
• Column where Column isa PII
• Table where name="sales_fact", columns
• Table where name="sales_fact", columns as column select column.name, column.dataType,
column.comment
• Full-text search
Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Lineage
• Uses Search DSL Loop expression
• Everything results in search
• Named Queries
• inputs
• Table where (name = "sales_fact_monthly_mv") as src loop (LoadProcess->outputTable
inputTables) as dest select src.name as src_name, dest.name as dest_name withPath
• outputs
• Table where (name = "sales_fact") as src loop (LoadProcess->inputTables outputTables) as dest
select src.name as src_name, dest.name as dest_name withPath
• schema
• Table where name="sales_fact", columns
Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive Integration
Apache Atlas
Hive Bridge
(Client)
Hive Hook
(Post-execution)
REST API
Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Prototype Demonstration
Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo – Class Diagram
Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo – Instance Graph
Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Demo
• http://162.212.133.230:21000/dashboard/test1/Search.html
Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Questions and Answers

Data Governance Initiative

  • 1.
    Page1 © HortonworksInc. 2011 – 2014. All Rights Reserved HDP: Data Governance We Do Hadoop
  • 2.
    Page2 © HortonworksInc. 2011 – 2014. All Rights Reserved Enterprise Data Governance Goals GOAL: Provide a common approach to data governance across all systems and data within the organization • Transparent Governance standards & protocols must be clearly defined and available to all • Reproducible Recreate the relevant data landscape at a point in time • Auditable All relevant events and assets but be traceable with appropriate historical lineage • Consistent Compliance practices must be consistent ETL/DQ BPM Business Analytics Visualization & Dashboards ERP CRM SCM MDM ARCHIVE Governance Framework
  • 3.
    Page3 © HortonworksInc. 2011 – 2014. All Rights Reserved Data Governance Challenges WITHIN Hadoop • No comprehensive governance within the Hadoop stack • Mostly disjoint as each project defines its own future and there is no common framework • Disparate tools, such as HCatalog, Ranger and Falcon provide pieces of the overall solution • No integration with external governance frameworks • Difficult to get right because each project is autonomous and you need insight into traditional IT ApachePig ApacheHive ApacheHBase ApacheAccumulo ApacheSolr ApacheSpark ApacheStorm
  • 4.
    Page4 © HortonworksInc. 2011 – 2014. All Rights Reserved Data Governance Initiative for Hadoop ETL/DQ BPM Business Analytics Visualization & Dashboards ERP CRM SCM MDM ARCHIVE Data Governance Initiative Common Governance Framework 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ApachePig ApacheHive ApacheHBase ApacheAccumulo ApacheSolr ApacheSpark ApacheStorm TWO Requirements 1. Hadoop must snap in to the existing frameworks and be a good citizen 2. Hadoop must also provide governance within its own stack of technologies A group of companies dedicated to meeting these requirements in the open
  • 5.
    Page5 © HortonworksInc. 2011 – 2014. All Rights Reserved Apache Atlas Overview We Do Hadoop
  • 6.
    Page6 © HortonworksInc. 2011 – 2014. All Rights Reserved New Project Proposal: Apache Atlas Apache Atlas Proposed open source project aimed at solving the Hadoop data governance challenge in the open. Key Capabilities • Data Classification • Metadata Exchange • Centralized Auditing • Search & Lineage (Browse) • Security & Policy Engine Apache Atlas Knowledge Store Audit Store ModelsType-System Policy RulesTaxonomies Tag Based Policies Data Lifecycle Management Real Time Tag Based Access Control REST API Services Search Lineage Exchange Healthcare HIPAA HL7 Financial SOX Dodd-Frank Energy PPDM Retail PCI PII Other CWM Essential Timeline Phase-3 • Collaboration Features • Self Service • Steward Delegation • Profiling & Pattern Analysis • Visualization Phase-2 • Advance audit reporting • Advanced Policy Engine • Row / Column Masking • 3rd party Metadata exchange 1H 2015 GA • Rest API • Centralized Taxonomy • Import / export metadata • Basic Policy Rules Engine • Real-time access control • Column Level Tagging
  • 7.
    Page7 © HortonworksInc. 2011 – 2014. All Rights Reserved Apache Atlas Capabilities: Overview Data Classification • Import or define taxonomy business-oriented annotations for data • Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes • Export metadata to third-party systems Centralized Auditing • Capture security access information for every application, process, and interaction with data • Capture the operational information for execution, steps, and activities Search & Lineage (Browse) • Pre-defined navigation paths to explore the data classification and audit information • Text-based search features locates relevant data and audit event across Data Lake quickly and accurately • Browse visualization of data set lineage allowing users to drill-down into operational, security, and provenance related information Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes • Advanced definition of policies for preventing data derivation based on classification (i.e. re- identification) Apache Atlas Knowledge Store Audit Store ModelsType-System Policy RulesTaxonomies Tag Based Policies Data Lifecycle Management Real Time Tag Based Access Control REST API Services Search Lineage Exchange Healthcare HIPAA HL7 Financial SOX Dodd-Frank Energy PPDM Retail PCI PII Other CWM
  • 8.
    Page8 © HortonworksInc. 2011 – 2014. All Rights Reserved Governance Ready Certification Program Curated group of vendor partners to provide rich & complete features Customers choose features that they want to deploy – a la carte. Low switching costs ! HDP at core to provide stability and interoperability Discovery Tagging Prep / Cleanse ETL Governance BPM Self Service Visual- ization
  • 9.
    Page9 © HortonworksInc. 2011 – 2014. All Rights Reserved Architecture
  • 10.
    Page10 © HortonworksInc. 2011 – 2014. All Rights Reserved High Level Architecture Type System Repository Search DSL Bridge Hive Storm OthersSqoop REST API Titan / HBase Solr/Elastic
  • 11.
    Page11 © HortonworksInc. 2011 – 2014. All Rights Reserved Type System – Overview of Types • Class • Struct • Trait • Primitives • Collections • Map • Array • Instances (Entity) • Referenceable
  • 12.
    Page12 © HortonworksInc. 2011 – 2014. All Rights Reserved Type System – Data Types
  • 13.
    Page13 © HortonworksInc. 2011 – 2014. All Rights Reserved Type System - Examples HierarchicalTypeDefinition<ClassType> tblClsDef = TypesUtil.createClassTypeDef(TABLE_TYPE, null, attrDef("name", DataTypes.STRING_TYPE), attrDef("description", DataTypes.STRING_TYPE), new AttributeDefinition("db", DATABASE_TYPE, Multiplicity.REQUIRED, false, null), new AttributeDefinition("sd", STORAGE_DESC_TYPE, Multiplicity.REQUIRED, true, null), attrDef("owner", DataTypes.STRING_TYPE), attrDef("createTime", DataTypes.INT_TYPE), attrDef("lastAccessTime", DataTypes.INT_TYPE), attrDef("retention", DataTypes.INT_TYPE), attrDef("viewOriginalText", DataTypes.STRING_TYPE), attrDef("viewExpandedText", DataTypes.STRING_TYPE), attrDef("tableType", DataTypes.STRING_TYPE), attrDef("temporary", DataTypes.BOOLEAN_TYPE), new AttributeDefinition("columns", DataTypes.arrayTypeName(COLUMN_TYPE), Multiplicity.COLLECTION, true, null) );
  • 14.
    Page14 © HortonworksInc. 2011 – 2014. All Rights Reserved Repository • Graph Database • Titan with storage backed by HBase • Types and instances are mapped to the Graph DB • Classes, Structs and Traits map to a vertex • Relationships are mapped as edges • Search - plugin enabled • Indexing based on type annotations • Solr • Elastic search
  • 15.
    Page15 © HortonworksInc. 2011 – 2014. All Rights Reserved Search • DSL with SQL Like Syntax • from $type is $trait where $clause select|has $attributes loop $loopExpression withPath, repeat • Examples • from DB • DB where name="Reporting" select name, owner • DB has name • DB is JdbcAccess • Column where Column isa PII • Table where name="sales_fact", columns • Table where name="sales_fact", columns as column select column.name, column.dataType, column.comment • Full-text search
  • 16.
    Page16 © HortonworksInc. 2011 – 2014. All Rights Reserved Lineage • Uses Search DSL Loop expression • Everything results in search • Named Queries • inputs • Table where (name = "sales_fact_monthly_mv") as src loop (LoadProcess->outputTable inputTables) as dest select src.name as src_name, dest.name as dest_name withPath • outputs • Table where (name = "sales_fact") as src loop (LoadProcess->inputTables outputTables) as dest select src.name as src_name, dest.name as dest_name withPath • schema • Table where name="sales_fact", columns
  • 17.
    Page17 © HortonworksInc. 2011 – 2014. All Rights Reserved Hive Integration Apache Atlas Hive Bridge (Client) Hive Hook (Post-execution) REST API
  • 18.
    Page18 © HortonworksInc. 2011 – 2014. All Rights Reserved Prototype Demonstration
  • 19.
    Page19 © HortonworksInc. 2011 – 2014. All Rights Reserved Demo – Class Diagram
  • 20.
    Page20 © HortonworksInc. 2011 – 2014. All Rights Reserved Demo – Instance Graph
  • 21.
    Page21 © HortonworksInc. 2011 – 2014. All Rights Reserved Demo • http://162.212.133.230:21000/dashboard/test1/Search.html
  • 22.
    Page22 © HortonworksInc. 2011 – 2014. All Rights Reserved Questions and Answers

Editor's Notes

  • #3 A data governance framework in any organization comprises a combination of people, process and technology that are in place to establish decision rights and accountabilities for information. A governance policy defines who can take what actions with what information, and when, under what circumstances, using what methods.   The technology goals for a data governance framework are to provide a platform for a common approach across all systems and data within the organization, Explicitly they need to be: - Transparent: Governance standards & protocols must be clearly defined and available to all - Reproducible: Recreate the relevant data landscape at a point in time - Auditable: All relevant events and assets but be traceable with appropriate historical lineage - Consistent: Compliance practices must be consistent    
  • #4 One of the most important and challenging goals is consistency and when we inspect implementation of data governance for a Hadoop implementation this consistency seems almost impossible.   While there are constructs within the various Hadoop related projects to aid governance, there is no single, consistent approach to implementation/architecture of this important framework. Disparate tools, such as HCatalog, Ranger and Falcon provide pieces of the overall solution, but are not holistic across the stack in approach.   Further, with Hadoop gaining traction in the enterprise, it has only recently been asked to integrate with existing frameworks such as operations, security and now governance. It is now a foundational concern.    
  • #5 Consistent with our approach, Hortonworks recently led the development Data Governance Initiative which brings together industry experts from some of the largest enterprise, traditional vendors and Hadoop experts to address governance within Hadoop.   The members have already defined a comprehensive solution for data governance in Hadoop that will provide the base functions of metadata services, deep audit store and an advanced policy rules engine, but also and more importantly, the solution will interoperate with and extend existing third-party data governance and management tools by shedding light on the access of data within Hadoop.   This approach allows organizations to apply comprehensive governance policy across their data.   In order to provide these functions for Hadoop, much work will need to be accomplished throughout all of the Hadoop projects. The initiative is unique in that it is not a spot solution, but will build on top of existing projects such as Apache Falcon for data lifecycle management and Apache Ranger for global security policies. The founding members of this initiative include, Aetna, Target, Merck, SAS and Hortonworks and many more are lined up to contribute. This is a broad community based initiative set to deliver the right features for the right requirements and is perfectly representative of the Hortonworks open community approach.  
  • #7 Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance intiative and others from the Hadoop community. The core functionality defined by the project includes the following: Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed. Security and Policy Engine – implement engines to protect and rationalize data access and according to compliance policy.
  • #8 Roll tax into data disco