Implementing the Business Catalog
in the Modern Enterprise:
Bridging Traditional EDW and
Hadoop with Apache Atlas
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under development,
may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from
inception to release through Apache; however, technical feasibility, market demand, user feedback and
the overarching Apache Software Foundation community development process can all affect timing
and final delivery.
This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in any
generally available product.
Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not
rely upon it when making purchasing decisions.
Speakers
Andrew Ahn
Governance Director
Product Management
Agenda
• Atlas Overview
• Near term roadmap
• Business Catalog
• Questions
Apache Atlas Overview
Vision: Enterprise Data Governance Across Platforms
(Diagram: metadata spanning structured and unstructured data across traditional RDBMS, MPP appliances, and the data lake, with multiple projects sharing a common metadata layer.)
GOAL: Provide a common approach to data
governance across all systems and data within the
enterprise
Transparent
Governance standards and protocols must be clearly
defined and available to all
Reproducible
Recreate the relevant data landscape at a point in time
Auditable
All relevant events and assets must be traceable with
appropriate historical lineage
Consistent
Compliance practices must be consistent
Ready for Trusted Governance
(Diagram: YARN as the data operating system over shared storage, coordinating batch, interactive, streaming, search, and machine-learning workloads under common operations, security, and governance.)
Data Management
along the entire data lifecycle with integrated
provenance and lineage capability
Modeling with Metadata
enables comprehensive data lineage through
a hybrid approach with enhanced tagging
and attribute capabilities
Interoperable Solutions
across the Hadoop ecosystem, through a
common metadata store
DGI* Community becomes Apache Atlas
• Dec 2014: DGI group kickoff
• Feb 2015: Prototype built
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3 foundation GA release
First kickoff to GA in 7 months
Global Financial Company
* DGI: Data Governance Initiative
Faster & Safer: co-development driven by customer use cases
Apache Atlas: Metadata Services
• Cross-component dataset lineage; centralized
location for all metadata inside HDP
• Single interface point for metadata exchange
with platforms outside of HDP
• Business taxonomy-based classification:
conceptual, logical, and technical
Apache Atlas
Hive
Ranger
Falcon
Sqoop
Storm
Kafka
Spark
NiFi
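These metadata services are exposed through Atlas's REST API, including a DSL search endpoint. As a rough sketch of building such a request (the host, port, and `/discovery/search/dsl` path follow the Atlas 0.x API but should be verified against your deployment):

```python
import urllib.parse

# Hypothetical Atlas endpoint; adjust host/port to your deployment.
ATLAS_BASE = "http://atlas-host:21000/api/atlas"

def dsl_search_url(query):
    """Build the URL for an Atlas DSL search (Atlas 0.x-style REST API)."""
    return ATLAS_BASE + "/discovery/search/dsl?query=" + urllib.parse.quote(query)

# Example: find all Hive tables whose name starts with "customer"
url = dsl_search_url("hive_table where name like 'customer*'")
print(url)
```

A GET against the resulting URL (with an HTTP client of your choice) would return matching entities as JSON.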
Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant data lakes.
Many enterprises have siloed data and metadata stores that collide in the data lake. This is
compounded by very large retention windows (years). Can traditional EDW tools
manage 100 million entities effectively, with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata is the only viable solution.
This allows quick integration with automation and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship, attribute based
security and self-service.
Apache Atlas High Level Architecture
(Diagram) Components: Type System; Repository (Graph DB, Search); REST API; Search DSL; Messaging Framework (Kafka); Bridges and Connectors (Hive, Storm, Falcon, Sqoop, others)
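The messaging framework publishes entity-change notifications over Kafka. A minimal sketch of consuming one such message follows; the exact JSON schema varies by Atlas version, so the field names below are assumptions for illustration:

```python
import json

# Illustrative notification payload; treat the field names as assumptions
# and check them against your Atlas version's notification module.
raw = json.dumps({
    "version": {"version": "1.0.0"},
    "message": {
        "operationType": "ENTITY_CREATE",
        "entity": {"typeName": "hive_table",
                   "values": {"name": "sales.customers"}},
    },
})

def summarize(msg_json):
    """Pull the operation and entity type out of one notification message."""
    msg = json.loads(msg_json)["message"]
    return msg["operationType"], msg["entity"]["typeName"]

op, type_name = summarize(raw)
print(op, type_name)  # ENTITY_CREATE hive_table
```

In practice `raw` would come from a Kafka consumer subscribed to the notification topic rather than being constructed inline.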
Technical and Logical Metadata Exchange
(Diagram: the Atlas knowledge store exchanges structured and unstructured metadata through the REST API with XML/JSON files, 3rd-party vendors, and custom reporters on non-Hadoop systems.)
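A custom reporter for a non-Hadoop system would register assets by posting entity JSON to the REST API. As a sketch of building such a payload (the type name `rdbms_table` and the typed-struct shape are assumptions modeled on the Atlas 0.x entity format, not a guaranteed schema):

```python
import json

def make_entity(type_name, name, description, attrs=None):
    """Sketch of an entity-create payload for the Atlas REST API.
    The shape mimics the typed-struct JSON older Atlas versions expect;
    verify field names against your version's API docs."""
    values = {"name": name, "description": description}
    values.update(attrs or {})
    return {"typeName": type_name, "values": values}

# Hypothetical EDW asset being reported from outside Hadoop
entity = make_entity("rdbms_table", "DW.DIM_CUSTOMER",
                     "Customer dimension sourced from Teradata",
                     {"owner": "edw_team"})
payload = json.dumps(entity)
```

The resulting `payload` would be POSTed to the entities endpoint by the reporter process.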
Near Term Roadmap:
Summer 2016
Expanded Native Connectors: Dataset Lineage
(Diagram: Sqoop with the Teradata connector, Apache Kafka, RDBMS sources, and a custom activity reporter feeding dataset lineage into the metadata repository.)
Dynamic Access Policy Driven by Metadata
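Tag-based access control means policies attach to Atlas tags rather than to individual tables, so classifying an asset automatically changes who can read it. A toy model of the decision logic (the tag and role names here are invented for illustration; in HDP this evaluation is done by Ranger, not hand-rolled code):

```python
# Policies keyed on tags rather than on individual assets.
POLICIES = {
    "PII":    {"allowed_roles": {"steward", "compliance"}},
    "PUBLIC": {"allowed_roles": {"steward", "compliance", "analyst"}},
}

def can_read(user_roles, asset_tags):
    """Grant access only if every tag on the asset admits one of the
    user's roles; untagged assets are denied by default here."""
    if not asset_tags:
        return False
    return all(
        POLICIES.get(tag, {"allowed_roles": set()})["allowed_roles"] & set(user_roles)
        for tag in asset_tags
    )

print(can_read({"analyst"}, {"PUBLIC"}))         # True
print(can_read({"analyst"}, {"PUBLIC", "PII"}))  # False
```

Note the design choice: adding a `PII` tag to an already-public dataset immediately revokes analyst access, with no per-table policy edits.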
Business Taxonomy UX Prototype
We conduct open-ended user interviews so that we can learn more
about who our users are and what their needs are. This helps us
validate whether or not we're solving the right problem.
User Interviews
We test our prototype in InVision, a click-through prototyping tool
that allows users to interact with static mockups.
Usability Testing
After conducting interviews and usability testing, we spend some time
analyzing our findings and pulling out themes and insights.
Synthesis + Analysis
Usability Findings
• Understood the hierarchy and how to search for data
• Would generally search by file name or specific keyword
• Would use tags for the purpose of searching
• Would want to preview a subset of the data before analyzing the
whole data set
• Interested in the size of the data set
• Concerned with how current and updated the information is
• Would like the ability to contact a steward for more information
regarding the data set
• Would use an advanced Boolean search if it were available
• Viewing the popularity and access frequency would provide
confidence
• Would like to provide and view fellow users' input
Persona Findings
• Data Scientists typically have backgrounds in Mathematics, Computer
Science and Statistics
• Responsible for analyzing and transforming data into more useful
structures
• Responsible for correcting missing values, typos and parsing issues
• Typically fluent with SQL, Python and Hadoop tools
• Require time upfront to understand and discover new data sets
• Spend a significant amount of time reaching out to others with questions
about data sets
• Interact with Subject Matter Experts and Solution Architects
• Noted that compliance is a big interest for enterprises and government
• Felt Hadoop doesn’t support security and compliance very well
• Find it difficult to see who is doing what in Hadoop
Principal Roles
• Data Steward – Curator, responsible for catalog veracity
• Data Scientist – Analyst, primary consumer of Business Catalog
• Administrator – Role management only
• Data Engineer – Data ingress and egress, semantic data quality
UX Prototype: Taxonomy Navigation
Breadcrumbs for
taxonomy context path
Contents at
taxonomy context
Taxonomy Creation
In-place taxonomy
management
Taxonomy Classification of Assets
Create new object
on the fly
Object Details
Annotation for
policies and rules
Object Lineage
Dataset Lineage
across components
Assign Tags
to assets
User Comments
User comments for
collaboration
Classify and Tag Assets
Keyword, DSL, and
Faceted search
Define authoritative tags
for the whole
taxonomy
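The three search modes on this slide (keyword, DSL, and faceted) can be illustrated with a toy in-memory catalog; the entry shape and sample assets below are invented, and a real deployment would issue these queries against the Atlas search API instead:

```python
# Tiny in-memory stand-in for a business catalog (invented sample data).
CATALOG = [
    {"name": "sales.customers", "tags": {"PII", "Retail"}, "type": "hive_table"},
    {"name": "sales.orders",    "tags": {"Retail"},        "type": "hive_table"},
    {"name": "clicks_topic",    "tags": {"Clickstream"},   "type": "kafka_topic"},
]

def keyword_search(term):
    """Keyword mode: simple substring match on asset names."""
    return [e for e in CATALOG if term in e["name"]]

def facet_search(tag=None, type_name=None):
    """Faceted mode: filter by tag and/or asset type, combined with AND."""
    return [e for e in CATALOG
            if (tag is None or tag in e["tags"])
            and (type_name is None or e["type"] == type_name)]

print([e["name"] for e in keyword_search("sales")])
print([e["name"] for e in facet_search(tag="Retail", type_name="hive_table")])
```

DSL mode would express the same facet filter as a typed query string, e.g. `hive_table where name like 'sales*'`.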
• Hierarchical Taxonomy Creation
• Agile modeling: Model Conceptual, Logical, Physical assets
• Authorization: Steward / Analytic Roles
• Tag management: Definition and assignment
• DQ tab for profiling and sampling
• User Comments
Business Taxonomy UX Prototype
What other
information would you
like to see included?
Availability:
- Tech Preview VMs: May 2016
- GA Release: Summer 2016
Questions?
Reference
Online Resources
VM: https://s3.amazonaws.com/demo-drops.hortonworks.com/HDP-Atlas-Ranger-TP.ova -> Download Public Preview VM
Tutorial: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
Blog: http://hwxjojo.wpengine.com/blog/the-next-generation-of-hadoop-based-security-data-governance/ (currently returning an error)
Learn More: http://hortonworks.com/solutions/atlas-ranger-integration/


Editor's Notes

  • #2 TALK TRACK Data is powering successful clinical care and successful operations. [NEXT SLIDE]
  • #7 6
  • #8 TALK TRACK Open Enterprise Hadoop enables trusted governance, with: Data lifecycle management along the entire lifecycle Modeling with metadata, and Interoperable solutions that can access a common metadata store. [NEXT SLIDE] SUPPORTING DETAIL Trusted Governance Why this matters to our customers: As data accumulates in an HDP cluster, the enterprise needs governance policies to control how that data is ingested, transformed and eventually retired. This keeps those Big Data assets from turning into big liabilities that you can’t control. Proof point: HDP includes 100% open source Apache Atlas and Apache Falcon for centralized data governance coordinated by YARN. These data governance engines provide those mature data management and metadata modeling capabilities, and they are constantly strengthened by members of the Data Governance Initiative. The Data Governance Initiative (DGI) is working to develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. The DGI coalition includes Hortonworks partner SAS and customers Merck, Target, Aetna and Schlumberger. Together, we assure that Hadoop: Snaps into existing frameworks to openly exchange metadata Addresses enterprise data governance requirements within its own stack of technologies Citation: “As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard.” | http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/
  • #9 How fast ? 7 months !
  • #12 Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the Data Governance Initiative and others from the Hadoop community. The core functionality defined by the project includes the following: Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources. Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop. Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed. Security and Policy Engine – implement engines to protect and rationalize data access according to compliance policy.
  • #13 Show – clearly identify customer metadata. Change: add a customer classification example (Aetna) to give the use case story continuity. Use DX procedures for diagnosis. ** Bring metadata from external systems into Hadoop – keep it together.
  • #15 Show – clearly identify customer metadata. Change: add a customer classification example (Aetna) to give the use case story continuity. Use DX procedures for diagnosis. ** Bring metadata from external systems into Hadoop – keep it together.
  • #18 - Learn about who our users are and what their needs are, to validate that we are solving the right problem. Open-ended half-hour discussions about processes, challenges and current tools. We record the interviews so that we can focus on the conversation and analyze them afterward.
  • #19 - Test our prototype in InVision, a click-through prototyping tool. Walk users through scenarios and watch how they respond. Remind our participants that we aren't testing them, we're testing the design, and encourage thinking aloud.
  • #20 - Re-watch recordings and capture verbatim quotes on stickies. Affinity mapping: group feedback into categories and look for trends and insights. For this project we translated our stickies into Trello to share with the team remotely. We've starred the stickies that represented common themes and valuable insights.
  • #21 Was the product well understood? Is the product something they would use? Where is the value?
  • #22 Findings we believe we are solving for
  • #23 Was the product well understood? Is the product something they would use? Where is the value?
  • #37 Which vendors would you be interested in?