Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Published in: Technology

  1. Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise. June 28, 2016. Apache Atlas. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  2. Disclaimer: This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed. Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery. This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product. Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
  3. Atlas Data Governance: Organizations need data governance to understand their information and to answer questions such as: • What do we know about our information? • Where did this data come from and who can use it? • Does this data adhere to company policies and rules?
  4. Vision - Enterprise Data Governance Across Platforms. [Diagram labels: STRUCTURED, UNSTRUCTURED, TRADITIONAL RDBMS, MPP APPLIANCES, DATA LAKE, METADATA, Project 1, Project 3, Project 4, Project 5, Project 6] Atlas: Metadata Truth in Hadoop. • Data Management along the entire data lifecycle with integrated provenance and lineage capability • Modeling with Metadata enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities • Interoperable Solutions across the Hadoop ecosystem, through a common metadata store
  5. Apache Atlas Overview
  6. Atlas Data Governance: Data governance practices provide a holistic approach to managing, improving and leveraging information to help you gain insight and build confidence in business decisions and operations. Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.
  7. Atlas timeline: from DGI to present. • Dec 2014: DGI group kickoff • May 2015: Apache Atlas incubation • July 2015: HDP 2.3, Foundation GA release (first kickoff to GA in 7 months) • Jan 2016: HDP 2.4, Kafka/Storm, Sqoop, Falcon, tag-based security • Summer 2016: HDP 2.5, Business Catalog, AD integration, versioning. [Diagram label: Global Financial Company] * DGI: Data Governance Initiative. Key Benefits: • Co-Dev = Built for real customer use cases • Faster & Safer = Customers know business + HWX knows Hadoop
  8. Big Data Management Through Metadata. Management Scalability: Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to have very large windows (years). Can traditional EDW tools manage 100 million entities effectively with room to grow? Metadata Tools: Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels. Tags for Management, Discovery and Security: Proper metadata is the foundation for business taxonomy, stewardship, attribute-based security and self-service. Key Benefits: Modern data lakes need new ways to govern because: • Cost – Traditional staff ratio to data size not possible • Diversity – Only way to manage velocity of new datasets • Agility – Quick change based on tags / taxonomy
  9. High Level Architecture: 4 Key points. [Diagram labels: Type System; Repository; Search DSL; Bridge; Hive; Storm; Falcon; Custom; REST API; Graph DB; Search; Kafka; Sqoop; Connectors; Messaging Framework] (1) Data Lineage: Only product that captures lineage across Hadoop components at platform level. (2) Agile Data Modeling: Type system allows custom metadata structures in a hierarchical taxonomy. (3) REST API: Modern, flexible access to Atlas services, HDP components, UI & external tools. (4) Exchange: Leverage existing metadata / models by importing them from current tools. Export metadata to downstream systems.
  10. Governance Ready Certification Program. [Diagram labels: Discovery; Tagging; Prep / Cleanse; ETL; Governance; BPM; Self Service; Visualization] • Choice: Customers choose features that they want to deploy, a la carte versus vendor lock-in • Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy • Agile: Low switching costs, faster deployment and innovation • Centralized: Common SLA & common open metadata store • Flexibility: Interoperability of products through Atlas metadata • Safe: HDP at core to provide stability and interoperability
  11. Governance Ready Certification Program. Completed: • Waterline • Dataguise • Attivo. Next: • SAP ILM, VORA • IBM IGC. Work in progress: • Collibra • Alation • Meta Integration (Miti) • Paxata • Syncsort • Trifacta
  12. Near Term Roadmap: Summer 2016
  13. Summer 2016 Release Summary • Dynamic Access Policies • Cross component lineage • Enterprise Readiness • Business Catalog. [Slide callout, shown three times: Differentiator]
  14. Dynamic Access Policy: Apache Ranger + Atlas Integration
  15. Summary of Dynamic Access Policies • Basic Tag policy – PII example. Permission mapped to a reusable tag, not a resource • Geo-based policy – Policy based on IP address mappings. Rule enforcement is dynamically geo-aware • Time-based policy – Timer for data access for resource management, compliance reporting • Prohibitions – Prevention of toxic combinations of Hive tables or columns that may pose a risk together. Key Benefits: • New scalable metadata-based security paradigm • Dynamic, real-time policy • Automatically updates to changes in metadata • Centralized and simple to manage policy
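
The time-based and prohibition policies summarized on slide 15 can be illustrated with a small, self-contained Python sketch. This is not Ranger's or Atlas's API: `TagPolicy`, `is_allowed`, `violates_prohibition`, the group name and the column names are all hypothetical, chosen only to show the idea of attaching conditions to a tag rather than to a resource.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical tag policy: which group may read assets carrying a tag."""
    tag: str
    group: str
    expires_at: Optional[datetime] = None  # assumed shape of a time-based condition


def is_allowed(policy: TagPolicy, user_groups: Set[str],
               asset_tags: Set[str], now: datetime) -> bool:
    """Allow only if the asset carries the tag, the user is in the group,
    and the policy has not expired."""
    if policy.tag not in asset_tags:
        return False
    if policy.group not in user_groups:
        return False
    if policy.expires_at is not None and now >= policy.expires_at:
        return False
    return True


def violates_prohibition(requested_columns: Set[str],
                         toxic_combination: FrozenSet[str]) -> bool:
    """A prohibition fires when one request touches every column
    in a combination that is considered risky together."""
    return toxic_combination <= requested_columns
```

For example, a policy built as `TagPolicy("PII", "hr-analysts", datetime(2016, 12, 31))` would admit an `hr-analysts` member in July 2016 and reject the same request in January 2017, without any per-table rule being edited.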
  16. How does Atlas work with Ranger at scale? Atlas provides: Metadata • Business Classification (taxonomy): Company > HR > Driver • Hierarchy with inheritance of attributes to child objects: the sensitive “PII” tag of department HR will be inherited by group HR > Driver • Atlas will notify Ranger of changes via a Kafka topic. [Diagram labels: Apache Atlas; Hive; Ranger; Falcon; Kafka; Storm] Atlas provides the metadata tag to create policies. Ranger provides: Access & Entitlements • Ranger will cache tags and asset mapping for performance • Ranger will have a policy based on tags instead of roles • Example: PII = <group>. This can work for many assets.
  17. Scalable Access Control – Reusable Tag Policy. User group: • AD • Linux. Resources: • Files • Tables • Topologies. Atlas Tag: • PII. ANY asset PII: • Files • Tables • Topologies. [Diagram labels: Single Admin Group Assigns; Many Stewards Tag] + Single point of enforcement and audit. + All future tagging is covered by existing policy
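
Slides 16 and 17 describe two mechanisms: a tag placed on a taxonomy node is inherited by its children, and one policy written against the tag covers every asset that carries it. A minimal Python sketch of both follows. The dictionaries, the `hr-stewards` group and the rule that untagged assets are open are assumptions made for illustration; none of this is the Atlas or Ranger data model.

```python
from typing import Dict, Optional, Set

# Hypothetical taxonomy: child term -> parent term (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "Company": None,
    "Company/HR": "Company",
    "Company/HR/Driver": "Company/HR",
    "Company/Finance": "Company",
}

# Tags attached directly to a taxonomy term.
DIRECT_TAGS: Dict[str, Set[str]] = {"Company/HR": {"PII"}}

# One reusable policy per tag: tag -> groups allowed to access tagged assets.
TAG_POLICY: Dict[str, Set[str]] = {"PII": {"hr-stewards"}}


def effective_tags(term: str) -> Set[str]:
    """Collect tags on the term and on every ancestor (inheritance)."""
    tags: Set[str] = set()
    current: Optional[str] = term
    while current is not None:
        tags |= DIRECT_TAGS.get(current, set())
        current = PARENT[current]
    return tags


def can_access(user_groups: Set[str], asset_term: str) -> bool:
    """Assumption: untagged assets are open; a tagged asset requires the
    user to belong to a group allowed for every tag it carries."""
    for tag in effective_tags(asset_term):
        if not (user_groups & TAG_POLICY.get(tag, set())):
            return False
    return True
```

Because the policy is keyed by tag, tagging a new term under `Company/HR` later would be covered by the existing `PII` entry with no new policy, which is the point the slide makes about future tagging.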
  18. Automatic update of policies – active protection. [Diagram labels: Metastore (• Tags • Assets • Entities); Notification Framework; Kafka Topics; Atlas; Atlas Client (• Subscribes to Topic • Gets Metadata Updates); PDP; Resource Cache; Ranger; Notification; Metadata updates] • Message durability • Optimized for speed • Event-driven updates
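
The event-driven update path on slide 18 (metadata changes published to a topic, a subscriber keeping an enforcement-side cache current) can be sketched in plain Python. An in-memory `queue.Queue` stands in for the Kafka topic here, and the message shape and function names are hypothetical; a real deployment would use a Kafka consumer and whatever message format Atlas actually emits.

```python
import queue
from typing import Dict, Set

# Stand-in for a Kafka topic (assumption: a real system would use a Kafka client).
topic: "queue.Queue[dict]" = queue.Queue()

# Enforcement-side cache: asset name -> tags currently attached.
tag_cache: Dict[str, Set[str]] = {}


def publish(asset: str, tag: str, op: str) -> None:
    """Metadata side: emit a change event (hypothetical message shape)."""
    topic.put({"asset": asset, "tag": tag, "op": op})


def drain_updates() -> int:
    """Enforcement side: apply all pending events to the local cache
    and return how many were applied."""
    applied = 0
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            return applied
        tags = tag_cache.setdefault(event["asset"], set())
        if event["op"] == "add":
            tags.add(event["tag"])
        elif event["op"] == "remove":
            tags.discard(event["tag"])
        applied += 1
```

Policy checks then read only from `tag_cache`, so enforcement does not call back to the metadata store on each request; the cache is only as fresh as the last drained event.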
  19. Hadoop Cross Component Data Lineage
  20. Apache Atlas Component Integration • Cross-component dataset lineage. Centralized location for all metadata inside HDP • Single interface point for metadata exchange with platforms outside of HDP. [Diagram labels: Apache Atlas; Hive; Ranger; Falcon; Sqoop; Storm; Kafka; Spark; NiFi; HBase; HDP 2.3; HDP 2.5; Beyond HDP 2.5]
  21. Summary of Data Lineage. Users in the upcoming release of HDP 2.5 will be able to track lineage across the following components using Atlas: • Sqoop – Import from and export to relational databases, and any additional package that leverages Sqoop. ATLAS-184, SQOOP-2609 • Hive – Dataset lineage with entity versioning (including schema changes). ATLAS-75, ATLAS-183, ATLAS-492 • Kafka/Storm – IoT event-level processing, such as syslogs or sensor data. ATLAS-181, ATLAS-183, STORM-1381 • Falcon – Data lifecycle at Feed and Process entity level for replication and repeating workflows. Tracks periodicity, throttling, eviction. ATLAS-69, FALCON-1570. Key Benefits: • Enterprises need open solutions, not a single app vendor • More native connectors than anyone else, with more coming • Hardened metadata infrastructure
  22. Expanded Native Connector: Dataset Lineage. [Diagram labels: Sqoop; Teradata Connector; Apache Kafka; Custom Activity Reporter; Metadata Repository; RDBMS] • Any process using Sqoop is covered • No other tool tracks IoT out of the box
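
Cross-component lineage, as described on slides 19 to 22, amounts to a directed graph of datasets linked by the processes that derive one from another. The sketch below walks such a graph to list everything downstream of a source. The dataset names and the `DOWNSTREAM` mapping are invented for illustration and do not reflect how Atlas stores or queries lineage.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage edges: dataset -> datasets derived directly from it.
DOWNSTREAM: Dict[str, List[str]] = {
    "rdbms.orders": ["hive.raw_orders"],        # e.g. a Sqoop import
    "hive.raw_orders": ["hive.clean_orders"],   # e.g. a Hive query
    "hive.clean_orders": ["hive.daily_report", "kafka.order_events"],
}


def impacted_by(source: str) -> List[str]:
    """Breadth-first walk returning every dataset downstream of `source`,
    in the order it is first reached."""
    seen = {source}
    order: List[str] = []
    pending = deque([source])
    while pending:
        node = pending.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                pending.append(child)
    return order
```

Calling `impacted_by("rdbms.orders")` returns the four derived datasets in breadth-first order, which is the kind of question an impact analysis asks before a source schema changes.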
  23. Summer 2016 Release Summary • Dynamic Access Policies • Cross component lineage • Enterprise Readiness • Business Catalog. [Slide callout, shown three times: Differentiator]
  24. Enterprise Readiness
  25. Security/Enterprise Readiness • Highly reliable and scalable components • Authorization with AD via Ranger • Rolling upgrade support HDP 2.5+ • BC & DR capabilities • 5x performance improvement over the previous version
  26. Enterprise Readiness: Scalable and Highly Reliable Components. [Diagram labels: Solr Cloud Kafka Quorum; Type System; Repository; Search DSL; Bridge; Hive; Storm; Falcon; Custom; REST API; Graph DB; Search; Kafka; Sqoop; Connectors; Messaging Framework; HBase]
  27. Summer 2016 Release Summary • Dynamic Access Policies • Cross component lineage • Enterprise Readiness • Business Catalog. [Slide callout, shown three times: Differentiator]
  28. Business Taxonomy (Catalog)
  29. Key Concepts. Business Taxonomy (Catalog): The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication. Data Lineage (Provenance): Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. Tags: Traits vs. Labels vs. Business Taxonomy: Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag PII can be used in HR as well as Finance or Sales. Benefits: • A view of data assets organized by business language • Impact analysis, compliance, acceptable use • Common tag through Hadoop components
  30. Taxonomies Benefits: • Search / Discovery – Business catalog of conceptual, logical and physical assets • Security – Dynamic metadata-based access control
  31. Research: Focused on Hadoop. We conduct open-ended user interviews so that we can learn more about who our users are and what their needs are. This helps us validate whether or not we’re solving the right problem.
  32. Usability Testing. We test our prototype in InVision, a click-through prototyping tool that allows users to interact with static mockups.
  33. Principal Roles & Activities • Data Steward – Curator, responsible for catalog veracity • Data Scientist – Analyst, primary consumer of Business Catalog • Administrator – Role management only • Data Engineer – Data ingress and egress, semantic data quality. [Slide callouts: 50% - 80%+ of time spent looking for data; Profit Center; Primary User of Atlas; Enables Scientist] Goal: < 25% of time spent on finding data = Empowering scientists to spend their time uncovering insights faster
  34. Atlas Value • Designed for Hadoop at platform, not application level • High-confidence data in Hadoop for regulated verticals • Compliance and business objectives aligned to data organization • Faster discovery for analysts – reduce time to value • Agile and adaptable – keeps information current through native connectors • Dynamic protection with Ranger in simple audited policies
  35. Additional Atlas Sessions • Extend Governance in Hadoop with the Atlas Ecosystem: integrations with partners Waterline, Trifacta and Attivo: Thursday 4:10 PM @ Room 210A • BOF: Apache Knox and Apache Ranger provide Hadoop security while Atlas provides a Hadoop metadata store and enterprise compliance. Come learn and discuss security & governance innovations and future directions. Thursday 5-7 PM @ Room 210A
  36. Learn More: • Hortonworks links: http://hortonworks.com/solutions/security-and-governance/ • Tutorials: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
