Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas provides centralized metadata services and cross-component dataset lineage tracking for the Hadoop stack. It aims to enable transparent, reproducible, auditable, and consistent data governance across structured data, unstructured data, and traditional database systems. The near-term roadmap includes dynamic, metadata-driven access policy and enhanced Hive integration. Apache Atlas also pursues metadata exchange with non-Hadoop systems and third-party vendors through REST APIs and custom reporters.
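For illustration, a minimal Python sketch of such a REST metadata exchange, assuming the v2 REST endpoints shipped in later Atlas releases; the host, credentials, and example qualifiedName are placeholders, not values from this deck:

    import requests

    ATLAS_URL = "http://atlas-host:21000"  # placeholder host and port
    AUTH = ("admin", "admin")              # placeholder credentials

    # Fetch a Hive table's metadata by its unique qualifiedName so that an
    # external (non-Hadoop) catalog can mirror it into its own store.
    resp = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/entity/uniqueAttribute/type/hive_table",
        params={"attr:qualifiedName": "default.customers@cluster1"},
        auth=AUTH,
    )
    resp.raise_for_status()
    entity = resp.json()["entity"]
    print(entity["typeName"], entity["attributes"]["qualifiedName"])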
Introduction to Apache Atlas, enhancing data governance, and tracking dataset lineage across Hadoop components.
Outline of enterprise data governance vision across various systems with focus on transparency, reproducibility, auditability, and consistency.
Timeline of Apache Atlas development from prototype to GA release, showcasing progress and major milestones.
Future enhancements in Apache Atlas, with emphasis on dynamic access policies and maintaining a business catalog.
Tracking and integrating metadata and lineage within Hadoop components like Sqoop, Kafka, and Hive for streamlined governance.
Dataflow governance requirements focusing on data provenance, security, and integration with enterprise-level solutions.
Information about demo sessions, tech previews, and upcoming availability of the product.
Open floor for questions and references/resources for further exploration and information on Apache Atlas.
#8 TALK TRACK
Open Enterprise Hadoop enables trusted governance, with:
Data lifecycle management, from ingestion through retirement,
Modeling with metadata, and
Interoperable solutions that can access a common metadata store.
[NEXT SLIDE]
SUPPORTING DETAIL
Trusted Governance
Why this matters to our customers: As data accumulates in an HDP cluster, the enterprise needs governance policies to control how that data is ingested, transformed and eventually retired. This keeps those Big Data assets from turning into big liabilities that you can’t control.
Proof point: HDP includes 100% open source Apache Atlas and Apache Falcon for centralized data governance coordinated by YARN. These data governance engines provide mature data management and metadata modeling capabilities, and they are constantly strengthened by members of the Data Governance Initiative. The Data Governance Initiative (DGI) is working to develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. The DGI coalition includes Hortonworks partner SAS and customers Merck, Target, Aetna and Schlumberger. Together, we ensure that Hadoop:
Snaps into existing frameworks to openly exchange metadata
Addresses enterprise data governance requirements within its own stack of technologies
Citation: “As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard. | http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/
#12 Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the data governance initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
Data Classification – create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
Centralized Auditing – provide a framework to capture and report on access to and modifications of data within Hadoop
Search & Lineage – allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed (a minimal lineage-query sketch follows this list)
Security and Policy Engine – implement engines that protect data and rationalize data access according to compliance policy
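As referenced above, a minimal lineage-query sketch, again assuming the v2 REST endpoints of later Atlas releases; the host, credentials, and GUID are placeholders:

    import requests

    ATLAS_URL = "http://atlas-host:21000"  # placeholder host and port
    AUTH = ("admin", "admin")              # placeholder credentials

    def get_lineage(guid, direction="BOTH", depth=3):
        """Fetch the upstream/downstream lineage graph for one entity GUID."""
        resp = requests.get(
            f"{ATLAS_URL}/api/atlas/v2/lineage/{guid}",
            params={"direction": direction, "depth": depth},
            auth=AUTH,
        )
        resp.raise_for_status()
        return resp.json()

    # Print each edge in the lineage graph of a (hypothetical) table GUID.
    lineage = get_lineage("hypothetical-guid-1234")
    for edge in lineage.get("relations", []):
        print(edge["fromEntityId"], "->", edge["toEntityId"])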
#17 Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – so the use case story has continuity. Use Dx (diagnosis) procedure codes as the classification.
** Bring metadata from external systems into Hadoop – keep it together.
#18 Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – so the use case story has continuity. Use Dx (diagnosis) procedure codes as the classification (a minimal tagging sketch follows these notes).
** Bring metadata from external systems into Hadoop – keep it together.
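A minimal sketch of that tagging flow, assuming the v2 REST endpoints of later Atlas releases; the DxProcedure name (drawn from the Aetna diagnosis-procedure example), host, credentials, and GUID are hypothetical placeholders:

    import requests

    ATLAS_URL = "http://atlas-host:21000"  # placeholder host and port
    AUTH = ("admin", "admin")              # placeholder credentials

    # 1. Define a classification (tag); "DxProcedure" is a hypothetical name
    #    for data carrying diagnosis (Dx) procedure codes.
    typedefs = {
        "classificationDefs": [{
            "name": "DxProcedure",
            "description": "Data carrying diagnosis (Dx) procedure codes",
            "superTypes": [],
            "attributeDefs": [],
        }]
    }
    requests.post(f"{ATLAS_URL}/api/atlas/v2/types/typedefs",
                  json=typedefs, auth=AUTH).raise_for_status()

    # 2. Attach the tag to an entity (e.g. one imported from an external
    #    system), keeping its metadata together with Hadoop-side metadata.
    guid = "hypothetical-guid-5678"  # placeholder entity GUID
    requests.post(f"{ATLAS_URL}/api/atlas/v2/entity/guid/{guid}/classifications",
                  json=[{"typeName": "DxProcedure"}], auth=AUTH).raise_for_status()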
#19 Specify:
Metrics – time, success/failure, user, etc.
Contrast with the Ranger plug-in, which evaluates policy pre-execution; Atlas auditing captures access to and modifications of data after the fact (an audit-query sketch follows these notes).
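A sketch of pulling those per-entity audit metrics, assuming the v2 entity-audit endpoint of later Atlas releases; the host, credentials, and GUID are placeholders:

    import requests

    ATLAS_URL = "http://atlas-host:21000"  # placeholder host and port
    AUTH = ("admin", "admin")              # placeholder credentials

    def audit_events(guid, count=25):
        """List recent audit events (timestamp, user, action) for an entity."""
        resp = requests.get(
            f"{ATLAS_URL}/api/atlas/v2/entity/{guid}/audit",
            params={"count": count},
            auth=AUTH,
        )
        resp.raise_for_status()
        return resp.json()

    # Report who did what to the entity, and when.
    for event in audit_events("hypothetical-guid-5678"):
        print(event["timestamp"], event["user"], event["action"])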
#23 Show – clearly identify customer metadata. Change:
Add a customer classification example – Aetna – so the use case story has continuity. Use Dx (diagnosis) procedure codes as the classification (a classification-search sketch follows these notes).
** Bring metadata from external systems into Hadoop – keep it together.
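A sketch of finding everything carrying that tag via basic search, assuming the v2 search endpoint of later Atlas releases; DxProcedure and the connection details are the same hypothetical placeholders as above:

    import requests

    ATLAS_URL = "http://atlas-host:21000"  # placeholder host and port
    AUTH = ("admin", "admin")              # placeholder credentials

    # Find every entity tagged DxProcedure, regardless of which source
    # system originally registered its metadata.
    resp = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/search/basic",
        params={"classification": "DxProcedure", "limit": 25},
        auth=AUTH,
    )
    resp.raise_for_status()
    for header in resp.json().get("entities", []):
        print(header["typeName"], header.get("displayText", header["guid"]))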
#32 The Data Governance Framework will enable Freddie Mac to design its Data Index tool from the ground up for scalability, security, and reliability.