Fifth Elephant Apache Atlas Talk

Proposal for the talk on Apache Atlas at Fifth Elephant Conference 2017

  1. Governance using Apache Atlas: Why and How
     Vimal Sharma, Apache Atlas PMC & Committer
     Software Engineer, Hortonworks
     Apache ID: svimal2106@apache.org
  2. Apache Atlas: Project Details
     • Entered Apache incubation in May 2015
     • Organizations: IBM, Hortonworks, Aetna, Merck, Target
     • 3 releases in the last year
     • Graduated to a Top Level Project in June 2017
     Timeline: 0.7 (July 2016), 0.7.1 (Jan 2017), 0.8 (Mar 2017), TLP (June 2017)
  3. Apache Atlas: Introduction
     • Governance and metadata framework for Hadoop
     • Model a component and capture metadata
     • Data Assets: Hive Table, HBase column family
     • Process: Storm Topology, Sqoop Import
     • Classification: Tag metadata entities
     • Built-in support for popular components
     • Extensible architecture
  4. Apache Atlas: Architecture
     • Layers: Integration, Core, Apps
     • Core: Type System, Ingest / Export, Search, Graph Abstraction/Engine (Titan)
     • Stores: Metadata Store <HBase>, Index Store <Solr>
     • Integration: API <HTTP/REST>, Messaging <Kafka>
     • Metadata Sources: Hive, Sqoop, Storm, Custom
     • Apps: UI, Ranger Tag Based Policies, Business Glossary (Roadmap)
  5. Governance Problem (Use Cases)
     • ETL Pipeline Failure Scenarios
       - Upstream failure analysis
       - Alerts to downstream processes
       - Visual lineage of ETL pipelines
     • Redundant Processing
       - Does a derived dataset contain the required information?
       - Can metadata classification be used to determine this?
       - Avoid expensive processing
  6. Use Cases
     • Compliance and Security
       - Impose security constraints on sensitive data
       - Data can span multiple Hadoop components
       - One policy to govern them all
     • Cluster Admin
       - Periodic cleanup of datasets
       - Which are the unused/dormant datasets?
       - How to define the relevance of a dataset?
  7. Cross Component Lineage
     • Lineage: Upstream and downstream Data Assets
     • Individual Components: Own Metadata store
     • Cross Component events
     • Atlas: Flexibility to model arbitrary components
  8. Ranger Integration
     • Ranger: Listener on Tag addition/deletion
     • Attribute based policies rather than asset based policies
     • Example tag shown in the diagram: PII
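The difference between asset based and attribute (tag) based policies can be sketched in a few lines. This is a plain-Python illustration, not Ranger's or Atlas's API: the `Entity` and `TagPolicy` classes, the `is_allowed` helper, the group names, and the entity names are all hypothetical. Only the PII tag comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A metadata entity (hypothetical), e.g. a Hive column or a Kafka topic."""
    qualified_name: str
    tags: set = field(default_factory=set)


@dataclass
class TagPolicy:
    """Hypothetical policy: only `allowed_groups` may read entities tagged `tag`."""
    tag: str
    allowed_groups: set


def is_allowed(entity, user_groups, policies):
    """Deny if any policy on one of the entity's tags excludes all of the user's groups."""
    for policy in policies:
        if policy.tag in entity.tags and not (policy.allowed_groups & user_groups):
            return False
    return True


# One policy on the tag covers every component that carries the tag.
policies = [TagPolicy(tag="PII", allowed_groups={"hr"})]
hive_column = Entity("default.employee.ssn", tags={"PII"})
kafka_topic = Entity("employee_events", tags={"PII"})
hdfs_path = Entity("/data/public/holidays")

print(is_allowed(hive_column, {"analysts"}, policies))
print(is_allowed(kafka_topic, {"hr"}, policies))
print(is_allowed(hdfs_path, {"analysts"}, policies))

# Tagging an entity later brings it under the same policy with no new rule.
hdfs_path.tags.add("PII")
print(is_allowed(hdfs_path, {"analysts"}, policies))
```

The point of the design is that the policy list never names an asset: adding or removing a tag is the only change needed, which is why a listener on tag addition/deletion is enough to keep enforcement current.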
  9. Type System
     • Model of metadata to be stored
     • Every type has
       - Unique Name
       - Attributes
       - SuperTypes
     • Attributes
       - Mandatory/Optional
       - Unique
       - Composite
       - ReverseReference
  10. Atlas Base Types
      • Referenceable: qualifiedName
      • Asset: Name, Owner, Description
      • DataSet
      • Process: Inputs, Outputs
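To make the type-system idea concrete (a unique name, attributes, and supertypes whose attributes are inherited), here is a minimal sketch in plain Python. It does not use Atlas code: `AttributeDef`, `TypeDef`, and `TypeRegistry` are hypothetical names, and the hierarchy used here (Asset extends Referenceable; DataSet and Process extend Asset) is an assumption, since the slide lists the base types without the arrows between them.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDef:
    """One attribute of a type (hypothetical structure)."""
    name: str
    type_name: str
    optional: bool = True
    unique: bool = False


@dataclass
class TypeDef:
    """A type: unique name, own attributes, and names of its supertypes."""
    name: str
    attributes: list = field(default_factory=list)
    super_types: list = field(default_factory=list)


class TypeRegistry:
    def __init__(self):
        self._types = {}

    def register(self, type_def):
        """Add a type, enforcing unique names and already-known supertypes."""
        if type_def.name in self._types:
            raise ValueError(f"type name must be unique: {type_def.name}")
        for parent in type_def.super_types:
            if parent not in self._types:
                raise KeyError(f"unknown supertype: {parent}")
        self._types[type_def.name] = type_def

    def all_attributes(self, type_name):
        """Return inherited attribute names first, then the type's own."""
        type_def = self._types[type_name]
        names = []
        for parent in type_def.super_types:
            for inherited in self.all_attributes(parent):
                if inherited not in names:
                    names.append(inherited)
        for attr in type_def.attributes:
            if attr.name not in names:
                names.append(attr.name)
        return names


registry = TypeRegistry()
registry.register(TypeDef("Referenceable", [
    AttributeDef("qualifiedName", "string", optional=False, unique=True)]))
# Assumed hierarchy, see the note above.
registry.register(TypeDef("Asset", [
    AttributeDef("name", "string", optional=False),
    AttributeDef("owner", "string"),
    AttributeDef("description", "string")], super_types=["Referenceable"]))
registry.register(TypeDef("DataSet", super_types=["Asset"]))
registry.register(TypeDef("Process", [
    AttributeDef("inputs", "array<DataSet>"),
    AttributeDef("outputs", "array<DataSet>")], super_types=["Asset"]))

print(registry.all_attributes("Process"))
```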
  11. Spark Introduction
      • RDD: Basic unit of execution
      • DataFrame: Relational RDD
      • Let's model a DataFrame type!
  12. DataFrame Type
      • spark_dataframe (extends DataSet): source, destination, columns
      • dataframe_column: type, dataframe, comment
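The two types on this slide could be written down as a type-definition payload along the following lines. This is a sketch only: the type and attribute names come from the slide, but the attribute data types, the optional flags, and the JSON field names (`entityDefs`, `superTypes`, `attributeDefs`, `typeName`, `isOptional`) are assumptions that should be checked against the Atlas version in use before sending anything to a server.

```python
import json


def attribute(name, type_name, optional=True):
    """Build one attribute definition; the field names are assumed, not verified."""
    return {"name": name, "typeName": type_name, "isOptional": optional}


typedefs = {
    "entityDefs": [
        {
            "name": "dataframe_column",
            # The slide does not show a supertype for the column type.
            "superTypes": [],
            "attributeDefs": [
                attribute("type", "string", optional=False),
                attribute("dataframe", "spark_dataframe"),
                attribute("comment", "string"),
            ],
        },
        {
            "name": "spark_dataframe",
            # Inherits the DataSet attributes (and, through it, qualifiedName).
            "superTypes": ["DataSet"],
            "attributeDefs": [
                attribute("source", "string", optional=False),
                attribute("destination", "string", optional=False),
                attribute("columns", "array<dataframe_column>"),
            ],
        },
    ]
}

print(json.dumps(typedefs, indent=2))
```

The `dataframe` attribute on the column and the `columns` attribute on the DataFrame point at each other, which is the kind of pairing the Composite and ReverseReference attribute flags describe.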
  13. Graph Snapshot
      • 1: Dataframe Type
      • 2: Column Type
      • 3: Dataframe Entity
      • 4, 5: Column Entities
      Diagram labels: /hdfs/source, /hdfs/destination, employeeInfo@Hortonworks, name, id
  14. Demo Example
      • PayrollDetails (HDFS PATH)
      • SalaryProcessor (DATAFRAME)
      • EmployeeSalary (KAFKA TOPIC)
      • VariableComponent (HDFS PATH)
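Upstream failure analysis and downstream alerts both reduce to walking a lineage graph. The sketch below uses the four assets from the demo slide; the edge directions are an assumption (both HDFS paths feed SalaryProcessor, which writes to the EmployeeSalary Kafka topic), since the slide text does not state them. The `upstream` and `downstream` helpers are hypothetical and are not Atlas calls.

```python
from collections import deque

# (input, output) pairs; directions are assumed, see the note above.
EDGES = [
    ("PayrollDetails", "SalaryProcessor"),
    ("VariableComponent", "SalaryProcessor"),
    ("SalaryProcessor", "EmployeeSalary"),
]


def _reachable(start, neighbours):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


def downstream(start, edges=EDGES):
    """Assets that depend, directly or indirectly, on `start`."""
    forward = {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
    return _reachable(start, forward)


def upstream(start, edges=EDGES):
    """Assets that `start` is derived from, directly or indirectly."""
    backward = {}
    for src, dst in edges:
        backward.setdefault(dst, []).append(src)
    return _reachable(start, backward)


# Who should be alerted if PayrollDetails is late or corrupt?
print(downstream("PayrollDetails"))
# Where could a bad EmployeeSalary record have come from?
print(upstream("EmployeeSalary"))
```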
  15. Hook Design
      • Hive Hook
        - Multiple clients, e.g. Pig, Hive, Beeline
        - Always full update to avoid inconsistency
      • Synchronous vs Asynchronous communication
        - Earlier: Hook communicated with server directly
        - Now: Metadata entities pushed to Kafka
      • Un-partitioned Kafka topic
        - Avoid out-of-order messages
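Why ordering matters for full-update messages can be shown with a small simulation. It does not use Kafka at all: `apply_in_order` and `round_robin` are hypothetical helpers, the entity name and payloads are made up, and the multi-partition case shows just one possible consumption order. It also ignores keyed partitioning, which would keep messages for one key together.

```python
def apply_in_order(messages):
    """Apply full-update messages in arrival order; the last one applied wins."""
    state = {}
    for entity, payload in messages:
        state[entity] = payload
    return state


def round_robin(messages, partitions):
    """Spread messages over partitions, as an unkeyed producer might."""
    buckets = [[] for _ in range(partitions)]
    for i, message in enumerate(messages):
        buckets[i % partitions].append(message)
    return buckets


# Three successive full updates of the same (hypothetical) table entity.
updates = [
    ("hive_table:employee", "columns=[id]"),
    ("hive_table:employee", "columns=[id, name]"),
    ("hive_table:employee", "columns=[id, name, salary]"),
]

# Single partition: the consumer sees the producer's order.
single = apply_in_order(updates)

# Two partitions: order holds only within each partition. Here the consumer
# happens to drain partition 0 before partition 1.
p0, p1 = round_robin(updates, 2)
multi = apply_in_order(p0 + p1)

print(single["hive_table:employee"])
print(multi["hive_table:employee"])
```

In the two-partition run the second update is applied last, so the stored metadata ends up stale even though every message was delivered.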
  16. Roadmap
      • Hooks for Spark, HBase and NiFi
      • Column level lineage for Hive
        - create table dest as select (a + b) x, (c * d) y from source
        - Diagram: a, b -> Addition -> x
      • Export/Import of metadata
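For the query on this slide, column-level lineage amounts to recording which source columns feed each output column and through which operation. The sketch below writes that mapping out by hand rather than parsing SQL. The record layout and helper names are hypothetical, and the "Multiplication" label is an assumption by analogy with the "Addition" node shown on the slide.

```python
# Column-level lineage for:
#   create table dest as select (a + b) x, (c * d) y from source
COLUMN_LINEAGE = [
    {"output": "dest.x", "operation": "Addition",
     "inputs": ["source.a", "source.b"]},
    # Operation label assumed; the slide only shows the Addition node.
    {"output": "dest.y", "operation": "Multiplication",
     "inputs": ["source.c", "source.d"]},
]


def inputs_of(output_column, lineage=COLUMN_LINEAGE):
    """Source columns that an output column is computed from."""
    for record in lineage:
        if record["output"] == output_column:
            return record["inputs"]
    return []


def affected_by(input_column, lineage=COLUMN_LINEAGE):
    """Output columns that would change if the given source column changed."""
    return [r["output"] for r in lineage if input_column in r["inputs"]]


print(inputs_of("dest.x"))
print(affected_by("source.c"))
```

Table-level lineage would only say that dest depends on source; the column-level record shows that a change to source.c can affect dest.y but not dest.x.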
  17. Contribute
      • Project Website: http://atlas.apache.org/
      • Dev Mailing List: dev@atlas.incubator.apache.org
      • User Mailing List: user@atlas.incubator.apache.org
      • JIRA: https://issues.apache.org/jira/browse/ATLAS
  18. Questions