Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters

Software Engineer Sudhanshu Arora shares the capabilities and architecture of Cloudera Navigator, along with a quick demo.

Published in: Technology

Transcript of "Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters"

  1. Cloudera Navigator
  2. Outline
      ● Capabilities
      ● Architecture
      ● Quick Demo
      ● Q&A
  3. Capabilities
      ● Discovery
        ○ Search through metadata to find the data set or operation of interest.
        ○ View the schema, associated metadata, etc. for a data set.
      ● Lineage
        ○ Given a data set, trace back to the original source.
        ○ Understand the impact of modifying a data set.
      ● Audit
        ○ Generate a report of access to a data set in Hadoop.
        ○ Generate an alert when a restricted data set is accessed.
  4. Discovery & Lineage (Questions to be asked)
      ● Ad-hoc or only predefined?
      ● Granularity?
      ● Analysis?
  5. Discovery & Lineage (Supported Systems)
      ● HDFS
      ● Hive
      ● MR1
      ● Oozie
      ● Pig
      ● YARN
      ● ...More coming...
  6. Discovery (Metadata Search)
  7. Discovery (Metadata Search)
  8. Discovery (Metadata Search)
  9. Discovery (View Schema)
  10. Discovery (Augment Metadata)
  11. Discovery (Search on associated metadata)
  12. Sidecars (Colocation of associated metadata)
      /user/root/customers/cust_demo
      /user/root/customers/.cust_demo.navigator
      Contents of .cust_demo.navigator:
      {
        "properties" : { "secret" : "true", "retention" : "small" },
        "tags" : ["pci"]
      }
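The sidecar convention on this slide can be sketched in a few lines of Python. This is only an illustration of the naming pattern and JSON layout shown above; the helper names `sidecar_path` and `sidecar_body` are hypothetical and are not part of Navigator.

```python
import json
import posixpath


def sidecar_path(data_path: str) -> str:
    """Return the hidden '.<name>.navigator' path that sits next to data_path."""
    parent, name = posixpath.split(data_path.rstrip("/"))
    return posixpath.join(parent, f".{name}.navigator")


def sidecar_body(properties: dict, tags: list) -> str:
    """Serialize properties and tags in the layout shown on the slide."""
    return json.dumps({"properties": properties, "tags": tags}, indent=2)


# Values taken from the slide; writing the file to HDFS is left out.
path = sidecar_path("/user/root/customers/cust_demo")
body = sidecar_body({"secret": "true", "retention": "small"}, ["pci"])
print(path)
print(body)
```

Keeping the metadata file next to the data it describes means the two can be moved or copied together.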
  13. Lineage (Hive Query)
      INSERT OVERWRITE TABLE machine_vendors
      SELECT
        upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)", 1))) AS manufacturer,
        upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)", 2))) AS product,
        ca.address_state,
        ca.customerKey,
        cm.clusterId,
        ms.machineName
      FROM crm_accounts ca
      JOIN cluster_metadata cm
        ON ca.customerKey = cm.customerKey
      JOIN machine_stats ms
        ON cm.customerKey = ms.customerKey
        AND cm.clusterId = ms.clusterId
        AND cm.collectionTS = ms.collectionTS
  14. Lineage
  15. Lineage (Path highlighted)
  16. Lineage (Instance)
  17. Lineage (Template)
  18. Lineage (Pig Script)
      posts = LOAD 'stackoverflow/posts/posts.csv'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (id:int, postTypeId:int, acceptedAnswerId:int, parentId:int,
            creationDate:chararray, score:int, viewCount:int, body:chararray,
            ownerUserId:chararray, lastEditorUserId:int,
            lastEditorDisplayName:chararray, lastEditDate:chararray,
            lastActivityDate:chararray, tile:chararray, tags:chararray,
            answerCount:int, commentCount:int, favoriteCount:int,
            closedDate:chararray, communityOwnedDate:chararray);
      comments = LOAD 'stackoverflow/comments/comments.csv'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (id:int, postId:int, score:int, text:chararray,
            creationDate:chararray, userDisplayName:chararray, userId:int);
      joined_post_comments = JOIN posts BY id, comments BY postId;
      post_comments = FOREACH joined_post_comments
        GENERATE posts::id..posts::communityOwnedDate,
                 comments::postId..comments::userId;
      grouped_comments = GROUP post_comments BY posts::id;
      comments_per_post = FOREACH grouped_comments
        GENERATE group AS postId, post_comments.comments::text AS comment;
      rmf stackoverflow/output/comments_per_post
      STORE comments_per_post INTO 'stackoverflow/output/comments_per_post'
        USING PigStorage();
  19. Lineage (Pig)
  20. Discovery & Lineage Architecture
  21. Model
      ● Generic (Element, Relations)
      ● Element
        ○ Unique Identity
        ○ Key-value pairs
        ○ Tags
        (Operation, Operation Execution, FSElement, Table, Column…)
  22. Model (Contd…)
      ● Relation
        ○ Unique Identity
        ○ Two sets of related elements
        ○ Relationship type
        (Parent Child Relation, Data Flow Relation, Control Flow Relation, Instance Of Relation, Alias Relation, Generic Relation)
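The generic model on these two slides can be pictured as two small record types. The following is a minimal sketch, not Navigator's actual classes: the field names `sources` and `targets` and every enum spelling other than `PARENT_CHILD` (which appears in the REST example) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationType(Enum):
    # Only PARENT_CHILD is spelled this way in the slides; the rest are guesses.
    PARENT_CHILD = "PARENT_CHILD"
    DATA_FLOW = "DATA_FLOW"
    CONTROL_FLOW = "CONTROL_FLOW"
    INSTANCE_OF = "INSTANCE_OF"
    ALIAS = "ALIAS"
    GENERIC = "GENERIC"


@dataclass
class Element:
    identity: str                                    # unique identity
    properties: dict = field(default_factory=dict)   # key-value pairs
    tags: list = field(default_factory=list)         # free-form tags


@dataclass
class Relation:
    identity: str          # unique identity
    type: RelationType     # relationship type
    sources: list          # first set of related element identities
    targets: list          # second set of related element identities


# A table that is the parent of one column.
table = Element("t1", {"kind": "Table"}, ["pci"])
column = Element("c1", {"kind": "Column"})
rel = Relation("r1", RelationType.PARENT_CHILD, [table.identity], [column.identity])
print(rel)
```

Because a relation links two sets rather than two single elements, one record can describe, for example, a join that reads several tables and writes one.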
  23. Discovery & Lineage (REST API)
      ● Elements Resource
        ○ curl 'http://localhost:5150/api/v1/elements?query=originalName:job_&limit=100&offset=100'
          [{
            "identity" : "513bf7add8d5f56b7f0f34769707cb5f",
            "originalName" : "job_1389320017591_0024_conf.xml",
            "firstClassParentId" : null,
            "name" : null,
            "description" : null,
            "tags" : null,
            "properties" : null,
            "fileSystemPath" : "/user/history/done/2014/01/31/000000/job_1389320017591_0024_conf.xml",
            "category" : "FILE",
            "size" : 139211,
            "lastModified" : "1969-12-31T23:59:59.999Z",
            "lastAccessed" : "2014-02-04T02:12:01.369Z",
            "owner" : "root",
            "group" : "hadoop",
            "blockSize" : null,
            "mimeType" : "application/octet-stream",
            "replication" : null,
            "deleted" : false,
            "resType" : "HDFS",
            "permission" : 432,
            "resId" : "858e5548b4cd3457432eb491ee74729d",
            "type" : "fselement"
          }, ...]
        ○ curl 'http://localhost:5150/api/v1/elements/f53ae3547a90b7519b44041db1898972'
        ○ curl -X PUT -H "Content-Type: application/json" -d '{"displayName":"test","description":"describe me","tags":[]}' http://localhost:5150/api/v1/elements/e5f94cd59a8ca6df96247ce88b6c9c28
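The search URL in the first curl call can also be assembled programmatically. The sketch below only builds the URL from the `query`, `limit`, and `offset` parameters shown on the slide and sends nothing over the network; `elements_url` is a hypothetical helper, and the host and port are copied from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def elements_url(base: str, query: str, limit: int, offset: int) -> str:
    """Build an elements search URL; urlencode percent-escapes the ':' in the query."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base}/api/v1/elements?{params}"


url = elements_url("http://localhost:5150", "originalName:job_", 100, 100)
print(url)

# Round-trip check: decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Paging through results would then be a matter of calling the helper with increasing `offset` values, assuming the endpoint behaves as the slide's example suggests.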
  24. Discovery & Lineage (REST API)
      ● Relations Resource
        curl 'http://localhost:7187/api/v1/relations?elementIds=83f4cdcc37c379144fef22e3dbdf7c8c&types=PARENT_CHILD&depth=2'
        [{ "identity" : "91540192d3dd727f912b3c0bb91cdd81",
           "type" : "PARENT_CHILD",
           "parent" : [ { "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c", },
           "children" : [ { "elementIds" : [ "6144fabee63641275c5577697f16266a" ], }
           "name" : null},...]
      ● Interactive Resource
        curl 'http://localhost:7187/api/v1/interactive/elements?query=originalName:test&limit=2'
        { "offset" : 0,
          "totalMatched" : 2,
          "limit" : 1,
          "results" : [ { "identity" : "9b7b9d95eb06ccf0b1b0cd1a39642889", "category" : "DIRECTORY",... },
          "facets" : { },
          "qtime" : 10 }
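One way to read the `depth=2` parameter in the relations call is as a limit on how many parent-to-child hops are followed from the starting element. The sketch below illustrates that interpretation with a breadth-first walk over an in-memory map; the interpretation, the `descendants` helper, and the sample identifiers are all assumptions and do not come from Navigator.

```python
from collections import deque


def descendants(children_of: dict, root: str, depth: int) -> list:
    """Return ids reachable from root within `depth` parent-to-child hops."""
    seen = {root}
    found = []
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        if level == depth:
            continue  # do not expand past the requested depth
        for child in children_of.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, level + 1))
    return found


# Hypothetical parent -> children map standing in for PARENT_CHILD relations.
children_of = {
    "dir": ["file_a", "file_b"],
    "file_a": ["block_1"],
    "block_1": ["too_deep"],
}
result = descendants(children_of, "dir", 2)
print(result)
```

The `seen` set keeps the walk from looping if the same element is reachable along more than one path.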
  25. Audit (Supported Systems)
      ● HDFS
      ● HBase
      ● Hive
      ● Impala
      ● ...More coming...
  26. Audit Configuration
  27. Audit View
  28. Audit Details
      ● User
        ○ Username, Impersonator, IP Address
      ● Operation Information
        ○ Operation Type, Session Id, Query Id, Operation Text, Status, Time
      ● Object Information
        ○ ServiceName, Path (Different in different systems)
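To make the audit fields concrete, the sketch below models a subset of them as a record and picks out events that touch a restricted path. It is a toy illustration under stated assumptions: the `AuditEvent` class, the `restricted_hits` function, and the sample events are hypothetical, and Navigator's real alerting mechanism is not described in the slides.

```python
from dataclasses import dataclass


@dataclass
class AuditEvent:
    # A subset of the fields listed on the slide.
    username: str
    ip_address: str
    operation_type: str
    status: str
    service_name: str
    path: str


def restricted_hits(events: list, restricted: list) -> list:
    """Return events whose path is a restricted path or lies beneath one."""
    def is_restricted(path: str) -> bool:
        return any(
            path == root or path.startswith(root.rstrip("/") + "/")
            for root in restricted
        )
    return [e for e in events if is_restricted(e.path)]


events = [
    AuditEvent("alice", "10.0.0.5", "open", "allowed", "hdfs", "/user/root/customers/cust_demo"),
    AuditEvent("bob", "10.0.0.6", "open", "allowed", "hdfs", "/user/root/customers_old/x"),
    AuditEvent("carol", "10.0.0.7", "delete", "denied", "hdfs", "/tmp/scratch"),
]
hits = restricted_hits(events, ["/user/root/customers"])
for e in hits:
    print(f"ALERT: {e.username} {e.operation_type} {e.path} ({e.status})")
```

Matching on the path plus a trailing slash avoids flagging sibling directories that merely share a prefix, such as `customers_old` in the sample data.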
  29. Audit Architecture
      Log4j Appender
