Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters


Software Engineer Sudhanshu Arora shares the capabilities, architecture, and a quick demo of Cloudera Navigator.

Published in Technology
Transcript

  • 1. Cloudera Navigator
  • 2. Outline
      ● Capabilities
      ● Architecture
      ● Quick Demo
      ● Q&A
  • 3. Capabilities
      ● Discovery
        ○ Search through metadata to find data set/operation of interest.
        ○ View schema, associated metadata etc. for a data set.
      ● Lineage
        ○ Given a data set, trace back to the original source.
        ○ Understand the impact of modifying a data set.
      ● Audit
        ○ Generate report of access to a data set in Hadoop.
        ○ Generate alert when a restricted data set is accessed.
  • 4. Discovery & Lineage (Questions to be asked)
      ● Ad-hoc or only predefined?
      ● Granularity?
      ● Analysis?
  • 5. Discovery & Lineage (Supported Systems)
      ● HDFS
      ● Hive
      ● MR1
      ● Oozie
      ● Pig
      ● YARN
      ● ...More coming...
  • 6. Discovery (Metadata Search)
  • 7. Discovery (Metadata Search)
  • 8. Discovery (Metadata Search)
  • 9. Discovery (View Schema)
  • 10. Discovery (Augment Metadata)
  • 11. Discovery (Search on associated metadata)
  • 12. Sidecars (Colocation of associated metadata)
      /user/root/customers/cust_demo
      /user/root/customers/.cust_demo.navigator
      Contents of .cust_demo.navigator:
      {
        "properties" : { "secret" : "true", "retention" : "small" },
        "tags" : ["pci"]
      }
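The sidecar naming convention on this slide can be sketched in a few lines of Python. This is only an illustration under the assumption that the sidecar always sits in the same directory as the data set and is named `.<name>.navigator`, as in the one example shown; the helper names `sidecar_path` and `sidecar_body` are hypothetical and not part of Cloudera Navigator.

```python
import json
import posixpath


def sidecar_path(data_path: str) -> str:
    """Return the colocated sidecar path: same directory, '.<name>.navigator'.

    Assumption: the naming rule is inferred from the single example on the slide.
    """
    parent, name = posixpath.split(data_path.rstrip("/"))
    return posixpath.join(parent, f".{name}.navigator")


def sidecar_body(properties: dict, tags: list) -> str:
    """Serialize metadata in the two-key layout shown on the slide."""
    return json.dumps({"properties": properties, "tags": tags}, indent=2)


path = sidecar_path("/user/root/customers/cust_demo")
body = sidecar_body({"secret": "true", "retention": "small"}, ["pci"])
print(path)
print(body)
```

Writing the resulting body to the resulting path in HDFS is left out, since the slide does not say how the file is created.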
  • 13. Lineage (Hive Query)
      INSERT OVERWRITE TABLE machine_vendors
      SELECT
        upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)", 1))) AS manufacturer,
        upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)", 2))) AS product,
        ca.address_state,
        ca.customerKey,
        cm.clusterId,
        ms.machineName
      FROM crm_accounts ca
      JOIN cluster_metadata cm ON ca.customerKey = cm.customerKey
      JOIN machine_stats ms ON cm.customerKey = ms.customerKey
        AND cm.clusterId = ms.clusterId
        AND cm.collectionTS = ms.collectionTS
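To make the two extraction expressions in this query easier to follow, here is a rough Python equivalent. It assumes the patterns are meant to match dmidecode fields separated by a newline and a tab; the `sample` text and the `extract` helper are hypothetical, and Python's `re` stands in for Hive's `regexp_extract`.

```python
import re

# Hypothetical dmidecode excerpt, for illustration only.
sample = (
    "System Information\n"
    "\tManufacturer: Dell Inc. \n"
    "\tProduct Name: PowerEdge R720\n"
)

MANUFACTURER = r"System Information\n\tManufacturer: ([^\n]+)"
PRODUCT = r"System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)"


def extract(pattern: str, text: str, group: int):
    """Mimic upper(trim(regexp_extract(text, pattern, group))); None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(group).strip().upper()


manufacturer = extract(MANUFACTURER, sample, 1)
product = extract(PRODUCT, sample, 2)
print(manufacturer, "|", product)
```

The second pattern repeats the manufacturer line only to anchor the match; the product is taken from capture group 2.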
  • 14. Lineage
  • 15. Lineage (Path highlighted)
  • 16. Lineage (Instance)
  • 17. Lineage (Template)
  • 18. Lineage (Pig Script)
      posts = LOAD 'stackoverflow/posts/posts.csv'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (id:int, postTypeId:int, acceptedAnswerId:int, parentId:int, creationDate:chararray,
            score:int, viewCount:int, body:chararray, ownerUserId:chararray, lastEditorUserId:int,
            lastEditorDisplayName:chararray, lastEditDate:chararray, lastActivityDate:chararray,
            tile:chararray, tags:chararray, answerCount:int, commentCount:int, favoriteCount:int,
            closedDate:chararray, communityOwnedDate:chararray);
      comments = LOAD 'stackoverflow/comments/comments.csv'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage()
        AS (id:int, postId:int, score:int, text:chararray, creationDate:chararray,
            userDisplayName:chararray, userId:int);
      joined_post_comments = JOIN posts by id, comments by postId;
      post_comments = FOREACH joined_post_comments
        GENERATE posts::id..posts::communityOwnedDate, comments::postId..comments::userId;
      grouped_comments = GROUP post_comments BY posts::id;
      comments_per_post = FOREACH grouped_comments
        GENERATE group as postId, post_comments.comments::text as comment;
      rmf stackoverflow/output/comments_per_post
      STORE comments_per_post INTO 'stackoverflow/output/comments_per_post' USING PigStorage();
  • 19. Lineage (Pig)
  • 20. Discovery & Lineage Architecture
  • 21. Model
      ● Generic (Element, Relations)
      ● Element
        ○ Unique Identity
        ○ Key-value pairs
        ○ Tags
        (Operation, Operation Execution, FSElement, Table, Column…)
  • 22. Model (Contd…)
      ● Relation
        ○ Unique Identity
        ○ Two sets of related elements
        ○ Relationship type (Parent Child Relation, Data Flow Relation, Control Flow Relation, Instance Of Relation, Alias Relation, Generic Relation)
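The generic element/relation model on these two slides could be represented roughly as follows. This is a minimal sketch, not Navigator's actual schema: the class and field names, the `DATA_FLOW` type string, and the `upstream` helper are assumptions chosen for illustration, and table names are used as identities only for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    identity: str                                     # unique identity
    kind: str                                         # e.g. "Table", "FSElement", "Operation"
    properties: dict = field(default_factory=dict)    # key-value pairs
    tags: list = field(default_factory=list)


@dataclass
class Relation:
    identity: str
    type: str        # e.g. "PARENT_CHILD", "DATA_FLOW" (assumed spellings)
    sources: set     # identities of one set of related elements
    targets: set     # identities of the other set


def upstream(element_id: str, relations: list) -> set:
    """Follow DATA_FLOW relations backwards and return the elements with no inputs."""
    inputs = {}
    for rel in relations:
        if rel.type != "DATA_FLOW":
            continue
        for target in rel.targets:
            inputs.setdefault(target, set()).update(rel.sources)

    roots, seen, stack = set(), set(), [element_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = inputs.get(current)
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)   # nothing flows into this element
    return roots


# The three source tables of the Hive query on slide 13 feeding machine_vendors.
relations = [
    Relation("r1", "DATA_FLOW",
             {"crm_accounts", "cluster_metadata", "machine_stats"},
             {"machine_vendors"}),
]
print(sorted(upstream("machine_vendors", relations)))
```

An element with no incoming data flow is returned as its own source, which matches the "trace back to the original source" capability listed on slide 3.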
  • 23. Discovery & Lineage (REST API)
      ● Elements Resource
        ○ curl 'http://localhost:5150/api/v1/elements?query=originalName:job_&limit=100&offset=100'
          [ {
            "identity" : "513bf7add8d5f56b7f0f34769707cb5f",
            "originalName" : "job_1389320017591_0024_conf.xml",
            "firstClassParentId" : null,
            "name" : null,
            "description" : null,
            "tags" : null,
            "properties" : null,
            "fileSystemPath" : "/user/history/done/2014/01/31/000000/job_1389320017591_0024_conf.xml",
            "category" : "FILE",
            "size" : 139211,
            "lastModified" : "1969-12-31T23:59:59.999Z",
            "lastAccessed" : "2014-02-04T02:12:01.369Z",
            "owner" : "root",
            "group" : "hadoop",
            "blockSize" : null,
            "mimeType" : "application/octet-stream",
            "replication" : null,
            "deleted" : false,
            "resType" : "HDFS",
            "permission" : 432,
            "resId" : "858e5548b4cd3457432eb491ee74729d",
            "type" : "fselement"
          }, ...]
        ○ curl 'http://localhost:5150/api/v1/elements/f53ae3547a90b7519b44041db1898972'
        ○ curl -X PUT -H "Content-Type: application/json" -d '{"displayName":"test","description":"describe me","tags":[]}' http://localhost:5150/api/v1/elements/e5f94cd59a8ca6df96247ce88b6c9c28
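The search and lookup URLs on this slide can be assembled programmatically. The sketch below only builds the URLs and sends no request; the paths and the `query`, `limit`, and `offset` parameters are taken from the slide, while the helper names and the default values are assumptions.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def elements_search_url(base: str, query: str, limit: int = 100, offset: int = 0) -> str:
    """Build the Elements search URL; urlencode percent-escapes the ':' in the query."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base}/api/v1/elements?{params}"


def element_url(base: str, identity: str) -> str:
    """Build the URL of a single element, addressed by its identity."""
    return f"{base}/api/v1/elements/{quote(identity, safe='')}"


search = elements_search_url("http://localhost:5150", "originalName:job_", limit=100, offset=100)
single = element_url("http://localhost:5150", "f53ae3547a90b7519b44041db1898972")
print(search)
print(single)
```

The same `element_url` result would be the target of the `PUT` request shown on the slide.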
  • 24. Discovery & Lineage (REST API)
      ● Relations Resource
        curl 'http://localhost:7187/api/v1/relations?elementIds=83f4cdcc37c379144fef22e3dbdf7c8c&types=PARENT_CHILD&depth=2'
        [ {
          "identity" : "91540192d3dd727f912b3c0bb91cdd81",
          "type" : "PARENT_CHILD",
          "parent" : { "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c" },
          "children" : { "elementIds" : [ "6144fabee63641275c5577697f16266a" ] },
          "name" : null
        }, ...]
      ● Interactive Resource
        curl 'http://localhost:7187/api/v1/interactive/elements?query=originalName:test&limit=2'
        {
          "offset" : 0,
          "totalMatched" : 2,
          "limit" : 1,
          "results" : [ {
            "identity" : "9b7b9d95eb06ccf0b1b0cd1a39642889",
            "category" : "DIRECTORY", ...
          } ],
          "facets" : { },
          "qtime" : 10
        }
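A client would typically turn a Relations response into a parent-to-children map before drawing a lineage graph. The sketch below assumes each relation carries a `parent` object with an `elementId` and a `children` object with `elementIds`, which is one reading of the sample on this slide; the function name is hypothetical and the payload is pasted in rather than fetched.

```python
import json

# Sample payload using the identifiers shown on the slide (shape is an assumption).
payload = """
[ {
  "identity" : "91540192d3dd727f912b3c0bb91cdd81",
  "type" : "PARENT_CHILD",
  "parent" : { "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c" },
  "children" : { "elementIds" : [ "6144fabee63641275c5577697f16266a" ] },
  "name" : null
} ]
"""


def children_by_parent(relations: list) -> dict:
    """Group child element ids under their parent id for PARENT_CHILD relations."""
    tree = {}
    for rel in relations:
        if rel.get("type") != "PARENT_CHILD":
            continue
        parent_id = rel["parent"]["elementId"]
        tree.setdefault(parent_id, []).extend(rel["children"]["elementIds"])
    return tree


tree = children_by_parent(json.loads(payload))
print(tree)
```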
  • 25. Audit (Supported Systems)
      ● HDFS
      ● HBase
      ● Hive
      ● Impala
      ● ...More coming...
  • 26. Audit Configuration
  • 27. Audit View
  • 28. Audit Details
      ● User
        ○ Username, Impersonator, IP Address
      ● Operation Information
        ○ Operation Type, Session Id, Query Id, Operation Text, Status, Time
      ● Object Information
        ○ ServiceName, Path (varies by system)
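The audit fields listed on this slide map naturally onto a flat record, and the "alert when a restricted data set is accessed" capability from slide 3 then becomes a filter over such records. This is a sketch only: the class, its field types, the `restricted_accesses` helper, and the sample events are all hypothetical and do not reflect Navigator's real audit schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class AuditEvent:
    # User
    username: str
    impersonator: Optional[str]
    ip_address: str
    # Operation information
    operation_type: str
    session_id: Optional[str]
    query_id: Optional[str]
    operation_text: str
    status: str
    time: str
    # Object information
    service_name: str
    path: str


def restricted_accesses(events: Iterable[AuditEvent], restricted: Iterable[str]) -> List[AuditEvent]:
    """Return events whose path equals, or lies under, any restricted path."""
    roots = [r.rstrip("/") for r in restricted]

    def is_under(path: str, root: str) -> bool:
        return path == root or path.startswith(root + "/")

    return [e for e in events if any(is_under(e.path, r) for r in roots)]


# Hypothetical events, for illustration only.
events = [
    AuditEvent("alice", None, "10.0.0.5", "open", None, None, "", "ALLOWED",
               "2014-02-04T02:12:01Z", "hdfs", "/user/root/customers/cust_demo"),
    AuditEvent("bob", None, "10.0.0.6", "open", None, None, "", "ALLOWED",
               "2014-02-04T02:13:00Z", "hdfs", "/user/root/customers_public/readme"),
]
alerts = restricted_accesses(events, ["/user/root/customers"])
print([e.username for e in alerts])
```

Matching on a path boundary rather than a plain string prefix keeps `/user/root/customers_public` from being treated as part of `/user/root/customers`.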
  • 29. Audit Architecture Log4j Appender