Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
 


Software Engineer Sudhanshu Arora shares the capabilities, architecture, and a quick demo of Cloudera Navigator.

Presentation Transcript

  • Cloudera Navigator
  • Outline
    ● Capabilities
    ● Architecture
    ● Quick Demo
    ● Q&A
  • Capabilities
    ● Discovery
      ○ Search through metadata to find a data set or operation of interest.
      ○ View the schema, associated metadata, etc. for a data set.
    ● Lineage
      ○ Given a data set, trace back to the original source.
      ○ Understand the impact of modifying a data set.
    ● Audit
      ○ Generate a report of access to a data set in Hadoop.
      ○ Generate an alert when a restricted data set is accessed.
  • Discovery & Lineage (Questions to be asked?)
    ● Ad-hoc or only predefined?
    ● Granularity?
    ● Analysis?
  • Discovery & Lineage (Supported Systems)
    ● HDFS
    ● Hive
    ● MR1
    ● Oozie
    ● Pig
    ● YARN
    ● ...More coming...
  • Discovery (Metadata Search)
  • Discovery (Metadata Search)
  • Discovery (Metadata Search)
  • Discovery (View Schema)
  • Discovery (Augment Metadata)
  • Discovery (Search on associated metadata)
  • Sidecars (Colocation of associated metadata)
    Data path:    /user/root/customers/cust_demo
    Sidecar path: /user/root/customers/.cust_demo.navigator
    Contents of .cust_demo.navigator:
      {
        "properties" : { "secret" : "true", "retention" : "small" },
        "tags" : ["pci"]
      }
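The sidecar convention above can be sketched in a few lines of Python. This is only an illustration of the naming rule visible on the slide (same directory, file name wrapped as `.<name>.navigator`); the helper names `sidecar_path` and `parse_sidecar` are hypothetical and not part of Cloudera Navigator.

```python
import json
import posixpath
from typing import Dict, List, Tuple


def sidecar_path(data_path: str) -> str:
    """Return the sidecar path for an HDFS-style path: same directory,
    file name wrapped as '.<name>.navigator' (the pattern on the slide)."""
    parent, name = posixpath.split(data_path.rstrip("/"))
    return posixpath.join(parent, f".{name}.navigator")


def parse_sidecar(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Parse sidecar JSON into (properties, tags).
    Defaulting missing keys to empty values is a choice made for this sketch."""
    doc = json.loads(text)
    return doc.get("properties", {}), doc.get("tags", [])


# Sidecar contents copied from the slide.
SIDECAR_TEXT = (
    '{ "properties" : { "secret" : "true", "retention" : "small" },'
    ' "tags" : ["pci"] }'
)

if __name__ == "__main__":
    print(sidecar_path("/user/root/customers/cust_demo"))
    print(parse_sidecar(SIDECAR_TEXT))
```

Keeping the metadata next to the data means it travels with the data set when the directory is copied or moved, which is presumably the point of the "colocation" in the slide title.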
  • Lineage (Hive Query)
    INSERT OVERWRITE TABLE machine_vendors
    SELECT
      upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)", 1))) AS manufacturer,
      upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)", 2))) AS product,
      ca.address_state,
      ca.customerKey,
      cm.clusterId,
      ms.machineName
    FROM crm_accounts ca
    JOIN cluster_metadata cm ON ca.customerKey = cm.customerKey
    JOIN machine_stats ms ON cm.customerKey = ms.customerKey
      AND cm.clusterId = ms.clusterId
      AND cm.collectionTS = ms.collectionTS
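To make the two `regexp_extract` calls in the query easier to follow, here is a rough Python equivalent. It assumes the pattern's escapes are a newline followed by a tab (`\n\t`) between fields and `[^\n]` inside the groups; the transcript lost the backslashes, so this is a reconstruction. The sample text is invented, and Python's `re` stands in for Hive's Java regex engine, which behaves the same way for a pattern this simple.

```python
import re

# Assumed reconstruction of the pattern: newline + tab between fields.
PATTERN = r"System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)"

# Hypothetical dmidecode excerpt, invented for illustration.
SAMPLE = "System Information\n\tManufacturer: Example Corp \n\tProduct Name: Model X1\n"


def extract(text: str, group: int) -> str:
    """Mimic upper(trim(regexp_extract(text, PATTERN, group))).
    Returning '' when nothing matches is a choice made for this sketch."""
    match = re.search(PATTERN, text)
    return match.group(group).strip().upper() if match else ""


if __name__ == "__main__":
    print(extract(SAMPLE, 1))  # manufacturer column
    print(extract(SAMPLE, 2))  # product column
```

For lineage purposes the interesting part is not the regex itself but that both output columns derive from the single input column `ms.dmidecode`.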
  • Lineage
  • Lineage (Path highlighted)
  • Lineage (Instance)
  • Lineage (Template)
  • Lineage (Pig Script)
    posts = LOAD 'stackoverflow/posts/posts.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (id:int, postTypeId:int, acceptedAnswerId:int, parentId:int,
          creationDate:chararray, score:int, viewCount:int, body:chararray,
          ownerUserId:chararray, lastEditorUserId:int,
          lastEditorDisplayName:chararray, lastEditDate:chararray,
          lastActivityDate:chararray, tile:chararray, tags:chararray,
          answerCount:int, commentCount:int, favoriteCount:int,
          closedDate:chararray, communityOwnedDate:chararray);
    comments = LOAD 'stackoverflow/comments/comments.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (id:int, postId:int, score:int, text:chararray,
          creationDate:chararray, userDisplayName:chararray, userId:int);
    joined_post_comments = JOIN posts by id, comments by postId;
    post_comments = FOREACH joined_post_comments
      GENERATE posts::id..posts::communityOwnedDate, comments::postId..comments::userId;
    grouped_comments = GROUP post_comments BY posts::id;
    comments_per_post = FOREACH grouped_comments
      GENERATE group as postId, post_comments.comments::text as comment;
    rmf stackoverflow/output/comments_per_post
    STORE comments_per_post INTO 'stackoverflow/output/comments_per_post' USING PigStorage();
  • Lineage (Pig)
  • Discovery & Lineage Architecture
  • Model
    ● Generic (Element, Relations)
    ● Element
      ○ Unique Identity
      ○ Key-value pairs
      ○ Tags (Operation, Operation Execution, FSElement, Table, Column…)
  • Model (Contd…)
    ● Relation
      ○ Unique Identity
      ○ Two sets of related elements
      ○ Relationship type (Parent Child Relation, Data Flow Relation, Control Flow Relation, Instance Of Relation, Alias Relation, Generic Relation)
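A minimal Python sketch of this element/relation model, and of how "trace back to the original source" could work on top of it, might look as follows. Everything here is an assumption made for illustration: the class and field names, the `"DATA_FLOW"` type string, and the `/raw/machine_stats` path are hypothetical and do not describe Navigator's actual implementation. The table names come from the Hive query shown earlier in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass
class Element:
    identity: str                                   # unique identity
    properties: Dict[str, str] = field(default_factory=dict)  # key-value pairs
    tags: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Relation:
    identity: str                 # unique identity
    rel_type: str                 # e.g. "PARENT_CHILD", "DATA_FLOW" (assumed spellings)
    sources: FrozenSet[str]       # first set of related element ids
    targets: FrozenSet[str]       # second set of related element ids


def upstream(element_id: str, relations: Iterable[Relation]) -> Set[str]:
    """Return ids of every element that feeds element_id, directly or
    indirectly, through DATA_FLOW relations."""
    relations = list(relations)
    seen: Set[str] = set()
    frontier = [element_id]
    while frontier:
        current = frontier.pop()
        for rel in relations:
            if rel.rel_type == "DATA_FLOW" and current in rel.targets:
                for src in rel.sources - seen:
                    seen.add(src)
                    frontier.append(src)
    return seen


# Example: the three input tables of the Hive query flow into machine_vendors;
# the raw file feeding machine_stats is hypothetical.
RELATIONS = [
    Relation("r1", "DATA_FLOW",
             frozenset({"crm_accounts", "cluster_metadata", "machine_stats"}),
             frozenset({"machine_vendors"})),
    Relation("r2", "DATA_FLOW",
             frozenset({"/raw/machine_stats"}), frozenset({"machine_stats"})),
    Relation("r3", "PARENT_CHILD",
             frozenset({"db"}), frozenset({"machine_vendors"})),
]

if __name__ == "__main__":
    print(sorted(upstream("machine_vendors", RELATIONS)))
```

Impact analysis ("what breaks if I modify this data set?") is the same walk in the opposite direction, following `targets` instead of `sources`.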
  • Discovery & Lineage (REST API)
    ● Elements Resource
      ○ curl 'http://localhost:5150/api/v1/elements?query=originalName:job_&limit=100&offset=100'
        [ {
          "identity" : "513bf7add8d5f56b7f0f34769707cb5f",
          "originalName" : "job_1389320017591_0024_conf.xml",
          "firstClassParentId" : null,
          "name" : null,
          "description" : null,
          "tags" : null,
          "properties" : null,
          "fileSystemPath" : "/user/history/done/2014/01/31/000000/job_1389320017591_0024_conf.xml",
          "category" : "FILE",
          "size" : 139211,
          "lastModified" : "1969-12-31T23:59:59.999Z",
          "lastAccessed" : "2014-02-04T02:12:01.369Z",
          "owner" : "root",
          "group" : "hadoop",
          "blockSize" : null,
          "mimeType" : "application/octet-stream",
          "replication" : null,
          "deleted" : false,
          "resType" : "HDFS",
          "permission" : 432,
          "resId" : "858e5548b4cd3457432eb491ee74729d",
          "type" : "fselement"
        }, ... ]
      ○ curl 'http://localhost:5150/api/v1/elements/f53ae3547a90b7519b44041db1898972'
      ○ curl -X PUT -H "Content-Type: application/json" -d '{"displayName":"test","description":"describe me","tags":[]}' http://localhost:5150/api/v1/elements/e5f94cd59a8ca6df96247ce88b6c9c28
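The search call on this slide can be assembled programmatically. The sketch below only builds the URL and performs no network request; the function name `elements_url` is hypothetical, while the path and the `query`, `limit`, and `offset` parameters are taken from the curl example. It also shows one plausible reading of `"permission" : 432`: as a decimal number it equals octal 660 (rw-rw----), although the slide does not say how the field is encoded.

```python
from urllib.parse import urlencode


def elements_url(base: str, query: str, limit: int = 100, offset: int = 0) -> str:
    """Build an elements search URL in the shape shown on the slide.
    ':' is kept unescaped so the query stays readable (field:value)."""
    params = {"query": query, "limit": limit, "offset": offset}
    return f"{base}/api/v1/elements?{urlencode(params, safe=':')}"


def permission_octal(permission: int) -> str:
    """Render a decimal permission value as an octal mode string.
    Treating the field as decimal is an assumption of this sketch."""
    return format(permission, "o")


if __name__ == "__main__":
    print(elements_url("http://localhost:5150", "originalName:job_", 100, 100))
    print(permission_octal(432))
```

With `limit` and `offset` exposed like this, paging through a large result set is a matter of increasing `offset` by `limit` until a page comes back with fewer than `limit` entries.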
  • Discovery & Lineage (REST API)
    ● Relations Resource
      curl 'http://localhost:7187/api/v1/relations?elementIds=83f4cdcc37c379144fef22e3dbdf7c8c&types=PARENT_CHILD&depth=2'
        [ {
          "identity" : "91540192d3dd727f912b3c0bb91cdd81",
          "type" : "PARENT_CHILD",
          "parent" : { "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c" },
          "children" : { "elementIds" : [ "6144fabee63641275c5577697f16266a" ] },
          "name" : null
        }, ... ]
    ● Interactive Resource
      curl 'http://localhost:7187/api/v1/interactive/elements?query=originalName:test&limit=2'
        {
          "offset" : 0,
          "totalMatched" : 2,
          "limit" : 1,
          "results" : [ { "identity" : "9b7b9d95eb06ccf0b1b0cd1a39642889", "category" : "DIRECTORY", ... } ],
          "facets" : { },
          "qtime" : 10
        }
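A client would typically turn the relations response into a lookup table before drawing or walking the graph. The sketch below assumes each relation object carries `parent.elementId` and `children.elementIds`, which is a reading of the partly garbled JSON on the slide rather than a documented schema; the function name `children_by_parent` is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Sample response using the ids from the slide; the exact nesting is assumed.
SAMPLE_RESPONSE = """
[ {
  "identity" : "91540192d3dd727f912b3c0bb91cdd81",
  "type" : "PARENT_CHILD",
  "parent" : { "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c" },
  "children" : { "elementIds" : [ "6144fabee63641275c5577697f16266a" ] },
  "name" : null
} ]
"""


def children_by_parent(relations: List[dict]) -> Dict[str, List[str]]:
    """Group child element ids under their parent id,
    considering PARENT_CHILD relations only."""
    index: Dict[str, List[str]] = defaultdict(list)
    for rel in relations:
        if rel.get("type") != "PARENT_CHILD":
            continue
        index[rel["parent"]["elementId"]].extend(rel["children"]["elementIds"])
    return dict(index)


if __name__ == "__main__":
    print(children_by_parent(json.loads(SAMPLE_RESPONSE)))
```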
  • Audit (Supported Systems)
    ● HDFS
    ● HBase
    ● Hive
    ● Impala
    ● ...More coming...
  • Audit Configuration
  • Audit View
  • Audit Details
    ● User
      ○ Username, Impersonator, IP Address
    ● Operation Information
      ○ Operation Type, Session Id, Query Id, Operation Text, Status, Time
    ● Object Information
      ○ ServiceName, Path (Different in different systems)
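The audit fields listed above are enough to sketch the "alert when a restricted data set is accessed" capability. The record below carries a subset of those fields (session id, query id, and operation text are left out for brevity), and the class name, function name, and sample values are all hypothetical; this is not Navigator's audit schema or alerting logic.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AuditEvent:
    # User
    username: str
    impersonator: str
    ip_address: str
    # Operation information (subset)
    operation_type: str
    status: str
    time: str
    # Object information
    service_name: str
    path: str


def restricted_accesses(events: Iterable[AuditEvent],
                        restricted_prefixes: Iterable[str]) -> List[AuditEvent]:
    """Return events whose path is a restricted path or lies beneath one."""
    prefixes = tuple(p.rstrip("/") for p in restricted_prefixes)
    return [
        e for e in events
        if any(e.path == p or e.path.startswith(p + "/") for p in prefixes)
    ]


# Invented sample events for illustration.
EVENTS = [
    AuditEvent("alice", "", "10.0.0.5", "open", "allowed",
               "2014-02-04T02:12:01Z", "hdfs", "/user/root/customers/cust_demo"),
    AuditEvent("bob", "", "10.0.0.6", "open", "allowed",
               "2014-02-04T02:13:00Z", "hdfs", "/user/root/customers_public/readme"),
]

if __name__ == "__main__":
    for event in restricted_accesses(EVENTS, ["/user/root/customers"]):
        print(f"ALERT: {event.username} {event.operation_type} {event.path}")
```

Matching on `p + "/"` rather than a bare string prefix keeps `/user/root/customers_public` from being flagged as part of `/user/root/customers`.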
  • Audit Architecture Log4j Appender