DISCOVER with Data Steward Studio: Understanding and unlocking the value of data in hybrid enterprise data lake environments

© Hortonworks Inc. 2011–2018. All rights reserved
Srikanth Venkat, Senior Director, Product Management, Hortonworks Inc.
Hemanth Yamijala, Principal Engineer, Hortonworks Inc.

Presenters
Hemanth Yamijala
Principal Engineer, Hortonworks
Hortonworks DataPlane Services, Data Steward Studio, Apache Atlas
@yhemanth https://www.linkedin.com/in/yhemanth/
Srikanth Venkat
Senior Director of Product Management, Hortonworks Inc.
Security & Governance portfolio products & services
Apache Ranger, Apache Atlas, Apache Knox, Platform Security, & Hortonworks DataPlane Service – Data Steward Studio (DSS)
@srikvenk https://www.linkedin.com/in/srikanthvenkat/

Next Generation Data Problems
• My data is spread across multiple clusters and data sources
• I store and analyze data from ERP/CRM systems, IoT/mobile devices, social media, geolocation, etc.
• Some of my data is on-premises, some is in the cloud. I move my data from cloud to on-premises and vice versa, and between different clouds

Forrester Calls It Data Fabric
“Bringing together disparate big data sources automatically, intelligently,
and securely and processing them in a big data platform technology, using
data lakes, Hadoop, and Apache Spark to deliver a unified, trusted, and
comprehensive view of customer and business data.”

Data Steward Studio
Global Intelligent Data Catalog & Business Glossary
Curate & Organize Assets, Asset 360 Single View, Data & Metadata Security
[Diagram: Data Steward Studio spanning clusters in the cloud and on premises (Cluster 1 Dublin, Cluster 2 Las Vegas, Cluster 3 Bangkok), holding structured and unstructured data, with Apache Ranger on the clusters]

Dataplane: Bring All Data Under Management
From the edge, through movement, to rest
Hortonworks DataPlane Service
a foundational platform for the delivery of data
solutions that will:
• Support enterprise hybrid deployment strategy
and adoption of cloud
• Common Metadata, Security and Governance
across all deployments
• Simplified enterprise data asset management
• Extensible to new services: Services enablement
layer for rapidly bringing new solutions to market
[Diagram: Hortonworks DataPlane Service manages, secures, and governs multiple clusters and sources (multi / hybrid): data at rest in Hortonworks Data Platform and data in motion in Hortonworks Data Flow]

Data Governance: It’s a team sport!
• Sponsor: Champions data governance across enterprise
• Data Owner: Accountable for all data generated by an agency
• Data Steward: Manages business requirements for data sharing
• Data Curator: Implements business data requirements
• Business Data SME: Supports the Data Steward in data related activities
• Data Council: Coordinate cross-agency data management activities

Goals

Data Steward Studio Demo

Hortonworks DataPlane Service (DPS)

Organize Your Data Assets as Collections
• Data Asset Collections: an organizational
construct that groups heterogeneous data
assets by business definition
• Create asset collections and attach
metadata
• Contextual attributes: Name,
Description, Owner, Datalake
• System attributes: Created-on,
Modified-on, Modified-by, Created-by,
Version
• Search for assets using attribute facets or
free text
• View personalized dashboard of asset
collections
• Delete/update data asset collections
• Asset 360 view for assets in collection
Asset Collections
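The attribute lists above suggest a simple record shape. The following is a minimal sketch of that shape, not the DSS data model: the class name, the asset id format, and the `search` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class AssetCollection:
    # Contextual attributes named on the slide.
    name: str
    description: str
    owner: str
    datalake: str
    # Asset identifiers; the id format is an assumption.
    asset_ids: List[str] = field(default_factory=list)
    # System attributes named on the slide.
    created_by: str = ""
    modified_by: str = ""
    created_on: datetime = field(default_factory=_now)
    modified_on: datetime = field(default_factory=_now)
    version: int = 1

    def add_asset(self, asset_id: str, user: str) -> None:
        """Attach an asset once and update the system attributes."""
        if asset_id not in self.asset_ids:
            self.asset_ids.append(asset_id)
            self.modified_by = user
            self.modified_on = _now()
            self.version += 1


def search(collections, text):
    """Free-text search over name and description (case-insensitive)."""
    needle = text.lower()
    return [c for c in collections
            if needle in c.name.lower() or needle in c.description.lower()]
```

Adding the same asset twice leaves the version unchanged, so the version counts actual changes rather than calls.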

Discover and Fingerprint your Data Assets
• Computes Profile for data assets
as they are ingested or created
within the platform. Automatically
determines types of columns
based on data values
• Generates key metrics for data in
each column. Various
visualizations can be utilized (Box
plots, Histograms, Pie charts) to
view metrics
• Persists profile information in
cluster
• As more data is added, profilers
can be scheduled to run again and
update the profile metadata for
the asset.
Data Profiler
Column Statistical Profiler
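As an illustration of the idea of inferring column types from values and computing per-column metrics, here is a small sketch in plain Python. It is not the DSS profiler (which runs as Spark jobs); the function names and the choice of metrics are assumptions.

```python
import statistics
from collections import Counter


def _present(values):
    """Drop empty and missing values."""
    return [v for v in values if v not in ("", None)]


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    present = _present(values)
    if not present:
        return "unknown"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Compute basic metrics for one column of string values."""
    present = _present(values)
    col_type = infer_type(values)
    profile = {
        "type": col_type,
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if col_type in ("int", "float"):
        nums = [float(v) for v in present]
        profile.update(min=min(nums), max=max(nums),
                       mean=statistics.mean(nums),
                       median=statistics.median(nums))
    else:
        # Most frequent values, the raw material for a pie chart or histogram.
        profile["top_values"] = Counter(present).most_common(3)
    return profile
```

Numeric columns get the summary values a box plot needs; other columns get their most frequent values.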

Know your Sensitive Data
• Automatically detect and
profile sensitive & personal
data
• Attach classification
annotations for sensitivity
• Manual approval and curation
of sensitive data
classifications
• Leverage classification based
data protection
• Sensitive data dashboard on
Asset 360
Sensitive Data Profiling

Track your Sensitive Data
• IBAN (27 EU Countries)
• Credit Card Numbers
• Email
• Telephone (AMER, EU)
• IP Address
• URL
• Passport (12 EU Countries)
• National ID (19 EU Countries)
• Australian Drivers License
• Australian Passport
• Australian National ID
Sensitive Data Types
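A few of the listed types can be recognized with pattern checks. The sketch below covers email, credit card numbers, and IP addresses only; the patterns are simplified assumptions and are not the regexes shipped with DSS. Credit card candidates are additionally checked with the Luhn checksum to reduce false positives.

```python
import ipaddress
import re

# Simplified patterns, written for illustration only.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
CARD_RE = re.compile(r"^\d(?:[ -]?\d){12,18}$")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit counted from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def is_ip_address(value: str) -> bool:
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def detect(value: str):
    """Return a sensitive-type label for a single value, or None."""
    value = value.strip()
    if EMAIL_RE.match(value):
        return "email"
    if CARD_RE.match(value) and luhn_valid(value):
        return "credit_card"
    if is_ip_address(value):
        return "ip_address"
    return None
```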

Track Your Data Asset – Lineage and Impact
• Consolidated Upstream lineage and
downstream impact
• Detailed click-through to asset properties
Data Lineage and Impact

View Security Policies for your Data Assets
• View security policies on
data assets
• View classification based
policies on assets
Security Policies

Monitor Usage of your Data Assets
• Dashboard for access patterns and
trends for each asset
• Examples:
• Count of Access Events
• Top N Users over Time
• Most recent trail of access audit
events
Audit and Monitoring

Data Steward Studio Technical Overview

The DPS Ecosystem
DPS Platform apps: Data Lifecycle Manager, Data Steward Studio*, Data Analytics Studio*, Streams Messaging Manager
Data Plane Services: Authentication, Role-based access, Service lifecycle management, Cluster registration, Cluster service discovery and access
HDP/HDF Cluster: DLM Engine, Profiler Service, Profile Manager, Data Analytics Studio Service

DSS Fit into DPS / HDP Ecosystem
DPS Platform: Data Steward Studio*
Data Plane Services (DataPlane)
HDP/HDF Cluster: Hive Metastore, Atlas, Ranger, Profiler Service, Spark/Livy, HDFS

DSS Architecture
[Architecture diagram; components shown: Data Steward Studio (Angular UI, Play web application), Postgres DB, Knox, Profiler Service, Postgres / MySQL DB, Livy Batch, Profilers <Spark jobs>, Livy Interactive, Hive / HDFS data, summary files on HDFS, Atlas, Ranger]

Profiler Agent
● A generic framework supporting the registration, scheduling and management of
data processing jobs that help with data discovery, classification and analysis
● APIs exposed for:
○ Registering and management of profilers and profiler instances
○ Configuring and scheduling profilers
○ Monitoring status
○ Managing serving of interactive metrics
● Goals for profiler agent
○ Extensibility (support new profilers, new asset types etc)
○ Performance and scalability

Profiler Agent Design
[Diagram; components shown: Asset Source, Asset Filters, Asset Selector, Priority rules, Profile Queue, Queue stats, Job Scheduler, Profiler Job, Profiler Metrics]
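One way to read the design components is as a pipeline: assets from a source pass through filters, a priority rule orders them in a queue, and the scheduler pulls batches to hand to profiler jobs. The sketch below illustrates that reading only; the names, the filter and priority functions, and the batch model are assumptions rather than the agent's actual design.

```python
import heapq
import itertools


def select_assets(assets, filters):
    """Asset selector: keep only the assets that pass every filter."""
    return [a for a in assets if all(f(a) for f in filters)]


class ProfileQueue:
    """Priority queue of assets; a lower priority value is profiled first."""

    def __init__(self, priority_rule):
        self._rule = priority_rule
        self._heap = []
        # Unique counter: keeps insertion order for ties and means the
        # asset dicts themselves are never compared.
        self._tie = itertools.count()

    def submit(self, asset):
        heapq.heappush(self._heap, (self._rule(asset), next(self._tie), asset))

    def next_batch(self, size):
        """Pop up to `size` assets for the job scheduler."""
        batch = []
        while self._heap and len(batch) < size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

    def stats(self):
        return {"pending": len(self._heap)}
```

For example, a priority rule of `lambda a: -a["last_profiled_days"]` (a hypothetical asset field) would profile the stalest assets first.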

DSS - Basic profilers
Profiler: Purpose
Sensitivity Profiler
● Identifies sensitive information in Hive tables based on regexes and
column headers
● Aimed at GDPR-like use cases
Table Stats Profiler
● Calculates statistical summaries on Hive table columns for
understanding the ‘shape’ of data
Audit Profiler
● Aggregates usage statistics of tables by different facets like user,
asset name, access status, time, etc.
Hive Metastore Profiler
● Provides metadata about all tables

Sensitive Data Profiling
● 70+ regexes covering national ID numbers and other PII (email, credit card, etc.) are shipped and loaded
into HDFS
● The sensitive data profiler runs these regexes against a sample of Hive tables
● Match scoring gives 85% weight to the data values and 15% to the name of the column
● Tag workflow
○ Matched columns are tagged in Atlas with an attribute state ‘suggested’
○ In Asset-360 view, Data steward can ‘accept’ or ‘reject’ tags - the state is written back to Atlas
○ Can define Ranger policies against these tags (with attribute state = ‘accepted’) for governance of these columns

Asset collections
● Metadata search on Atlas attributes;
all queries go through the Atlas DSL
API. E.g.:
○ All Hive tables with name ‘customer’
○ All Hive tables created before a given date
○ All Hive tables created before a given date and
owned by ‘admin’
● Search -> Gather -> Save workflow
● Collaboration features
○ Comments, Ratings, Favorites
● All data stored in Dataplane DB

Datalake & Asset collection Dashboards
● Three profilers contribute information:
○ Metastore profiler: Captures the list of Hive tables
created on a daily basis
○ Sensitivity profiler: Summary of tables with
sensitive information for generating aggregates
○ Audit profiler: Summary of Ranger audit logs
stored in HDFS
● All profiler information is stored as serialized
files in HDFS
● Queries for dashboard widgets are served
by Livy interactive sessions reading this
information.
● Livy session management is done by profiler
agent.
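Serving a widget query through a Livy interactive session involves two REST calls: creating a session and posting a statement to it. The sketch below only builds those requests and sends nothing. The host, the summary path, the Parquet format, and the `user` column are hypothetical; the talk says only that profiler output is stored as serialized files in HDFS.

```python
import json

LIVY_URL = "http://livy.example.com:8999"  # hypothetical host


def create_session_request():
    """Request that would open an interactive PySpark session."""
    return ("POST", f"{LIVY_URL}/sessions", json.dumps({"kind": "pyspark"}))


def statement_request(session_id, summary_path, top_n=10):
    """Request that would run one dashboard query inside the session.

    The file format (Parquet) and the 'user' column are assumptions.
    """
    code = (
        f"df = spark.read.parquet({summary_path!r})\n"
        f"df.groupBy('user').count().orderBy('count', ascending=False)"
        f".limit({top_n}).toJSON().collect()"
    )
    return ("POST", f"{LIVY_URL}/sessions/{session_id}/statements",
            json.dumps({"code": code}))
```

Reusing one long-lived session for many widget queries avoids paying Spark startup cost per query, which is presumably why session management sits in the profiler agent.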

Asset 360
● Like a ‘Facebook profile’
page for a Hive table
○ Metadata (number of
rows/columns/sensitivity
distribution)
○ Lineage
○ Access summaries
○ Schema & Data
distributions - shape of
data
○ Applicable Ranger
policies
○ Ranger audit logs

Ranger Audit Profiler
● Prerequisites:
○ Ranger audit logging to HDFS should be enabled for Hive
○ A Ranger policy allowing the dpprofiler user access to the Ranger audit directory is required.
■ This is added by the product automatically at MPack install
● 2 jobs are run:
○ The first analyzes the previous day’s logs
○ The second analyzes the current day’s logs at a shorter interval
● Provides info about:
○ Top users
○ Top assets
○ Authorized vs Unauthorized accesses
○ Above is joined with other information to get:
■ Top asset collections
■ Top sensitive assets being used, ...
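The aggregates listed above can be illustrated on JSON-lines audit records. This is a sketch, not the Spark job DSS runs: the field names `reqUser`, `resource`, and `result` (1 meaning allowed) are assumptions about the audit record layout, and the collection membership map is hypothetical.

```python
import json
from collections import Counter


def summarize_audits(lines, collections, top_n=3):
    """Aggregate audit events by user, asset, outcome, and asset collection.

    `collections` maps a collection name to the set of resources it contains.
    """
    users, assets = Counter(), Counter()
    outcomes, by_collection = Counter(), Counter()
    for line in lines:
        event = json.loads(line)
        users[event["reqUser"]] += 1
        assets[event["resource"]] += 1
        outcome = "authorized" if event["result"] == 1 else "unauthorized"
        outcomes[outcome] += 1
        # Join with collection membership to rank collections by usage.
        for name, members in collections.items():
            if event["resource"] in members:
                by_collection[name] += 1
    return {
        "top_users": users.most_common(top_n),
        "top_assets": assets.most_common(top_n),
        "outcomes": dict(outcomes),
        "top_collections": by_collection.most_common(top_n),
    }
```

The same join, applied to the set of assets carrying accepted sensitivity tags instead of collection members, would give the top sensitive assets in use.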

Data Steward Studio recap
Assets & Collections
Audit and Monitoring Dashboard
Security Policies
Data Lineage and Impact
Data Profiler
Data Steward Studio (DSS) Capabilities
• Profile data to understand its shape and structure
• Organize and curate data, e.g. by the domains it belongs to or by data usage
• Identify sensitive data
• Collaborate with broader teams on how data needs to be used and by whom, and provide community ratings for crowdsourcing knowledge
• Monitor ongoing usage, visualize chain of custody and trustworthiness for longer-term use, understand data protection
DSS provides the “tooling” part of the People, Processes, and Technology required for Hybrid Data Lake Governance

Check Out These Sessions:
What Is New In Apache Atlas 1.0?
When: Wednesday June 20, 11:00 AM - 11:40 AM
Where: Grand Ballroom 220B
Overview of New Features in Apache Ranger
When: Wednesday June 20, 2:00 PM - 2:40 PM
Where: Executive Ballroom 210B/F
GDPR Crash Course
When: Wednesday June 20, 3:00 PM - 6:00 PM
Where: Meeting Room 212C/D
Birds of a Feather: Security & Governance
When: Wednesday June 20, 5:40 PM - 6:50 PM
Where: Executive Ballroom 210B/F
GDPR-Focused Partner Community Showcase for Apache Ranger and Apache Atlas
When: Thursday June 21, 9:30 AM - 10:10 AM
Where: Meeting Room 230A

Thank you!
